Scraping web page tables with R


March Madness is coming to a close this weekend. It’s been a great tournament so far and has again demonstrated why it’s the best sporting event around! I have one more March Madness-related post to share before, I suppose, I must move on as well.

A few years ago, an internal Microsoft competition drove me to create an ML model for picking tournament games. Kaggle also runs an annual competition, and I take a flyer on that most years. While looking for historical data that could provide potential features, I came across Ken Pomeroy and his site KenPom.com. If you don’t know Ken’s back story of going from meteorologist to highly sought-after college basketball analytics expert, check out https://en.wikipedia.org/wiki/Ken_Pomeroy. It’s pretty interesting!

On the KenPom.com home page, there is a table of per-team advanced metrics going back to 2002. All 353 D1 teams are there with about 20 columns of data… not something you are going to transfer manually. Fortunately, I came across some R code that could scrape data from HTML tables and push it into a data frame. The code below scrapes these tables, adds a Year column, and aggregates everything into a single CSV file.

library(rvest)    # read_html(), html_nodes(), html_table()
library(stringr)  # str_detect()
library(dplyr)    # filter() and the %>% pipe

# Scrape one season's ratings table from kenpom.com and return it as a data frame.
# Defined first so the script can be sourced top to bottom.
getKenPomYearData <- function(year)
{
    theUrl <- paste0("https://kenpom.com/index.php?y=", year)
    page <- read_html(theUrl)
    tables <- page %>% html_nodes("table") %>% html_table()
    data <- as.data.frame(tables[[1]])

    colnames(data) <- c("Rank", "Team", "Conf", "Record", "AdjEM", "AdjO",
                        "AdjO_R", "AdjD", "AdjD_R", "AdjT", "AdjT_R",
                        "Luck", "Luck_R", "SoS_AdjEM", "SoS_AdjEM_R",
                        "OppO", "OppO_R", "OppD", "OppD_R", "NC_AdjEM", "NC_AdjEM_R")

    data <- data %>% filter(!str_detect(Rank, "Rk")) # Remove the header rows repeated mid-table.
    data <- data %>% filter(nchar(as.character(Rank)) > 0) # Remove empty spacer rows.
    data$Year <- year

    return(data)
}

kenPomData <- data.frame()
inputPath <- "2019Data\\" # Location for storing the data to be used as an input to the ML work.

# For training purposes, get all data before the current year. Aggregate into a single data frame.
for (year in 2002:2018)
{
    kenPomYear <- getKenPomYearData(year)
    kenPomData <- rbind(kenPomData, kenPomYear)
}

write.csv(kenPomData, paste0(inputPath, "KenPomDataRaw.csv"), row.names=FALSE)

# For the features needed for predicting the current year, get just the current year's data.
kenPom2019 <- getKenPomYearData(2019)
write.csv(kenPom2019, paste0(inputPath, "KenPomDataRaw2019.csv"), row.names=FALSE)

My work to prep the data for ML training is not done after acquiring the data, however. I have another 80 lines of R code that munges the KenPom data and joins it with a Kaggle data file containing tournament results going back to 1985. This involves some typical data prep challenges: parsing strings, converting types, and joining data sets on string keys that don’t come from the same origin (“N. Dakota State” vs. “NDSU” vs. “North Dakota State University”, for example); a sketch of one way to handle that last problem follows below. Once that is done, the ML training and/or prediction work can begin.
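To illustrate the name-matching challenge, one common approach is a small alias (crosswalk) table that maps each source’s spelling to a canonical name before joining. This is just a minimal sketch, not my actual prep code; the teamAliases table, the normalizeTeams() helper, and the sample rows are all hypothetical.

library(dplyr)

# Hypothetical crosswalk: every known spelling mapped to one canonical name.
teamAliases <- data.frame(
    Alias     = c("N. Dakota State", "NDSU", "North Dakota State University"),
    Canonical = rep("North Dakota State", 3)
)

# Replace any aliased team name with its canonical form; names without an
# alias entry pass through unchanged.
normalizeTeams <- function(df, aliases)
{
    df %>%
        left_join(aliases, by = c("Team" = "Alias")) %>%
        mutate(Team = coalesce(Canonical, Team)) %>%
        select(-Canonical)
}

# Hypothetical one-row extracts from each source, joined on the cleaned key.
kenPomSample <- data.frame(Team = "N. Dakota State", AdjEM = 11.3)
kaggleSample <- data.frame(Team = "NDSU", TourneyWins = 1)
joined <- inner_join(normalizeTeams(kenPomSample, teamAliases),
                     normalizeTeams(kaggleSample, teamAliases),
                     by = "Team")

Fuzzy string matching (the stringdist package, for instance) can help surface candidate matches, but a hand-reviewed lookup table like this keeps the joins deterministic.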

I’ll save describing the ML work for next year’s tourney. I hope you enjoyed March Madness and learned a thing or two about using R and Power BI along the way in these posts. Thanks for reading!

Picture Details: “Spring-ing”, April 7, 2019, Canon PowerShot SD4000 IS, f/4, 1/1000 s, ISO-125