Aggregating spreadsheets with R

20170221_143828688_iOS

While I wouldn’t call myself a power user, I know my way around Excel. Or at least I used to.

When I started learning about data science about 4 years ago, the first book I bought was John Foreman’s Data Smart: Using Data Science to Transform Information Into Insight ( http://www.john-foreman.com/data-smart-book.html). A primary reason was that most of the explanation was done using Excel, an environment that I knew.

Excel use tends to grow organically.  It starts with using it to enter or collect some data.  Then you have a need to add a formula or two to sum up some numbers.  Then you want to add in more information so you add more sheets with vlookup() formulas to bring the information together.  And then you need to visualize the data so you build a few charts.  Before you know it, you have a file with 17 sheets and a myriad of formulas with implicit dependencies.  And there’s likely only one person in the world who understand how the whole mess works.

Today, Excel has a much narrower set of uses for me in my toolkit. For data manipulation, I use R. For visualization, I go for Power BI. Excel still gets some use for data input and some interactive exploration, but R and Power BI are much better if code or visuals are involved. I will walk you through a two part example which demonstrates why this is the case.

March Madness

I run a unique NCAA March Madness pool each year. Instead of picking the entire bracket, players choose 8 teams that can score points for them with wins throughout the tournament. The scoring system results in more points for wins in later rounds and for lower seeds. Each seed has a price tag associated with it and each player has a budget of a mythical $50 to spend. The system forces each player to come up with a strategy on how to pick the teams given the price tags and budget. Here are my picks for this year:

NCAASelections

It’s not a huge pool as there have been 13 players in the past couple of years. The spreadsheet shown above is sent to the prospective players after Selection Sunday.  Excel works great for this as I’ve created a template with a few formulas to look up the valid teams for the tourney, copy in the Seed and the Price, and then count the teams and the total Price.  The spreadsheets are returned to me via e-mail before the first game tips off.

Now I could open all 13 spreadsheets and copy the content to an aggregated spreadsheet, but I also could write a few lines of R code to pull this data in.  The code I use is shown below.  In summary, it does three things:

  1. Find all the Excel files in a particular folder.
  2. Loop over these files, extract the table shown above as a data frame, and append each individual data frame to an aggregate data frame.
  3. Write the aggregated data out to a comma separated value (CSV) file.
inputPath <- "2019\\"

filenames <- list.files(paste0(inputPath,"Submissions"), pattern="*.xlsx", full.names=T)

# Initialize vectors and create a dataframe for the table.
Submission.Your.Name <- character()
Submission.Team <- character()
Submission.Seed <- numeric()
Submission.Price <- numeric()

all <- data.frame(Submission.Your.Name, Submission.Team, Submission.Seed, Submission.Price, Submission.Region)

# Loop over files, read excel files, extract info, update all. 
for (i in 1:length(filenames))
{
   filename <- filenames[i]
   print(paste("Processing file", filename)) 
   sheetNames <- readxl::excel_sheets(filename)
   
   # There may be spreadsheets that aren't Submissions in the folder.
   if (sheetNames[1] != "Submission") next        
   
   # Read info from all sheets.  I don't need to do this here, but left it as an FYI.
   sheets <-    lapply(sheetNames, function(X) readxl::read_excel(filename, sheet = X))  
   names(sheets) <- sheetNames
   
   # Populate the summary table with a temporary data frame and append it to all. 
   submissionSheet <- as.data.frame(sheets[1])
   thisSubmission <- submissionSheet[1:8, 1:4] 
   
   all <- rbind(all, thisSubmission)
}

# Checks on team count and submission amount.
table(all$Submission.Your.Name)
tapply(all$Submission.Price, all$Submission.Your.Name, sum)

# Write out result.
write.csv(all, paste0(inputPath, "2019AllSubmissions.csv"), row.names = F)

This effort may not be worth it for 13 files with a single sheet of contiguous cells, but it gets a lot more interesting with many files and many sheets.  For those in the Dynamics 365 F&O ISV ecosystem, I used this approach to aggregate the CARs (Customization Analysis Reports) that are part of the AppSource curation process.  Each CAR was generated by a tool that the ISV needed to run on each model in the solution.  The report provided information on overlayering (back when you could overlayer!), extensions, best practices, etc. This exercise was repeated for each major upgrade.

Over multiple iterations, we had the need to process nearly 1,000 spreadsheets to gather data from several worksheets in each report.  The 200 lines of R code that I wrote for this paid off in a major way as there was no reasonable manual way to aggregate this information.

In my next post, I’ll describe how I use the output from this exercise along with the results from each game to produce a Power BI report for the pool.

Picture Details:  Park Rapids, 2/21/2017, iPhone 7, f1.8, 1/1916 s, ISO-20