The March Madness NCAA Men’s Basketball Tournament is my favorite sporting event of the year. Part of the enjoyment comes from competing in pools with friends, family, and co-workers. I’ve also used it as an excuse to learn more about data science by experimenting with machine learning techniques over the last couple of years.
For my Data Visualization class (DS 745), we needed to iterate and improve a visualization. The subject of the visualization was up to the student. Since this class happened about this time last year, March Madness was an obvious choice for me.
The goal of this project was to provide insights based on the only information that many people have when filling out their bracket: the seeds of the teams in each region. My hypothesis was that the structure of the matchups would cause certain seeds to perform better over time. In addition to total wins for each seed, I wanted to highlight Final Four appearances and championships, since pools commonly emphasize them.
We spent a fair amount of time in the class reading Edward Tufte’s intriguing work on visualization. Inspired by this work, the following criterion guided these plots:
“Create displays of evidence for making decisions that give the viewer the greatest number of ideas with the least ink in the smallest space.”
Data and Tools
Fortunately, my experience with March Madness competitions on Kaggle provided me with a dataset (https://www.kaggle.com/c/march-machine-learning-mania-2017/data) that could be reshaped to provide the required information. A short R script transformed this data into the format that I wanted: a 2,016-row CSV file with details on each game won in the tourney from 1985 to 2016.
The two tools that I used were R and Power BI. In addition to the data manipulation, I used R for the box plot in the first iteration. I built the plot in R first and then used Power BI’s embedded R capability to keep all of the visualizations in one place.
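The reshaping step can be sketched roughly as follows. This is a Python illustration rather than the R script I actually used, and the column names (Season, Seed, Team, Wteam, Lteam) and the tiny inline files are assumptions made for the example, not the real Kaggle data.

```python
import csv
import io

# Hypothetical miniature versions of the seed and result files. The column
# names are assumptions about the layout, and the rows are made up.
SEEDS_CSV = """Season,Seed,Team
2016,W01,1001
2016,W16,1002
2016,W08,1003
2016,W09,1004
"""

RESULTS_CSV = """Season,Daynum,Wteam,Wscore,Lteam,Lscore
2016,136,1001,80,1002,60
2016,136,1004,70,1003,65
2016,138,1001,75,1004,72
"""


def seed_number(seed_code):
    """Turn a code such as 'W01' or 'X16a' into the integer seed (1-16)."""
    return int(seed_code[1:3])


def games_won_with_seeds(seeds_text, results_text):
    """Return one row per tournament game, tagged with both teams' seeds."""
    seed_of = {
        (row["Season"], row["Team"]): seed_number(row["Seed"])
        for row in csv.DictReader(io.StringIO(seeds_text))
    }
    rows = []
    for game in csv.DictReader(io.StringIO(results_text)):
        season = game["Season"]
        rows.append({
            "season": int(season),
            "winner_seed": seed_of[(season, game["Wteam"])],
            "loser_seed": seed_of[(season, game["Lteam"])],
        })
    return rows


rows = games_won_with_seeds(SEEDS_CSV, RESULTS_CSV)
for row in rows:
    print(row)

# Size check: 63 games per 64-team bracket over the 32 tourneys, 1985-2016.
expected_rows = 63 * (2016 - 1985 + 1)
print(expected_rows)  # 2016
```

The last two lines show where the 2,016 rows come from: a 64-team bracket has 63 games, and there were 32 tourneys from 1985 through 2016.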
Iteration 1 – Box Plot Using Embedded R in Power BI
Box plots are a great way to quickly compare the distributions of different categories. They show five summary values per category (minimum, 25th percentile, median, 75th percentile, and maximum) along with outliers. In this iteration, I chose to show the distribution of the number of wins per seed per year. To interpret the plot, the #1 seed box indicates that the median wins per year is 13 for the four #1 seeds combined (the line in the box). The typical range (25th percentile to 75th percentile, the box itself) is between 12 and 16 wins. On occasion, #1 seeds have won as few as 8 total games and as many as 19 games (the whiskers).
A quick review of the following plot will show you that #1 seeds dominate when looking at wins per year, the boxes for seeds 5 and 6 are identical, and #9 seeds perform worse than the surrounding seeds. Seeds 8 and 10 have each had an outlier year where they won significantly more games than in other years.
The downside to box plots is that they are unfamiliar to most readers.
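To make the five values behind each box concrete, here is a minimal Python sketch (not the embedded R used for the actual plot). The yearly win totals are made up for illustration, and the sketch skips the rule that plotting tools use to flag outliers, so the whiskers here are simply the minimum and maximum.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Made-up yearly win totals for one seed line (not the real data).
wins_per_year = [8, 12, 12, 13, 13, 14, 16, 16, 19]

summary = five_number_summary(wins_per_year)
print(summary)  # (8, 12.0, 13.0, 16.0, 19)
```

Read against a box plot, the first and last values are the whisker ends, the second and fourth are the edges of the box, and the middle value is the line inside the box.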
Iteration 2 – Bar charts
Visualizing the information with bar charts required three separate charts: total tourney wins by seed, Final Four appearances by seed, and tourney championships by seed. If your goal is to pick the tourney champion, this view has more information (hint: choose a #1 seed!).
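The three bar charts are three different counts over the same list of games. A rough Python sketch, assuming each game row carries a round number from 1 to 6 (a hypothetical field) and using made-up rows: a win in round 4 (the regional final) puts a team in the Final Four, and a win in round 6 is a championship.

```python
from collections import Counter

# Hypothetical rows of (round, winner_seed); rounds are numbered 1-6 by
# assumption, and the values are illustrative only.
games = [
    (4, 1), (4, 1), (4, 2), (4, 7),  # four regional finals
    (5, 1), (5, 2),                  # two national semifinals
    (6, 1),                          # title game
]

total_wins = Counter(seed for _, seed in games)
final_four_appearances = Counter(seed for rnd, seed in games if rnd == 4)
championships = Counter(seed for rnd, seed in games if rnd == 6)

print(dict(total_wins))              # {1: 4, 2: 2, 7: 1}
print(dict(final_four_appearances))  # {1: 2, 2: 1, 7: 1}
print(dict(championships))           # {1: 1}
```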
Iteration 3 – Line chart
Iteration 3 uses a line chart to maximize the amount of information shown in a single graph. It essentially combines the three bar charts, showing total tourney wins as well as progress through the rounds of the tourney. You can see that the #1 seeds have the most wins in every round of the tourney, as expected.
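The data behind a chart like this is one series per seed, with one point per round. A small Python sketch of that pivot, assuming rounds numbered 1 to 6 and using made-up games (the chart itself was built in Power BI, not with this code):

```python
from collections import defaultdict

# Hypothetical (round, winner_seed) pairs; rounds numbered 1-6 by assumption.
games = [(1, 1), (1, 1), (1, 8), (1, 12), (2, 1), (2, 12), (3, 1)]

ROUNDS = range(1, 7)


def wins_by_round(games):
    """Map each seed to a list of win counts, one entry per round."""
    table = defaultdict(lambda: [0] * len(ROUNDS))
    for rnd, seed in games:
        table[seed][rnd - 1] += 1
    return dict(table)


series = wins_by_round(games)
for seed in sorted(series):
    print(seed, series[seed])
```

Each list is one line on the chart. Summing a list gives that seed's total wins, and the entries for the later rounds carry the Final Four and championship counts.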
Power BI’s interactive filtering can simplify the graph down to interesting subsets.
Seeds 8 and 9 perform worse in the second round of the tourney than seeds 10, 11, and 12. This is because the brackets force the winner of the 8 vs. 9 first round game to face a number 1 seed in the second round. However, if the number 8 seed (purple line) beats the number 1 seed, they have a chance to advance a long way in the tourney.
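The bracket structure behind this can be generated in a few lines. The Python sketch below builds a 16-team region so that first-round seeds always sum to 17, then lists who each seed can meet in the second round. The function names are my own for this illustration.

```python
def bracket_order(size):
    """Return seeds in bracket order for a single-elimination region."""
    order = [1]
    while len(order) < size:
        field = len(order) * 2
        # Pair each seed with the one that makes the pair sum to field + 1.
        order = [s for seed in order for s in (seed, field + 1 - seed)]
    return order


def second_round_opponents(order):
    """Map each seed to the seeds it could meet in the second round."""
    opponents = {}
    for i in range(0, len(order), 4):
        game_a, game_b = order[i:i + 2], order[i + 2:i + 4]
        for seed in game_a:
            opponents[seed] = sorted(game_b)
        for seed in game_b:
            opponents[seed] = sorted(game_a)
    return opponents


order = bracket_order(16)
opponents = second_round_opponents(order)
for seed in (8, 9, 10, 11, 12):
    print(seed, opponents[seed])
# 8 [1, 16]
# 9 [1, 16]
# 10 [2, 15]
# 11 [3, 14]
# 12 [4, 13]
```

The 8 and 9 seeds are the only ones in this group whose second-round path runs through the #1 seed's game, while seeds 10, 11, and 12 would meet the 2, 3, or 4 seed (or the team that upset it).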
I hope this gives you a bit more insight when picking your bracket. That said, picking based on team mascot or colors seems to be as good a route as others most years!! Good luck!
Picture Details: “Minnesota Spring Break”, March 11, 2019, Canon PowerShot SD4000 IS, F4.5, 1/800 s, ISO-160