The recently completed March Madness tournament (NCAA Division 1 Men’s College Basketball) has long been the favorite sporting event for this somewhat sports-obsessed fan. Participating in pools for the tournament leads to more engagement in the games, bringing in new teams to cheer for and games to follow. The tourney is three extended weekends of upsets, exciting finishes, and Cinderella stories. This year was certainly no exception.
For as much as I like March Madness, I don’t pay much attention to the basketball season leading up to the tourney. Other than following the Minnesota Gophers and the NDSU Bison, I don’t watch games or pay much attention to the polls. So when Selection Sunday happens and I need to make my pool selections, I’m usually at a loss for how to go about it.
That changed a few years ago when I decided to feed my data science learning by building a machine learning model to predict the tournament. Fortunately, a Microsoft internal competition and the annual Kaggle.com competition provided the fuel and some key structure. This post discusses a few takeaways from my annual March Madness ML pursuits.
Takeaway 1 – the problem setup and available data are essential
While there are many different March Madness pool formats, each requires the player to pick the winners of games between two teams. Since you don’t know how the tournament will play out, you need to predict a lot of games before the tournament starts.
The annual Kaggle.com March Madness competition solves this by requiring you to predict the outcome of every team against every other team in the tournament. The competition also requires you to provide a win probability for each matchup. Since binary classifiers fundamentally produce this probability, that’s an easy thing to supply in addition to choosing the winning team.
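As a rough sketch of what “every team against every other team, with a probability” can look like in practice, here is a minimal Python example. The `model` and `team_features` names are placeholders rather than my actual pipeline; any scikit-learn-style classifier that exposes `predict_proba` fits this pattern.

```python
# Minimal sketch: score every possible matchup with a fitted binary classifier.
# `model` and `team_features` are placeholder names, not the real pipeline.
from itertools import combinations

import pandas as pd


def score_all_matchups(model, team_features: pd.DataFrame) -> pd.DataFrame:
    """team_features: one row of numeric features per tournament team."""
    rows = []
    for team_a, team_b in combinations(team_features.index, 2):
        # Describe the matchup as the difference between the two teams' features.
        x = (team_features.loc[team_a] - team_features.loc[team_b]).to_frame().T
        p_a_wins = model.predict_proba(x)[0, 1]  # probability (class 1) that team_a wins
        rows.append({"team_a": team_a, "team_b": team_b, "p_a_wins": p_a_wins})
    return pd.DataFrame(rows)
```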
Kaggle also provides a solid base of historical tournament data. That’s a great starting point for creating the input data for an ML model, especially the ‘labels’ for which team won going back multiple decades.
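For illustration, the labeling step can be as simple as the snippet below. It assumes Kaggle’s usual compact tournament-results layout (one row per historical game, with winning and losing team IDs); the file and column names are illustrative and can differ by year.

```python
# Illustrative labeling step using Kaggle's compact tournament results.
# File and column names reflect the usual layout and may differ by year.
import pandas as pd

results = pd.read_csv("MNCAATourneyCompactResults.csv")

# Order each game as (team_a, team_b) by team ID so the label isn't
# trivially "the winner is always listed first".
games = pd.DataFrame({
    "season": results["Season"],
    "team_a": results[["WTeamID", "LTeamID"]].min(axis=1),
    "team_b": results[["WTeamID", "LTeamID"]].max(axis=1),
})
games["team_a_won"] = (results["WTeamID"] == games["team_a"]).astype(int)
```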
The next challenge for me was to find additional data that could provide tournament insight that wasn’t simply aligned with the panel of experts that does the tournament seeding. There’s no chance that I have insight the experts don’t have, so how can I do better than just picking the better seed in each game? Fortunately, I stumbled across Ken Pomeroy’s advanced metrics at the 2021 Pomeroy College Basketball Ratings (kenpom.com). The stack rank of teams on his site didn’t necessarily align with the seeds, so I figured these ratings could be a complementary source. Equally important, his site has historical data going back to 2002 that I could easily scrape.
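The scraping itself doesn’t need to be fancy. Here is a sketch of one way to pull the yearly ratings tables; it assumes the page still renders as an HTML table that pandas can parse, which may not hold as the site changes, and it skips niceties like rate limiting.

```python
# Rough sketch of pulling a yearly KenPom ratings table. Assumes the page
# renders as an HTML table pandas can parse; markup and access rules change,
# and a polite scraper would also throttle its requests.
from io import StringIO

import pandas as pd
import requests


def fetch_kenpom_year(year: int) -> pd.DataFrame:
    url = f"https://kenpom.com/index.php?y={year}"
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text
    df = pd.read_html(StringIO(html))[0]  # first (and main) table on the page
    df["Season"] = year
    return df


# Historical ratings go back to 2002, so loop over the seasons of interest.
ratings = pd.concat([fetch_kenpom_year(y) for y in range(2002, 2022)], ignore_index=True)
```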
Takeaway 2 – the “80/20 cleaning rule” rules
One of the rules of thumb for data science is that 80% of the work is data cleaning and preparation. Even with only two data sources, that held true in my scenario. The largest chunk of code in this project is dedicated to things like parsing strings to extract key numbers such as seed and win count, whittling down the noise that I didn’t want in my training set, and matching up university names so that the data could join properly (N Dakota State vs North Dakota St, for example).
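Two of those chores, sketched with made-up names (the seed format shown matches Kaggle’s, e.g. “W01” or “X16a”; the name map is illustrative):

```python
# Illustrative cleaning helpers: pull the numeric seed out of Kaggle-style
# seed strings and map school-name variants onto a single spelling.
import re


def parse_seed(seed_str: str) -> int:
    """'W01' -> 1, 'X16a' -> 16 (region letter and play-in suffix dropped)."""
    return int(re.search(r"\d+", seed_str).group())


# Hand-maintained map for names that don't match across sources.
NAME_FIXES = {
    "N Dakota State": "North Dakota St",
    "NDSU": "North Dakota St",
    # ...one entry per mismatched school spelling
}


def normalize_name(name: str) -> str:
    return NAME_FIXES.get(name.strip(), name.strip())
```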
Once I got the data into the right format, I derived new features for training. These aren’t complex; for example, I use the difference in seed between the two teams rather than each team’s individual seed. But it all adds to the prep code base.
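In code, that derivation can be as small as this (column names here are illustrative):

```python
# Illustrative matchup-level feature derivation: feed the model differences
# between the two teams rather than each team's raw values.
import pandas as pd


def matchup_features(team_a: pd.Series, team_b: pd.Series) -> dict:
    return {
        "seed_diff": team_a["seed"] - team_b["seed"],
        "win_diff": team_a["wins"] - team_b["wins"],
        "efficiency_diff": team_a["adj_efficiency"] - team_b["adj_efficiency"],
    }
```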
Takeaway 3 – estimates in, estimates out
It’s well understood that the predictions that come out of an ML algorithm are estimates. It’s just as important to recognize that the feature inputs may be estimates, too. Sure, there are historical facts like the number of wins a team had in the season. But the seeding is itself an estimation process by the expert panel, and the advanced metrics are ultimately estimates meant to measure a team’s quality.
The Big 10 was supposedly the best conference in the NCAA this year. The seeding process reflected this, and Ken Pomeroy’s metrics supported it. Since these were the key inputs to my ML process, the predictions were also skewed toward the Big 10.
It turned out that the experts and their estimates were wrong. The Big 10 teams didn’t perform well in the tourney, and neither did my model in many cases. My model deemed Ohio State not only a likely Final Four competitor but the likely tourney winner; Ohio State lost to a #15 seed in the first round. I guess that estimate was wrong!
This takeaway is a polite way to say garbage in, garbage out. It’s important to recognize where your inputs are estimates; your model will only be as good as those estimates turn out to be in a given year.
Takeaway 4 – collaborative intelligence is the goal
As a general rule, I have a strong preference for simple models with quality inputs over more complex models with broader inputs. A primary reason for this is that domain experts have an easier time reasoning about the model. My ultimate goal is collaborative intelligence – a combination of artificial and human intelligence. I’ll give you a couple of examples from this ML project.
My first pass at the model a few years ago had a #14 seed doing very well in the tournament, winning most of its head-to-head matchups despite its low seed. Since a seed that low has never performed that well, I was skeptical. Fortunately, my relatively simple model made it easy to identify the issue: this team had won nearly all of its games and its advanced metrics were quite good, but its schedule was weak and my model didn’t account for that. Since KenPom.com also includes a strength of schedule metric, I added that as a feature and the #14 seed went back to its appropriate place.
In the lead-up to this year’s tournament, the algorithm with the best predicted accuracy on my test set had two things that didn’t seem quite right. First, the tourney’s prohibitive favorite, Gonzaga, was predicted to lose quite a few games. Second, a team’s offensive efficiency metric (per KenPom.com) was of negligible importance to the model (which explained Gonzaga’s predicted performance). Because this didn’t seem right, I chose a different algorithm with slightly lower accuracy but better alignment with my expectations for feature importance and game outcomes. While there were still challenges with this model, I’m confident it performed better than the ‘best’ one.
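That kind of sanity check doesn’t require anything elaborate. A sketch of comparing candidate models on both cross-validated accuracy and the features they actually lean on (the model choices and the `X`/`y` data names are placeholders):

```python
# Sketch of comparing candidate models on accuracy *and* feature importance,
# so a human can veto a model that ignores features it shouldn't.
# X (features) and y (labels) are placeholders for the prepared training data.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score


def compare_candidates(X: pd.DataFrame, y: pd.Series) -> None:
    candidates = {
        "gbm": GradientBoostingClassifier(),
        "forest": RandomForestClassifier(n_estimators=500),
    }
    for name, model in candidates.items():
        accuracy = cross_val_score(model, X, y, cv=5).mean()
        importances = pd.Series(model.fit(X, y).feature_importances_, index=X.columns)
        print(name, round(accuracy, 3))
        print(importances.sort_values(ascending=False).head())
        # A model that barely uses offensive efficiency may still score well,
        # but it's harder to trust; that judgment call stays with the human.
```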
But does it help??
Given my limited knowledge of the teams at the start of March Madness, having an ML solution that makes predictions from historical results and a set of advanced metrics is very helpful. It doesn’t mean I’m winning the pools I play in, but I’m usually competitive, and the ML predictions provide another interesting angle on an already very interesting event! Even though the 2021 model misfired on the Big 10 teams, it did help me identify some lower seeds that kept me competitive.
The model that I’ve been using is basic and certainly could be improved with more investment. That said, it’s a helpful tool in my toolbox and these takeaways align quite well with other ML work. A key affirmation is the collaborative intelligence goal… more on this in future posts.
Picture details: July 4, 2020, Apple iPhone 7, f1.8, 1/613 s, ISO-20