Learning Data Science through Kaggle

Arb

The two MOOC classes I wrote about in my last post introduced me to Kaggle.com, and that was one of the best things about them. Kaggle is a web site for data science. It is primarily used for competitions; some are learning focused, while others are industry competitions where winners get significant payouts and/or jobs.

Each of the two MOOC classes required a Kaggle project. If I remember correctly, one of the classes had a ‘private’ project for the class and the other required you to do one of the handful of open projects on the site. Hands-on practice with feedback was a big benefit of these Kaggle projects.

A typical pattern for Kaggle projects is the following:

  • Training phase
    • The project provides a training dataset for you to download to use in a machine learning (ML) model.
    • You submit your ML model to the site and a test dataset is executed against your model. Your results are then scored based on how well your model made predictions, and you are ranked accordingly on a leaderboard.
  • Execution phase
    • When the training phase completes, you submit one or more final models.
    • During the execution phase, your model is run against either ‘live’ data or another dataset that was held back from the training phase.
    • As in the training phase, your results place you on a leaderboard. Winners earn pride, money, or jobs!

For example, I’m a fan of the NCAA men’s basketball tournament and will compete in the annual Kaggle competition. During the training phase, Kaggle provides historical tournament and team data. I believe the test dataset during the training phase is last year’s tournament (I typically don’t submit during the training phase).

Once the current year’s tournament seeding is done, the site provides a dataset showing all possible game combinations. Before the tournament starts, each competitor needs to submit that dataset with a probability of a win for ‘Team 1’ for each possible game. The scoring is interesting because it rewards you for your confidence in a winning team, where confidence is indicated by a high probability for the game. But it also punishes you when a team you were confident in loses.
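To make that reward-and-punish behavior concrete, here is a minimal sketch. I don't name the competition's actual metric above, so this assumes log loss, which is one common scoring rule with exactly this property (lower is better). The function name `log_loss` and the sample probabilities are made up for illustration.

```python
import math


def log_loss(predictions, outcomes, eps=1e-15):
    """Average log loss over a set of games (lower is better).

    predictions: predicted probability that 'Team 1' wins each game
    outcomes:    1 if 'Team 1' actually won, 0 if it lost
    Assumption: log loss stands in for the competition's real metric.
    """
    total = 0.0
    for p, y in zip(predictions, outcomes):
        # Clamp so a prediction of exactly 0 or 1 cannot produce log(0).
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(predictions)


# One game, three scenarios (illustrative numbers only).
confident_right = log_loss([0.9], [1])  # said 90% and Team 1 won
coin_flip = log_loss([0.5], [1])        # said 50%, no confidence either way
confident_wrong = log_loss([0.9], [0])  # said 90% and Team 1 lost

print(round(confident_right, 3))  # 0.105
print(round(coin_flip, 3))        # 0.693
print(round(confident_wrong, 3))  # 2.303
```

Under this assumed metric, being confidently wrong costs far more than being confidently right gains over a coin flip, which is why pushing every pick to 0.99 is a risky strategy.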

Can you become proficient at data science using Kaggle.com as the primary education tool? For me, it was an excellent supplementary tool and one that could be used to advance my skills. Combined with the MOOC coursework, it was very helpful… give it a try!

Picture detail: October 27, 2018 – University of Minnesota Arboretum, Chanhassen, MN – iPhone 7, f/1.8, 1/120 s