In the first part of this article, we showed that intuitive approaches to horse racing prediction, such as picking the winner at random or always backing the favorite, lose money in the long run. Our motivation is to find an edge and build profitable models. The purpose of this second part is to apply an approach based on machine learning and deep learning.
Written by Idrissa Ndiaye & Koffi Cornelis
Drawing on some horse racing knowledge, we first created useful features (feature engineering) to improve the performance of the machine learning models.
Then, we rebuilt our training set incrementally, month by month, to benefit from each horse's most recent results.
Finally, we developed a tree-based model (LightGBM) and a neural network, and evaluated the profit or loss of each for every month.
In this article we will focus on ensemble models.
Here is the tech stack used for this project:
This horse racing use case relies on machine learning algorithms. Since we are GCP-certified, we did all our machine learning development on GCP.
We used Cloud Storage to store our data, notebooks from AI Platform to write our code and Compute Engine to create and run our virtual machine.
To reproduce the results, here is a link to our GitHub repository with the source code.
This part shows how we handle feature creation using a data pipeline written in Python. It also explains our approach and how we create our training and test sets.
The initial dataset comes from Kaggle and covers races in Hong Kong from 1997 to 2005. It consists of 6,348 races with 4,405 runners. Initially, we had 104 different features, but since we had access to the date of the race, we could create a lot more features.
The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering.
— Luca Massaron
We created features that measured :
We ended up with 1,618 features after this step. We used a complete pipeline in which several functions each build multiple features. The Python file is called extract_features, and you can find it here.
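To illustrate how the race date makes history-based features possible, here is a minimal sketch. It is not the extract_features pipeline itself: the field names (`horse_id`, `race_date`, `won`) and the three features are assumptions chosen for the example. The one point it tries to make is that each feature should only use races that happened strictly before the race being described.

```python
from collections import defaultdict
from datetime import date


def add_history_features(runs):
    """Return copies of `runs` enriched with features built from earlier races only.

    `runs` is a list of dicts with hypothetical keys: horse_id, race_date, won.
    """
    starts = defaultdict(int)  # races run so far, per horse
    wins = defaultdict(int)    # races won so far, per horse
    last_seen = {}             # date of the previous race, per horse
    out = []
    for run in sorted(runs, key=lambda r: r["race_date"]):
        horse = run["horse_id"]
        row = dict(run)
        row["prev_starts"] = starts[horse]
        row["prev_win_rate"] = wins[horse] / starts[horse] if starts[horse] else 0.0
        row["days_since_last_race"] = (
            (run["race_date"] - last_seen[horse]).days if horse in last_seen else None
        )
        out.append(row)
        # Update the history only after the row is built,
        # so that no feature can see the result of its own race.
        starts[horse] += 1
        wins[horse] += run["won"]
        last_seen[horse] = run["race_date"]
    return out


# Made-up runs, for illustration only.
runs = [
    {"horse_id": 7, "race_date": date(2004, 11, 3), "won": 1},
    {"horse_id": 7, "race_date": date(2004, 12, 1), "won": 0},
    {"horse_id": 7, "race_date": date(2005, 1, 5), "won": 0},
    {"horse_id": 9, "race_date": date(2004, 12, 1), "won": 1},
]
features = add_history_features(runs)
```

The same pattern (sort by date, read the history, then update it) extends to jockeys, trainers or any other entity that appears in several races.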
As explained before, we need to compare the ML results with our previous baseline models, which can be found here. We therefore evaluate on exactly the same races, to determine which of the baseline and ML models is the most profitable.
Our test set goes from January 2005 to August 2005.
We train a separate model for each test month, each on an incrementally longer period of time, so that every prediction benefits from the most recent results. We then combine the monthly results to compare them with our baseline models.
The graph below shows the number of races available that we will use for training and testing.
Set 1 :
Set 2 :
This continues until we reach the eighth set.
Just remember that set #4 doesn’t exist because there are no races in April 2005.
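The construction of the sets can be sketched as follows. This is our reading of the approach rather than the repository code: we assume that each set trains on every race before its test month and is numbered after that month, which is why a month without races leaves a gap in the numbering. The `race_date` key and the function name are hypothetical.

```python
from datetime import date


def monthly_sets(races, first_test_month, last_test_month, year=2005):
    """Build (set_number, train, test) splits, one per test month.

    `races` is a list of dicts with a hypothetical `race_date` key.
    """
    sets = []
    for month in range(first_test_month, last_test_month + 1):
        start = date(year, month, 1)
        # First day of the following month (rolls over to next year after December).
        end = date(year + (month == 12), month % 12 + 1, 1)
        test = [r for r in races if start <= r["race_date"] < end]
        if not test:
            continue  # no races this month, so no set with this number
        train = [r for r in races if r["race_date"] < start]
        sets.append((month, train, test))
    return sets


# Made-up race dates, with nothing in April.
races = [
    {"race_date": date(2004, 12, 15)},
    {"race_date": date(2005, 1, 9)},
    {"race_date": date(2005, 2, 13)},
    {"race_date": date(2005, 3, 6)},
    {"race_date": date(2005, 5, 1)},
]
splits = monthly_sets(races, 1, 5)
set_numbers = [number for number, _, _ in splits]
train_sizes = [len(train) for _, train, _ in splits]
```

With this toy data, the training set grows by one month at each step and no set is produced for April.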
This second part briefly explains how betting works in horse racing, then focuses on our model-based approach and our metrics.
As explained in the first part of this use case (here), there are two ways of betting: for each race, we can bet on the winner or on placed horses.
If it is not very clear, feel free to check here for a full explanation.
For both approaches, winner and placed bet, we will use two different ways of prediction:
We proceed as follows:
For example if the prediction for a race is:
Well, in this case we chose to focus on two things:
The goal of this model is to improve the profit we had before with our baseline models (part 1).
In this part, we focus only on winner bets.
This section will be split between explanations and results:
First we define our hyperparameter search space: the range of values our optimizer can select from. We use the TPE algorithm implemented in HyperOpt.
Then we evaluate candidate parameter sets over multiple rounds, scoring each one with our profit metric.
To find the best parameters, we use our first set of data. We save all those rounds in a CSV file and we retrieve the best parameters from it.
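The "save every round, then read back the best one" step can be sketched with the standard library alone. The column names and the three rows below are made up for illustration, and `best_round` is a hypothetical helper. In the real workflow, each row would be one round evaluated by the optimizer.

```python
import csv
import io


def best_round(csv_file, metric="profit"):
    """Return the row (a dict of strings) with the highest value in `metric`."""
    rows = list(csv.DictReader(csv_file))
    return max(rows, key=lambda row: float(row[metric]))


# Hypothetical log of three rounds; a real run would open the CSV file from disk.
log = io.StringIO(
    "round,num_leaves,learning_rate,profit\n"
    "1,31,0.10,-4.2\n"
    "2,63,0.05,12.9\n"
    "3,15,0.20,3.1\n"
)
best = best_round(log)

# CSV values are strings, so convert them back before passing them to a model.
params = {
    "num_leaves": int(best["num_leaves"]),
    "learning_rate": float(best["learning_rate"]),
}
```

Keeping every round on disk, rather than only the winner, also makes it possible to check later how sensitive the profit is to each parameter.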
Here are the final best parameters for the LGBM, for which the profit metric indicates $27.4 on set #1.
For LightGBM, we used the parameters found earlier by the hyperparameter optimization. We trained all our LGBM models, for all sets, with the same optimized parameters (the best parameters from set #1). Once a model is trained, we save it.
We will focus on the first set where the test set is January 2005.
We can see below the evolution of our profit with an initial investment of $100. After 20 bets, we have around $20 of profit, but after the 70th bet the profit is around -$13.
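A profit curve of this kind can be computed as in the sketch below. The stake size and the payout convention are assumptions on our side: we use a flat $1 stake and decimal odds, so a winning bet returns the stake multiplied by the odds and a losing bet costs the stake. The bets in the example are made up.

```python
def bankroll_curve(bets, initial=100.0, stake=1.0):
    """Bankroll after each bet. `bets` is a list of (decimal_odds, won) pairs."""
    bankroll = initial
    curve = []
    for odds, won in bets:
        # Net gain of a winning bet is stake * (odds - 1); a losing bet costs the stake.
        bankroll += stake * (odds - 1.0) if won else -stake
        curve.append(bankroll)
    return curve


# Made-up sequence: a loss, a win at odds of 6.0, then another loss.
curve = bankroll_curve([(3.5, False), (6.0, True), (2.1, False)])
profit = curve[-1] - 100.0
```

Plotting `curve` against the bet index gives the same kind of graph as the one described above.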
Here are the results for set #1 with LGBM:
The last two lines show that we are not always betting on the favorite horse. We keep these two indicators in the analyses that follow.
The feature importance of the LGBM model for January is displayed below. Features related to “win_odds” and “declared_weight” appear quite often.
We used the default feature importance calculation, which counts the number of times each feature is used in the model.
First we train our deep learning models for each set. We use a single hidden layer of 96 neurons, 40 epochs and a batch size of 100. The image below shows the architecture of the network.
We saved the model for each set. The figure below shows the deep learning results for the January 2005 test set.
Here are the results for set #1 with deep learning:
Important note: the average win odds for this model are about twice those of the favorite horses.
In other words, a winning bet placed with the deep learning model pays, on average, about twice as much as a winning bet on the favorite.
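Higher odds only pay off if the model still wins often enough. The short sketch below makes the trade-off explicit under simplifying assumptions that are ours: unit stakes, decimal odds, and a single odds value standing in for the average. Under these assumptions the break-even win rate is 1 / odds, so doubling the odds halves the win rate needed to avoid losing money.

```python
def break_even_win_rate(decimal_odds):
    """Win rate at which unit-stake bets at these odds make neither profit nor loss."""
    return 1.0 / decimal_odds


def expected_profit_per_unit(win_rate, decimal_odds):
    """Expected net profit per unit staked: a win returns the odds, and every bet costs 1."""
    return win_rate * decimal_odds - 1.0


# Illustrative odds only, not values taken from the tables.
needed_at_4 = break_even_win_rate(4.0)
needed_at_8 = break_even_win_rate(8.0)
edge = expected_profit_per_unit(0.25, 4.0)
```

This is why a model with a lower success rate than the favorite strategy can still be the more profitable one.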
To obtain our final result, we take all models from both algorithms and we predict the winner for each set and each race.
The table below shows :
Here are the final results:
The ensemble model is very useful here, because the profit of the consolidated model is higher than that of either of the two other models. However, its success rate is lower than that of the LGBM.
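The article does not detail how the two models are combined, so the sketch below shows one common option rather than the method actually used: a weighted average of the two predicted win probabilities for each runner, followed by a bet on the runner with the highest blended value. The function name, the `weight` parameter and the probabilities are all hypothetical.

```python
def ensemble_pick(lgbm_probs, nn_probs, weight=0.5):
    """Blend two sets of win probabilities for one race and pick the top runner.

    Both arguments map horse number -> predicted win probability.
    `weight` is the share given to the first model.
    """
    blended = {
        horse: weight * lgbm_probs[horse] + (1.0 - weight) * nn_probs[horse]
        for horse in lgbm_probs
    }
    return max(blended, key=blended.get), blended


# Made-up probabilities for a three-horse race where the two models disagree.
lgbm_probs = {1: 0.5, 2: 0.3, 3: 0.2}
nn_probs = {1: 0.2, 2: 0.5, 3: 0.3}
pick, blended = ensemble_pick(lgbm_probs, nn_probs)
```

In this example each model favors a different horse, and the blended probabilities settle the disagreement in favor of horse 2.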
In this next part, we will only focus on placed horses.
We continue to use the sets-based approach, with the same 470 races as before.
This time, we bet 2 or 3 times per race, depending on the number of runners, as explained in the previous part here. This corresponds to 746 bets.
For placed bets, we discard all races where placed odds are unavailable. This missing data reduces the number of races per set (no races remain for the last two sets).
For the placed approach, we decided not to train another LGBM model; we keep the “winner” models with the same optimization.
We will predict the probabilities for each class and select 2 or 3 horses depending on the number of runners.
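Selecting the horses to back can be sketched as below. The rule linking the number of runners to the number of picks is explained in the first part and not repeated here, so the threshold is a hypothetical parameter (`three_place_min_runners`) with an arbitrary default, and the probabilities are made up.

```python
def placed_picks(probs, three_place_min_runners=7):
    """Return the 2 or 3 horses with the highest predicted probability.

    `probs` maps horse number -> predicted probability for one race.
    `three_place_min_runners` is an assumed threshold, not a value from the article.
    """
    n_picks = 3 if len(probs) >= three_place_min_runners else 2
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:n_picks]


# Made-up probabilities for a small field and a large field.
small_field = {1: 0.10, 2: 0.40, 3: 0.05, 4: 0.30, 5: 0.15}
large_field = {1: 0.05, 2: 0.30, 3: 0.10, 4: 0.20, 5: 0.05, 6: 0.15, 7: 0.10, 8: 0.05}
picks_small = placed_picks(small_field)
picks_large = placed_picks(large_field)
```

Each selected horse is then treated as one separate bet.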
For this type of bet, we don’t use the odds for a win (“win_odds”) but the ones for a placed horse (“place_odds”). Naturally, place_odds is lower than win_odds, since it’s easier to predict a placed horse than a winner.
We can see below the graph of the evolution of our profit for the second set corresponding to February 2005.
For deep learning, we decided to train new models. We use the same neural network but instead of having one class to predict the winner for each race, we now have 2 or 3 to predict placed horses. We use those new models to predict placed horses for all our sets.
We can see below the results for February 2005.
This month specifically, the model detected outsider horses (cf. the last two lines of the table): the mean win odds are 10.48.
That’s why our profit is so high compared to other months.
We proceeded exactly as before, and the final table with all results for placed bets is shown below. These results only cover the first 5 sets, because the last 2 have no races available due to the missing “place_odds”.
We can see that only the second set (February) has a positive profit, and it is large enough to compensate for all the losses of the other sets.
Here is a comparison of all percentages:
The ensemble model is less useful here, because the profit of the consolidated model is lower than that of either of the other two models. However, the ensemble attenuates large losses where deep learning and LGBM were less profitable (sets #1 and #3). Furthermore, its success rate is lower than that of the LGBM but higher than that of the deep learning model.
The goal was initially to compare baseline models from part 1 with our ensemble model from this part. The table below shows a comparison between them on various factors :
We can conclude that machine learning yields a better profit than the basic ideas behind our baseline models.
We can also notice that our results with ensemble models seem consistent. Even if we win fewer bets than with the favorite methods (cf. winning_rate_bet in the table above), the bets we win are on non-favorite horses, which are more profitable.
However, one question might nag at you: how can we win $336.7 on placed horses with a positive profit for only one set? We agree this is rather strange, but sometimes the art of betting doesn’t rely only on statistics and features.
Luck remains the most important feature and we guess we had some luck there.
The code on GitHub is written for GCP, but feel free to ask questions; we will be happy to help.
GitHub horse racing prediction repository