
April 28, 2022

*Written by **Koffi Cornelis** & **Idrissa Ndiaye***

The aim of this article is to use a dataset of horse racing in Hong Kong to **predict the winner** or the **Top 3** (placed) of each race. Our goal is to see **how much money we will win or lose**, based on predictions from different approaches.

We will go through data preprocessing, visualization and modeling using pandas, scikit-learn and TensorFlow.

The dataset comes from Kaggle and covers races in Hong Kong from 1997 to 2005.

The data consists of **6,348 races** with **4,405 runners**.

The 5,878 races run before January 2005 are used to develop the forecasting models, whereas the remaining 470 races (run after January 2005) are held out for out-of-sample testing.
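A date-based split like this one can be sketched in pandas as follows. This is only an illustration on a made-up four-row table: the exact cutoff day is an assumption (the text only says "January 2005"), and the variable names `train_races` and `test_races` are ours.

```python
import pandas as pd

# Toy stand-in for the race table (made-up rows, not the Kaggle data).
races = pd.DataFrame({
    "race_id": [0, 1, 2, 3],
    "date": ["1997-06-02", "2004-12-29", "2005-01-05", "2005-03-12"],
})
races["date"] = pd.to_datetime(races["date"])

# Assumed cutoff: the article only mentions "January 2005".
CUTOFF = pd.Timestamp("2005-01-01")

# Older races are used to fit models, newer ones are held out for testing.
train_races = races[races["date"] < CUTOFF]
test_races = races[races["date"] >= CUTOFF]

print(len(train_races), "training races,", len(test_races), "test races")
```

Splitting on the date rather than at random keeps every test race strictly later than the training races, which mirrors how bets would be placed in practice.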

The original dataset contains relevant features:

- **race_id**: unique identifier for the race
- **date**: date of the race, YYYY-MM-DD
- **venue**: a 2-character string representing which of the 2 venues the race took place at: ST = Shatin, HV = Happy Valley
- **distance**: distance of the race, in meters
- **surface**: a number representing the type of race track surface: 1 = dirt, 0 = turf
- **horse_id**: unique identifier for the horse
- **horse_age**: current age of the horse at the time of the race
- **result**: finishing position of the horse
- **declared_weight**: declared weight of the horse and jockey, in lbs
- **actual_weight**: actual weight carried by the horse, in lbs
- **draw**: post position number of the horse in this race

These features define the characteristics of a horse for a particular race.

The full list of all the variables can be found here.

There are of course some important variables which lead to a better winning rate. We have selected 3 among them to display their link to the winning rate:

- **draw position**
- **age of the horse**
- **odds (or favorite horse)**

One thing we have already detected is the “**draw advantage**”. Horses with a draw position from 1 to 3 are 3% more likely to win the race than horses drawn at position 10 and above. The smaller the draw position, the shorter the distance run, since the horse is closer to the inside of the curve.
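As a rough sketch of how such a comparison could be computed, the snippet below groups runners by draw bucket and takes the share of winners in each bucket. The rows are made up for illustration, and the middle bucket (draws 4 to 9) is an assumption, since the text only compares draws 1 to 3 with draws 10 and above.

```python
import pandas as pd

# Toy rows (made-up values), one row per horse per race.
runs = pd.DataFrame({
    "race_id": [0, 0, 0, 0, 1, 1, 1, 1],
    "draw":    [1, 2, 11, 12, 1, 3, 10, 12],
    "result":  [1, 2, 3, 4, 2, 1, 3, 4],
})

# A horse won if its finishing position is 1.
runs["won"] = runs["result"] == 1

# Bucket the draw; intervals are (0, 3], (3, 9] and (9, inf).
runs["draw_group"] = pd.cut(
    runs["draw"],
    bins=[0, 3, 9, float("inf")],
    labels=["1-3", "4-9", "10+"],
).astype(str)

# Share of winners per draw bucket.
win_rate_by_draw = runs.groupby("draw_group")["won"].mean()
print(win_rate_by_draw)
```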

We can see below a tangible example of draw position: the runner at position 1 will cover less distance than the one at position 6.

When finishing time is normalized by distance, we observe the same trend: the lower the draw, the quicker the horse finishes, as the following graph shows.
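One possible way to normalize finishing time by distance is to divide seconds by meters and average per draw. The column name `finish_time` and all values below are assumptions made for the sake of the example.

```python
import pandas as pd

# Toy rows (made-up values); "finish_time" in seconds is an assumed column name.
runs = pd.DataFrame({
    "draw":        [1, 1, 8, 8],
    "distance":    [1200, 1650, 1200, 1650],
    "finish_time": [70.2, 99.0, 70.8, 100.65],
})

# Seconds per meter makes races of different lengths comparable.
runs["sec_per_m"] = runs["finish_time"] / runs["distance"]

# Average normalized time for each draw position.
pace_by_draw = runs.groupby("draw")["sec_per_m"].mean()
print(pace_by_draw)
```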

The horse age has some effect on the winning probability.

Here, apart from horses aged 9 and above, we can conclude that horses aged 3-4 are more likely to win.

*Win odds reflect the public intelligence on the expectation of horses in a single race. In a **pari-mutuel betting system**, the larger the win pool of a horse, the lower its win odds.*
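To make the pari-mutuel idea concrete, here is a minimal sketch in which the pool (minus the operator's cut) is shared among the backers of each horse. The function name `parimutuel_odds`, the `takeout` parameter and the amounts are hypothetical; the real rules of the Hong Kong pools are not described in this article.

```python
def parimutuel_odds(pool_by_horse, takeout=0.0):
    """Return the payout per $1 staked for each horse.

    pool_by_horse: amount wagered on each horse to win.
    takeout: fraction of the pool kept by the operator (assumed value).
    """
    total_pool = sum(pool_by_horse.values())
    net_pool = total_pool * (1 - takeout)
    # The more money on a horse, the smaller the payout per dollar.
    return {horse: net_pool / amount for horse, amount in pool_by_horse.items()}


# Made-up win pools for three horses.
odds = parimutuel_odds({"A": 500.0, "B": 300.0, "C": 200.0}, takeout=0.2)
print(odds)
```

The horse that attracts the most money ("A" here) ends up with the lowest odds, which is why the favorite is simply the horse with the smallest “win_odds”.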

Unsurprisingly, the favorite horses (odds between 1-2) have a 50% increased chance of winning compared to outsiders.

Unfortunately, we will demonstrate later that betting on the horse with the lowest odds (“win_odds”) results in a negative net gain in many cases.

Now that we have explored data, we will build baseline models thanks to those insights in order to try to bet accurately.

After this introduction to the dataset, let’s dive into our real goal, which is betting, and see if we can win some money. For this part, we will only use the out-of-sample test set (races after January 2005), as those races are more recent.

First thing to know is that there are a lot of ways of making money by betting in a horse race. We will only focus on two of them:

- betting on the winner
- betting on placed horses

Betting on the winner is pretty straightforward: before the start of a race, you pick a horse and bet money that it will win. If it loses, you lose your stake, but if it wins you win big: you receive the amount of money you bet multiplied by the “win_odds” of this horse for this race.

Now, betting on placed horses is when you pick a horse or several horses, and bet that those picks will be **placed in the top 3**. In this case, you win the amount of money you bet multiplied by the “place_odds” of this or those horses for this race. It is far easier to win a bet with a placed horse than betting on the winner, hence place_odds are always lower than win_odds.
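The payout rule described above can be written as a small helper. This is a sketch: `bet_profit` is a name we introduce, and it assumes that the odds already include the return of the stake, as the description "amount bet multiplied by the odds" suggests.

```python
def bet_profit(stake, odds, bet_won):
    """Net result of one bet: payout minus stake if it wins, minus the stake otherwise."""
    if bet_won:
        return stake * odds - stake
    return -stake


# A $1 win bet at win_odds 5.5 that comes in, and the same bet losing.
print(bet_profit(1.0, 5.5, True))   # net gain on a winning bet
print(bet_profit(1.0, 5.5, False))  # the stake is lost
```

The same function works for placed bets by passing the horse's “place_odds” instead of its “win_odds”.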

The number of picked horses depends on the amount of runners:

- Fewer than 7 runners: up to 2 horses picked for top 2
- More than 7 runners: up to 3 horses picked for top 3

Now, we will see whether, without machine learning, we can make a positive profit when we stake $1 on each bet. We will test those two types of betting with 3 very basic approaches:

- Betting randomly
- Betting on the favorite
- Betting according to the draw position

The focus will be on bets where we try to predict the winner. For this type of betting, we will bet on the **470 different races** in the test set because we have win_odds available for all those races. Profit will be used to assess the efficiency of the model.

**Random**… Well, it means we have just arrived at the racecourse, we pick a random number on the starting line and we expect the selected horse to win. Applying these random guesses to all the races in our test data gives the following results.

We bet $1 on each race, picking the horse randomly: in total we bet on 470 horses (1 per race).

We won only 17 bets, for a total revenue of $266.4 and a negative profit of -$203.6 (revenue minus the $470 staked).
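The random baseline can be sketched as below. The two races are made up, the function name `random_winner_baseline` is ours, and the fixed seed is only there to make the run repeatable; it is not necessarily how the original experiment was run.

```python
import random

# Toy races (made-up numbers): each runner has a draw, win_odds and result.
races = {
    101: [
        {"draw": 1, "win_odds": 2.5, "result": 2},
        {"draw": 2, "win_odds": 6.0, "result": 1},
        {"draw": 3, "win_odds": 12.0, "result": 3},
    ],
    102: [
        {"draw": 1, "win_odds": 1.8, "result": 1},
        {"draw": 2, "win_odds": 9.0, "result": 3},
        {"draw": 3, "win_odds": 4.0, "result": 2},
    ],
}


def random_winner_baseline(races, stake=1.0, seed=0):
    """Bet `stake` on one randomly chosen horse per race."""
    rng = random.Random(seed)
    wins, revenue = 0, 0.0
    for runners in races.values():
        pick = rng.choice(runners)
        if pick["result"] == 1:
            wins += 1
            revenue += stake * pick["win_odds"]
    bets = len(races)
    # Profit is what we collected minus everything we staked.
    return {"bets": bets, "wins": wins, "revenue": revenue,
            "profit": revenue - stake * bets}


summary = random_winner_baseline(races)
print(summary)
```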

Let’s try another approach: betting on the favorite horse.

Who has never agreed with the majority? The odds for a race are posted well before it takes place and can vary until it begins. In a pari-mutuel system, these odds are driven by the money the public puts on each horse, so a horse that the crowd considers very likely to win will not carry high odds.

Here, for each race we pick the horse with the lowest “**win_odds**” as the favorite horse.

We can see below the result for all 470 bets we did.

So it was better than picking a random horse, meaning the favorite horse is really a thing: we won about 1 bet in 3, but it’s not enough to make a positive profit (-$61.7).

Now let’s focus on the draw. The **draw** is the horse position at the beginning of the race.

Here, we have bet on the horse placed at the 2nd draw position when the race starts.

Still with our 470 bets, we have a negative profit. However, the share of winning bets is slightly higher than with the random method: draw position appears to be a useful variable for determining the winner.
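The favorite and draw strategies differ only in how the horse is picked, so both can be run through the same loop. The sketch below uses three made-up races; `evaluate`, `pick_favorite` and `pick_draw_2` are names introduced for the illustration.

```python
def evaluate(races, pick, stake=1.0):
    """Bet `stake` on the horse returned by `pick` in each race, settled with win_odds."""
    wins, revenue = 0, 0.0
    for runners in races:
        horse = pick(runners)
        if horse["result"] == 1:
            wins += 1
            revenue += stake * horse["win_odds"]
    bets = len(races)
    return {"bets": bets, "wins": wins, "profit": revenue - stake * bets}


def pick_favorite(runners):
    # The favorite is the runner with the lowest win_odds.
    return min(runners, key=lambda r: r["win_odds"])


def pick_draw_2(runners):
    # Assumes every race has a runner at draw 2 (raises StopIteration otherwise).
    return next(r for r in runners if r["draw"] == 2)


# Toy races (made-up numbers).
races = [
    [{"draw": 1, "win_odds": 1.8, "result": 1},
     {"draw": 2, "win_odds": 5.0, "result": 2},
     {"draw": 3, "win_odds": 9.0, "result": 3}],
    [{"draw": 1, "win_odds": 6.0, "result": 3},
     {"draw": 2, "win_odds": 2.6, "result": 1},
     {"draw": 3, "win_odds": 2.2, "result": 2}],
    [{"draw": 1, "win_odds": 3.0, "result": 1},
     {"draw": 2, "win_odds": 2.5, "result": 3},
     {"draw": 3, "win_odds": 7.0, "result": 2}],
]

favorite = evaluate(races, pick_favorite)
draw_2 = evaluate(races, pick_draw_2)
print("favorite:", favorite)
print("draw 2:  ", draw_2)
```

Keeping the picking rule as a function makes it easy to swap in another rule later while leaving the settlement logic untouched.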

We will apply the same models to placed horses. As explained previously, for each race, we will bet on 2 to 3 horses depending on the race. Furthermore, we will not compute our gain with “win_odds”, but with “place_odds”, which corresponds to the odds for betting on placed horses.

We will still use the test set but with less data: we keep only data where the “place_odds” is available. Here, we bet on 249 races, resulting in 746 placed-horse bets in total. The worst scenario is to lose $2 or $3 per race, and the best one is to collect 2 or 3 different “place_odds”.

Without any knowledge about horses, we **randomly** pick 2 or 3 horses per race, and if they finish in the top 2 or the top 3, we receive the “place_odds” for each winning bet.

We win more bets, of course, because we place more bets than previously, and the share of winning bets is also higher. The profit is still negative (-$152.7).

Now we focus on favorite horses, meaning horses with the **lowest “place_odds”**. Let’s see how they perform.

Even with almost 50% of the bets won, it’s not enough to recoup our stakes, and the profit is negative: -$151.5.

For this approach, we are going to pick 2 or 3 horses with the lowest draw:

- If 3 horses: draw position 1, 2 and 3
- If 2 horses: draw position 1 and 2
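A placed-horse version of the draw strategy might look like the sketch below, run on one made-up race. Two points are assumptions on our side: a race with exactly 7 runners is treated as a top-3 race, and a pick counts as placed when its result is within the top 2 or top 3 for that race. The helper names are hypothetical.

```python
def n_picks(n_runners):
    # Assumption: exactly 7 runners is treated like the "more than 7" case.
    return 2 if n_runners < 7 else 3


def lowest_draw_place_bets(runners, stake=1.0):
    """Bet `stake` on each of the lowest-draw horses, settled with place_odds."""
    k = n_picks(len(runners))
    picks = sorted(runners, key=lambda r: r["draw"])[:k]
    profit = 0.0
    for horse in picks:
        if horse["result"] <= k:
            # Placed: collect stake * place_odds, minus the stake itself.
            profit += stake * horse["place_odds"] - stake
        else:
            profit -= stake
    return k, profit


# Toy race with 5 runners (made-up numbers), so 2 horses are picked.
race = [
    {"draw": 1, "place_odds": 1.6, "result": 2},
    {"draw": 2, "place_odds": 2.4, "result": 4},
    {"draw": 3, "place_odds": 1.3, "result": 1},
    {"draw": 4, "place_odds": 3.1, "result": 3},
    {"draw": 5, "place_odds": 5.0, "result": 5},
]

k, profit = lowest_draw_place_bets(race)
print(k, "bets placed, profit:", round(profit, 2))
```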

We can see that we win slightly more bets than with the random method, but the profit is still negative (-$174.6).

None of our approaches made us win money. Is it really a surprise? If a winning recipe were known, I am sure there would be more people like Jeff Bezos and Warren Buffett on Earth.

We can see below a table which summarizes all 3 baseline models with the two types of bets: winner or placed.

A very basic method does not let us guess the winner, or even a placed horse, reliably enough to win money.

However, this basic approach was necessary before diving into machine learning. It gives us a point of comparison to see whether machine learning does better than a naive approach. These models will be used as references for the machine learning models that we will develop in the second part of this use case.

To conclude this first part, we have seen that none of the simplistic models leads to good results. However, the favorite approach gives the least bad profit. Moreover, we have noticed that some variables have a greater impact on the winning rate. In part 2 of this article, we are going to compare these methods with machine learning algorithms.
