Deep learning and horse race prediction #1

Ko
7 min read · Nov 27, 2017


Horse racing prediction had been on my agenda for a long time.
In fact, it was the first thing I wanted to try after studying deep learning for a while.
But I never got around to it because of time constraints.

I mean, everyone wants to train their own model on real data instead of the MNIST dataset once they learn deep learning.

I picked horse racing because a couple of my friends are horse racing lovers and the data was obtainable.

First, I checked for existing research online and found a few attempts.

  1. Predicting Horse Racing Result
    A student team tried this. They scraped data from the Hong Kong Jockey Club, used a linear model, and beat the public intelligence (the closing-odds model).
  2. AMAZON MACHINE LEARNING: HACKING HORSE RACING FOR PROFIT
    The author purchased 30 days of history, around 4,000 individual race results, from Equibase and fed that data into the Amazon Machine Learning service. He said the model came close to a 75% win rate.

I was excited, because the student team used a pretty simple data structure and a simple linear model, which I didn't think even required TensorFlow. And the Amazon article used only 4,000 records, which is absolutely too small for deep learning.

What if I trained a nonlinear model with 10x more data?

Preparing 10x the data might be time-consuming, but defining a nonlinear model sounded easy.

Boy, I was wrong…

Image data has a tensor structure by definition. Language data can be structured into arrays or tensors.

But the data structure of horse racing is completely different from any data structure I knew.

It's fundamentally complicated and unstructured.
It's more like a graph than a table.

There are even more challenges…

The number of horses in a race varies

Sometimes 10 horses compete in a race; sometimes 16 do.
This means the number of output categories varies if I use the winner as the target label.

Most horses run only 15 to 30 races in their lives

Some horses run 60 times in their lives. But even 60 is not a big enough number for statistics or deep learning. This is like a corpus of sentences containing only extremely rare words: I cannot meaningfully convert each horse into a number.

Idea

I read one paper called Single-Image Depth Perception in the Wild.

This paper demonstrated that a model could predict which of two points in a photo is closer in depth, given one photo and the two points as input, with "which is closer in depth" as the target.

Inspired by that paper, I decided to use the time difference between two horses as the target label.

My plan was to create a model that predicts the winner between two horses, then derive the ultimate winner of the race from those pairwise predictions.

With the two-horse model, I no longer need to care how many horses participate in a race, which is good.
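The aggregation step can be sketched as a round-robin over all horse pairs: each pairwise prediction casts a vote, and the horse with the most votes is the predicted winner. This is a minimal illustration, not my actual pipeline; `pairwise_model` is a hypothetical callable standing in for the trained network, assumed to return a positive score when it predicts the first horse beats the second.

```python
import itertools
import numpy as np

def predict_race_winner(horses, pairwise_model):
    """Derive a race winner from pairwise predictions.

    `pairwise_model(a, b)` is assumed to return a score > 0 when
    horse `a` is predicted to beat horse `b`.
    """
    wins = np.zeros(len(horses))
    for i, j in itertools.combinations(range(len(horses)), 2):
        if pairwise_model(horses[i], horses[j]) > 0:
            wins[i] += 1  # horse i collects a pairwise "win"
        else:
            wins[j] += 1
    return int(np.argmax(wins))

# toy example: horses are plain speed numbers, the faster one wins each pair
winner = predict_race_winner([3, 9, 5], lambda a, b: a - b)  # index 1
```

One nice property of this vote-counting scheme is that it works for any field size, which is exactly the point of the two-horse formulation.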

Data

I still needed to decide on a data structure.

Most other attempts used statistical features as input, such as the horse's winning rate, the jockey's winning rate, and the trainer's winning rate.

That sounded okay, but since I wanted to try something new, I decided to include data from each horse's 5 past races, i.e. its track record.

  • The jockey's winning rate is the percentage of races in which the jockey finished 1st or 2nd.
  • The trainer's winning rate is the percentage of races in which the trainer's horses finished 1st or 2nd.

Some blocks of features have internal meaning, but since I didn't know how to design a data structure that reflects it, I simply laid out horse A's data and horse B's data in a row, with the distance of the upcoming race first.

So, of the 105 features in total:
the 1st is about the upcoming race,
the 2nd to 54th are about horse A,
and the 55th to 105th are about horse B.

Here’s a data sample.

{
  odds: 544, // 5.44
  age: 4,
  weight of jockey: 590, // 59.0 kg
  weight of horse: 444, // 444 kg
  change in weight: -2,
  winning rate of trainer: 0.133,
  winning rate of jockey: 0.143,
  time difference from top: 16, // 1.6 seconds
  speed metric for last 600m: 332, // 33.2 seconds
  rank outcome: 7,
  race distance: 1600, // 1600 m
}
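Assembling the flat 105-value input row could look like this minimal sketch. The helper name is hypothetical, and the per-horse feature count of 52 is an assumption chosen so the total comes to 1 + 52 + 52 = 105; the real layout follows the split described above.

```python
import numpy as np

def build_input_row(race_distance, horse_a, horse_b):
    """Concatenate the upcoming race's distance with both horses'
    flattened track-record features into one input vector.
    (Hypothetical helper; assumes 52 features per horse.)"""
    return np.concatenate(([race_distance], horse_a, horse_b)).astype(np.float32)

row = build_input_row(1600, np.zeros(52), np.zeros(52))  # shape (105,)
```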

I purchased 63 years of data (1954 to 2017) from JRA-VAN, a service that sells Japanese horse racing data.

This contains 107,423 races and 1,422,268 individual results.

I wrote a script to calculate each jockey's and trainer's winning rate at a specific point in time, and saved all pairwise combinations of horses for every race into MongoDB.
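The winning-rate computation might look like the following sketch (function and field names are assumptions, and my actual script was in JavaScript). The key point is that only results strictly before the race date are counted, so no future information leaks into a training example.

```python
from datetime import date

def winning_rate(past_results, race_date):
    """Share of past rides finishing 1st or 2nd before `race_date`.

    `past_results` is assumed to be a list of (date, finishing_rank)
    tuples for one jockey or trainer.
    """
    earlier = [rank for d, rank in past_results if d < race_date]
    if not earlier:
        return 0.0  # no history yet at that point in time
    return sum(rank <= 2 for rank in earlier) / len(earlier)

results = [(date(2016, 1, 10), 1), (date(2016, 3, 5), 4),
           (date(2016, 6, 1), 2), (date(2017, 2, 1), 1)]
rate = winning_rate(results, date(2017, 1, 1))  # only the first three rides count
```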

This process took almost a week on my Azure DS1_V2 Standard instance (1 vCPU, 3.5 GB memory), maybe because I wrote the script in JavaScript and didn't optimize it for speed…

Anyway, I prepared 9,245,046 records of data.

Model 1

Since the data was neither time-sequential nor images, I created a simple 3-layer model with Keras/TensorFlow.

from keras.models import Sequential
from keras.layers import Dense, LeakyReLU, Dropout
from keras.callbacks import ModelCheckpoint
from keras import metrics

# define model
model_1 = Sequential()

model_1.add(Dense(128, input_shape=(105,), activation=None))
model_1.add(LeakyReLU(alpha=0.3))
# model_1.add(Dropout(0.2))

model_1.add(Dense(256, activation=None))
model_1.add(LeakyReLU(alpha=0.3))
# model_1.add(Dropout(0.2))

model_1.add(Dense(128, activation=None))
model_1.add(LeakyReLU(alpha=0.3))
# model_1.add(Dropout(0.2))

# single linear output: the (normalized) time difference between the two horses
model_1.add(Dense(1, activation=None))

model_1.compile(optimizer='rmsprop',
                loss='mean_absolute_error',
                metrics=[metrics.mae])

model_1.summary()

# training: save a checkpoint after every epoch
save_model_name = "keiba_model_g1.h5"
checkpointer = ModelCheckpoint(filepath='results/' + save_model_name, verbose=0)

model_1.fit(x=x_train,
            y=y_train,
            batch_size=64,
            epochs=5,
            verbose=1,
            callbacks=[checkpointer],
            validation_split=0.2,
            shuffle=True)

The result was train_mean_absolute_error: 0.1609, val_mean_absolute_error: 0.1584.

Since I chose a float value as the target label instead of a binary one, there is no accuracy metric. I had no idea what 0.1584 meant. Is that good or bad?

So I wrote a test script to check whether each prediction picked the winner.

If both the prediction and the target label are negative, the model correctly predicted that horse A loses.

If both are positive, it correctly predicted that horse A wins.

# y_mean_value and y_std_value are loaded from Mongodb

# de-normalize the test targets
y1 = y_test * y_std_value + y_mean_value

# predict on the test set (not x_train) and de-normalize
pred_normalized = model_1.predict(x_test)
pred1 = pred_normalized.flatten() * y_std_value + y_mean_value

# multiply prediction and actual value; if the signs match,
# the product is positive and the winner was predicted correctly
check_1 = (pred1 * y1 > 0).astype(int)

accuracy1 = 100 * np.sum(check_1) / len(check_1)
print("accuracy1: {}".format(accuracy1))

The result was about 40%.

40%?

I was already a bit worried that the loss was not decreasing, but less than 40%?

This is even worse than flipping a coin.

I may have introduced some bugs, but this clearly showed my model did not learn anything.

I tried batch normalization, dropout, and a few other tweaks, but nothing worked. Since there are only 105 features, I couldn't come up with a better architecture than the simple 3-layer model.

I needed to tweak the data structure.

Model 2

Finally, I decided to use a binary value as the target y.

1 means horse A wins, 0 means horse A loses.
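Deriving this binary target from model 1's time-difference target is a one-liner. The sign convention here (positive difference means horse A finished ahead) is an assumption for illustration.

```python
import numpy as np

# time differences between horse A and horse B
# (positive: A finished ahead; sign convention assumed for illustration)
time_diffs = np.array([-1.2, 0.4, 2.0, -0.1])

# binary target: 1 if horse A wins the pair, 0 otherwise
y_binary = (time_diffs > 0).astype(int)  # [0, 1, 1, 0]
```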

# model 2: same architecture, but with a binary target
from keras.layers import Activation
from keras.optimizers import Adam

model_2 = Sequential()

model_2.add(Dense(128, input_shape=(105,), activation=None))
model_2.add(LeakyReLU(alpha=0.3))

model_2.add(Dense(256, activation=None))
model_2.add(LeakyReLU(alpha=0.3))

model_2.add(Dense(128, activation=None))
model_2.add(LeakyReLU(alpha=0.3))

# sigmoid squashes the single output into (0, 1)
model_2.add(Dense(1, activation=None))
model_2.add(Activation('sigmoid'))

model_2.compile(optimizer=Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0),
                loss='binary_crossentropy',
                metrics=[metrics.binary_accuracy])

model_2.summary()

The only change was adding a sigmoid activation to squash the output between 0 and 1.

The result was train_binary_accuracy: 0.6794, val_binary_accuracy: 0.6832.

Here's the iPython notebook.

This was close to 70%, but that is for picking a winner out of just 2 horses.
Since so many of these pairs are obvious calls for humans, this was still worse than other, simpler attempts.

Conclusion

I prepared 9,245,046 records of data and used Keras to create a 3-layer deep learning model.

But the result was a complete fiasco.

Maybe because…

  1. My script has bugs and the data is broken.
  2. Or horse racing is a product of chance and impossible to predict with deep learning, at least with current deep learning.
  3. Or the data structure I chose was wrong.
    But I don't know how to handle graph-like data.
  4. Or I need to predict the winner of the whole race, or the profit, instead of the winner between 2 horses.
  5. Or something else…

I will try horse racing prediction again in the future, after my skills improve.

But please leave hints in the comments if you have any.

I will try your ideas and share the results in a future post.

Thanks.
