**Every student of statistics I know has at least once thought about making easy money by predicting the stock market or by predicting sports results. To be honest, I certainly was not an exception. Intimidated by**

**uncapped randomness of the stock market, I always tended more to the second option – the sports betting. Nevertheless, it has not been until recently, that I asked the guys from the team and we decided to actually really try**

**if we can make an easy living from predicting football results. [For impatient readers: it looks promising but it is not so easy]**

In the beginning we knew absolutely nothing about how the betting works, where to find the data, neither what some standard prediction methods are. So let me walk you through what we have learned.

### Football betting

The most common bet in football is so called 1×2 – decoded as win of home team, tie and win of away team. For a match, bookmakers present odds for each of the possible outcomes. These odds reflect probabilities of each outcome as seen by the bookmaker plus some small margin making the bookie’s living. Because we used my account at Bwin we were using odds in a decimal format with the following relation to implied probability: probability = 1/odds (so odds 2 means 50%, 2.5 means 40% and so on). The game is very easy if I place a bet on an outcome with odds 2.5 and I win, I get 2.5-times the bet. And zero otherwise. Tempting, right?

### Statistical approach

The simplest way to pick the right bet is by using the expected value. It works as follows: when I place a bet of 100 on a win of the home team with odds being 4.0 and I estimate the probability that the home team actually wins as 0.3 the expected value of my bet is 120 (100 * 4.0 * 0.3). This is good because 120 is bigger than 100 so I am actually winning some money. The term expected value is a bit tricky because in fact I don’t expect to win 120, I rather expect to either win 400 or lose 100. But statistically in a long run, making hundreds of similar bets – I expect to win 120 per match. But that’s not the point here.

So now the task can be broken down into three parts:

- how to estimate the probability of each outcome
- how to decide how much money to place (100 or rather 10?)
- and at what threshold should I place the bet (is 120 as an expected value of a 100 bet enough?).

First we need some historical data.

### Where to get the data

One of the first links in our rather unsophisticated search for “football data” was http://www.football-data.co.uk/, which turned out to be just perfect for our needs. It offers weekly updated fixtures, results, odds and many other match statistics for all major European leagues for up to past 20 seasons.

The files are nicely organised, so it is very convenient to get data for specified leagues and seasons. For example the following code gets data from current season (2016/17) of the Premier League. Please note that all the coding was done in Python 3.5.

import posixpath import pandas as pd root_url = 'http://www.football-data.co.uk/mmz4281/' season = '1617' league = 'E0' # Premiere League data_file_suffix = '.csv' remote_file = posixpath.join(root_url, season, '{0}{1}'.format(league, data_file_suffix)) data = pd.read_csv(remote_file) print(data.shape)

### Predicting match results

Once we had data, we could focus on building a predictive model for match outcomes. One of the first models for this was developed by Maher [1]. Imagine a match played by team* i* as home and team *j* as away. Maher modelled a number of goals scored by home team as a variable with Poisson distribution (X_{ij}) and similarly for the number of goals scored by away team (Y_{ij}). Each of the variable is influenced by different parameters (𝜆 and 𝜇) and they are also assumed to be independent.

Number of goals scored by home team (X_{ij}) is driven by attacking strength of the home team α_{i}, defensive strength of the away team β_{j,} and some home field advantage 𝛾. On the other hand, the number of goals scored by the away team (Y_{ij}) is driven by attacking strength of the away team (α_{j}) and the defensive strength of the home team (β_{i}). Or in other words by a nice formula:

P(X_{ij} = x, Y_{ij} = y|𝜆, 𝜇) = Poisson(x|𝜆) · Poisson(y|𝜇),

where 𝜆 = 𝛾 · α_{i}β_{j} and 𝜇 = α_{j}β_{i}.

Practically it means that we take some relevant historical results and based on them we estimate attacking strength α and defense strength β for each team and also a general home field advantage 𝛾 using maximum likelihood.

Maher’s model was further improved by Dixon and Coles [2]. Firstly they realised that in reality the low score ties (0-0, 1-1) happen more often and the results 0-1 and 1-0 happen less often than they should according to Maher’s model. To correct for that they introduced a function 𝜏(x, y, 𝜆, 𝜇, 𝜌) defined as follows. Note that we get Maher’s model for 𝜌=0.

def tau_function(x, y, lambdaa, mu, rho): if x == 0 and y == 0: tau = 1 - lambdaa * mu * rho elif x == 0 and y == 1: tau = 1 + lambdaa * rho elif x == 1 and y == 0: tau = 1 + mu * rho elif x == 1 and y == 1: tau = 1 - rho else: tau = 1 return tau

Second improvement was that they considered unrealistic that the attacking and defensive strength do not evolve in time so they introduced a time-variant version of the parameters α_{it} and β_{it}. Technically they used exponential forgetting so that the recent results are more important for the model fitting than the historical ones. How quickly the importance of historical results decreases or in other words how quickly they are forgotten depends on a parameter 𝜀. And again we get Maher’s time-invariant model for 𝜀=0.

import numpy as np def get_time_weights(dates, epsilon): delta_days = [(max(dates) - d).days for d in dates] # future games not relevant return list(map(lambda x: 0 if x < 0 else np.exp(-1 * epsilon * x), delta_days))

When we put all of that together we are about to calculate maximum likelihood estimates for parameters of the following model.

P(X_{ij} = x, Y_{ij} = y|𝜆, 𝜇, 𝜌) = 𝜏(x, y, 𝜆, 𝜇, 𝜌) · Poisson(x|𝜆) · Poisson(y|𝜇)

where 𝜆 = 𝛾 · α_{i}β_{j} and 𝜇 = α_{j}β_{i} and all α’s and β’s vary in time. Because we don’t want to optimise the product of exponential functions we will use log-likelihood function and because it’s easier to minimise than maximise (using `scipy.optimize.minimize`

) we will find the estimates by minimising the negative log-likelihood function.

The log-likelihood function looks something like this:

def time_ln_likelihood(values): return sum([(value['time_weights'] * (np.log(tau_function(value['home_goals'], value['away_goals'], value['lambda'], value['mu'], value['rho'])) + (-value['lambda']) + value['home_goals'] * np.log(value['lambda']) + (-value['mu']) + value['away_goals'] * np.log(value['mu']))) for value in values])

Before the optimization we need to make sure the model will be identified by adding the following constraint. The identification condition establishes that the log-likelihood has a unique global maximum.

def norm_alphas(params, number_of_teams): return sum(params[:number_of_teams]) / number_of_teams - 1

And once that is done we can run the optimization, where we address teams by IDs and not names and x0 is an array with the initial set of parameter estimates.

# optimize minimize_result = op.minimize(fun=lambda *args: -time_ln_likelihood(*args), x0=x0, args=(input_data_frame['HomeId'].tolist(), # home teams input_data_frame['AwayId'].tolist(), # away teams input_data_frame['FTHG'].tolist(), # home goals input_data_frame['FTAG'].tolist(), # away goals number_of_teams, get_time_weights(input_data_frame['Date'].tolist())), constraints=({'type': 'eq', 'fun': lambda *args: norm_alphas(*args), 'args': [number_of_teams]}))

This will give us all the parameters needed for our model so we can easily calculate the probability of each possible score for any future match. If we have a matrix `game_probabilities`

of all possible match results (with goals of home team in rows and goals of away team in columns) then the following simple aggregation gives us the probabilities of our 1×2 outcomes.

win_probability = sum(sum(np.tril(game_probabilities, -1))) # triangle-lower for home win draw_probability = game_probabilities.trace() # diagonal for draw loss_probability = sum(sum(np.triu(game_probabilities, 1))) # triangle-upper for home loss

### Betting strategy

When I mentioned the expected value approach to betting let me stress out that a good betting strategy is not about correctly predicting the match results. It is about finding opportunities where our probability is higher than the bookie’s probability implied by the odds. By definition these opportunities are typically present in outcomes that have low probability. So we do not expect to bet on favourites where the odds are very low but rather on outsiders.

How much money to bet and when to actually place the bet is called a betting strategy. We decided to try three different methods to calculate the size of a bet:

- Fixed: betting a fixed amount of money scale_constant on each match
- Kelly’s [3]: bankroll * ((probability * odds – 1) / (odds – 1)), where bankroll is simply the overall budget, which reflects the profit and losses over time
- Variance-adjusted: min(bankroll, scale_constant / (2 * odds * (1 – probability)))

The threshold for placing a bet shows how risky bets we want to make. The more bets we make the closer we should theoretically get to the expected value. And because the number of accepted bets declines with the growing threshold the final payoff becomes more volatile.

### What works best

At this stage we have all the building blocks needed so the next step is to give it a try. We decided to test our solution on the Premier League because it was the first data set on http://www.football-data.co.uk/. To determine, which prediction method and which betting strategy should be used we run a grid search, which simply tried all specified parameter settings and stored how well they performed. The grid consisted of following 4 800 combinations.

MODEL_FITTING_GRID = {"method":['Dixon-Coles', 'Maher'], "epsilon": append(arange(.0005, .00225, .00025).tolist(), [0.0])} BETTING_GRID = {"betting_strategy": ['fixed_bet', 'Kelly', 'variance_adjusted'], "bet_threshold": arange(1.05, 1.55, .005)}

We tested this grid using a rolling-prediction over season 2015/2016 when taking data back to 2013 and considering also The Championship (second highest league in England) to include also matches of teams that had been promoted or relegated in time. The parameter we wanted to maximize is a simple return on investment (ROI) so in our case (total payoffs – total bets) / total bets.

The winning combination turned out to be time variant Dixen-Coles model with epsilon equal to 0.002, Kelly’s betting strategy and bet threshold of 1.1. Truth be told the bet threshold was higher but we wanted to place more bets.

So how is this combination working in the current season? (click the image to open interactive Tableau dashboard)

We can see that we started off really well but then got hit by a series of unsuccessful rounds when we lost almost 50% of our initial budget. Few recent rounds have put us back in the game but still way far from “easily making money”.

When I mentioned this exercise to my college friends, who have been meddling with sports betting, they told me that 1×2 bets on Premier League on Bwin is the hard mode of sports betting. So if you really want to give it a try don’t follow us and try to find some niche league and bet for the number of goals for instance.

It the light of that information we tried our approach on few other leagues, namely German Bundesliga, French Ligue 1, Belgian Jupiler League and Portuguese Primeira Liga. Again we run the grid search over the past and chose the combination of parameters with the highest ROI and with a reasonable number of bets.

The best combination for Bundesliga was time-variant Maher’s model with epsilon equal to 0.00075, Kelly’s betting strategy and bet threshold 1.47. This had a ROI of 0.21. And the performance in the current season so far?

For the Ligue 1 we chose time-variant Maher’s model with epsilon being 0.00175, Kelly’s betting strategy and threshold of 1.41 with a 0.54 ROI in season 2015/2016.

For Belgian Jupiler League we decided for time-variant Maher’s model with epsilon being 0.00175, Kelly’s betting strategy and threshold of 1.44 with 0.19 ROI in season 2015/2016.

And finally the Primeira Liga with time-variant Maher, epsilon of 0.00225, Kelly’s betting strategy and threshold of 1.11, which was losing the least in 2015/2016 – ROI of -0.09.

### Final thoughts

As we can see the models can win some money even though with a huge volatility. That is positive news given we only tried the basic statistic approaches. In the future one might play with some machine learning approach, add more predictors into the model – e.g. weather or even data for individual players.

On the other hand we haven’t dealt with automatically placing the bets in higher volumes, which is often a very tricky part in this business if you don’t want to spend hours by making bets.

So all-in-all despite all the fun we had with this we’ll probably stick to our job at the moment. What about you, won’t you give it a try? 🙂

Big thanks goes to my friends Martin and Michal from aLook Analytics who turned my simple Python scripts into a fully automated betting tool.

I would also like to mention three great online sources of information and data:

- Historical football data http://www.football-data.co.uk/
- Short paper by H. Langseth with all important knowledge https://www.idi.ntnu.no/~helgel/papers/LangsethSCAI14.pdf
- Great blog about sports betting and many more http://opisthokonta.net/

### References

[1] Mike J. Maher. Modelling association football scores. Statistica Neerlandica, 36(3):109–118, 1982

[2] Mark J. Dixon and Stuart G. Coles. Modeling association football scores and inefficiencies in the football betting market. Applied Statistics, 46:265–280, 1997

[3] John L. Kelly. A new interpretation of information rate. IRE Transactions on Information Theory, 2(3):185–189, 1956

[4] Designed by Freepik

Jean-JacquesHey congrats about this exciting article!

It seems that the model you have implemented does not take into account the home field advantage. How could you integrate this into the code? I tried to use your code to predict some football scores but there are some parts I missed (about the x0).

Do you have published the full code on Github or any similar platform?

Thanks!

LikeLike

adamvotavaPost authorHi,

thanks for your interest!

Home field advantage is included in the model. See “Number of goals scored by home team (Xij) is driven by attacking strength of the home team αi, defensive strength of the away team βj, and some home field advantage 𝛾.”

x0 is just an array with initial parameter estimates for the optimisation. You can for example start with

`x0 = [1 for _ in range(3*number_of_teams + 1)]`

, i.e. all parameters equal to 1. See scipy.optimize.minimize.Let me know if that helped.

Adam

LikeLike

Jean-JacquesThanks you for this clarification. I read an other paper which helps me a lot in the implementation (http://projekter.aau.dk/projekter/files/14466581/AssessingTheNumberOfGoalsInSoccerMatches.pdf).

I still have several questions about the log-likelihood function. Regarding how the minimization function works, sometimes the input values can generate a negative tau weight. Thus, the log is not defined in this case. How do you deal with this?

Btw, did you try to tune the x0 vector in order to fit better the function? What about the time weight function?

Many thanks for you answer!

LikeLike

adamvotavaPost authorHi Jean-Jacques,

So far I’ve not run into tau function returning some negative number so I’m not dealing with it in any way :). Also I use the last element of x0 (rho) set to .03, contrary to what I wrote in my earlier comment (sorry for that). Maybe that’s why it doesn’t get negative.

The role of x0 is just to give the optimisation a starting point. So if you’d be smarter about it you can get the optimisation to converge faster.

I believe that the time weight function can be improved. I simply used a standard exponential forgetting as a starting point.

Good luck with your model. Let me know how it goes :).

A.

LikeLike

Jean-JacquesOk it’s very strange because for games with low score (0:0 for example), I can have negative values for tau. I’ll investigate on this. On some others articles, I read that exponentiating the variables can help the solver. Are you doing sorcery like this? 🙂

On the code snippet you gave, you put a reference to value[‘lambda’] but you did not mentionned how lambda is computed. Is it only the product of 𝛾, αi and βj, the contained in the array argument “sent” by op.minimize?

Thanks again for you precious reply, I really help me a lot.

LikeLike

adamvotavaPost authorYes, 𝜆 = 𝛾 · αiβj and 𝜇 = αjβi

LikeLike