Let’s talk about win probability. Along with metrics and graphics like attacking threat and the momentum tracker, it is part of a new slew of analytics-driven content added to the Premier League broadcast experience. In case you aren’t familiar with it, win probability is the set of percentages showing the likelihood of the three possible outcomes (W/L/D).
I personally like them a lot. To the average viewer, they’re understandable, intuitive, and easy to digest.
In this blog post, we will look at how to build one of those models. Our job is made easier by the fact that there’s already some existing public work on the topic - 1,2,3 by StatsPerform, KU Leuven, and American Soccer Analysis (ASA) respectively. For this blog post, we will try to implement the one by ASA. The methodology (explained by Tyler Richardett) is detailed while also being simple to implement and understand. If you haven’t already, I’d suggest reading through the blog post at least once before coming back here.
The data used was the entirety of the 2017/18 Premier League season from the public Wyscout dataset on figshare. We need the event data to calculate some metrics like the game flow and the pre-match team strength metrics (both calculated using “expected threat” or xT).
I broke down the code into three separate notebooks. They are meant to be run in sequence. The first two notebooks take care of the preprocessing and feature-building parts while the third one contains the modelling and simulation bits.
Before diving into the code, a quick detour to discuss win probability and make sure we are all on the same page. Win probability is the estimated probability of each possible result (either team winning, or a draw) at any given game state, including before the match begins.
Our intuition tells us that a few different factors affect this win probability - like current scoreline, team strengths, the flow of the game, minutes remaining, etc.
Using these factors, we want to estimate the probability of both teams scoring in the remaining minutes. Once we have those, we can use these values to simulate goals, tally them with the current scoreline and get our W/L/D probabilities. This is best explained using an example (picked up directly from the blog):
“…let’s say FC Tucson leads the Richmond Kickers 2–1 with 40 minutes of play remaining. And based on a number of features detailed below, our model tells us that Tucson has a constant 0.7% probability of scoring across each of those 40 minutes, and that Richmond has a 3.8% probability. So, we flip a weighted coin for 40 trials over 10,000 simulated games, tally those respective results, and find that Tucson wins 2,394 times, the two teams draw another 2,701 times, and Richmond wins 4,905 times. In turn, we predict that the Kickers have a 49% chance of overcoming the 2–1 deficit and winning the game.”
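Here is a minimal sketch of that "weighted coin" simulation, assuming a constant per-minute scoring probability for each team as in the quote. The function and argument names are my own, and summing 40 coin flips is done with a single binomial draw per simulated game. The tallies depend on the random seed and are not meant to reproduce the quoted figures.

```python
import numpy as np

def simulate_outcomes(p_a, p_b, goals_a, goals_b, minutes_left,
                      n_sims=10_000, seed=0):
    """Simulate the rest of a match and return W/D/L shares for team A."""
    rng = np.random.default_rng(seed)
    # One weighted coin flip per remaining minute; the number of heads in
    # `minutes_left` flips is a binomial draw, one draw per simulated game.
    extra_a = rng.binomial(minutes_left, p_a, size=n_sims)
    extra_b = rng.binomial(minutes_left, p_b, size=n_sims)
    final_a = goals_a + extra_a
    final_b = goals_b + extra_b
    return {
        "a_win": float(np.mean(final_a > final_b)),
        "draw": float(np.mean(final_a == final_b)),
        "b_win": float(np.mean(final_a < final_b)),
    }

# Team A leads 2-1 with 40 minutes left, scoring probabilities from the quote.
probs = simulate_outcomes(p_a=0.007, p_b=0.038, goals_a=2, goals_b=1,
                          minutes_left=40)
print(probs)
```

The three shares always sum to 1, since every simulated game ends in exactly one of the three outcomes.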
Here’s the full list of features that we will use to fit our model:
Steps to cover:
The data is all in .json format. Load it into a pandas dataframe.
Separate out the tags dictionary into a usable column.
Use tags to find markers like goals, red cards, yellow cards, to get our game state features.
Get xT values for our successful passes.
To get our expected threat values, we will be using the raw xT values shared by Karun Singh. The data is already in the GitHub repository if you’re cloning the repo.
We need xT for three metrics - avg_team_xt, avg_opp_xt, and running_xt_differential. It is also important to know that the wyscout data only has passes and not carries so we will only have xT values for passes (it is possible to impute carries from the data by looking at the difference between successive events but for the first run, I decided to skip that).
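A rough sketch of the xT lookup for a single pass is below. It assumes the xT grid is a 2D array (8 rows by 12 columns is assumed here), that pass coordinates are on a 0-100 scale in both directions, and that the value of a pass is the xT of its end zone minus the xT of its start zone. The toy grid is made up purely for illustration; the real values come from the shared xT file.

```python
import numpy as np

def xt_for_pass(xt_grid, start, end):
    """xT added by a pass: value of the end zone minus value of the start zone."""
    n_rows, n_cols = xt_grid.shape

    def zone_value(x, y):
        # Map 0-100 coordinates to grid indices; clamp so x=100 or y=100
        # falls in the last zone instead of going out of bounds.
        col = min(int(x / 100 * n_cols), n_cols - 1)
        row = min(int(y / 100 * n_rows), n_rows - 1)
        return xt_grid[row, col]

    return zone_value(*end) - zone_value(*start)

# Toy grid (NOT the real xT values): threat rises linearly towards the goal.
toy_grid = np.tile(np.linspace(0.0, 0.11, 12), (8, 1))

pass_xt = xt_for_pass(toy_grid, start=(10, 50), end=(90, 50))
print(round(pass_xt, 3))
```

Applied row-wise to the successful passes, this gives the per-pass xT column that the later features are built from.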
For the game state metrics - like red cards or goals - we need to use the wyscout tags. The tag ids and their descriptions are in the tags2name.csv file from the extracted data. For example, 1702 is a yellow card, 101 is a goal, 102 is an own goal and so on.
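As a sketch of the tag handling, assuming each event carries a `tags` list of `{"id": ...}` dictionaries: the code below flattens that list into a plain list of ids and then derives boolean flag columns. The column names are hypothetical, the toy events are invented, and only the three tag ids mentioned above are mapped; 1801 appears only as an example of some other tag.

```python
import pandas as pd

# Toy events in the assumed shape of the raw data.
events = pd.DataFrame({
    "eventName": ["Pass", "Shot", "Foul"],
    "tags": [
        [{"id": 1801}],
        [{"id": 101}, {"id": 1801}],
        [{"id": 1702}],
    ],
})

# Flatten the list of dicts into a list of integer tag ids.
events["tag_ids"] = events["tags"].apply(lambda tags: [t["id"] for t in tags])

# Tag ids taken from tags2name.csv as described in the text.
TAG_FLAGS = {"is_goal": 101, "is_own_goal": 102, "is_yellow_card": 1702}

for column, tag_id in TAG_FLAGS.items():
    # Bind tag_id as a default argument so each lambda keeps its own value.
    events[column] = events["tag_ids"].apply(lambda ids, t=tag_id: t in ids)

print(events[["eventName", "is_goal", "is_own_goal", "is_yellow_card"]])
```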
We will loop over all the matches, perform the steps from above, and save them all as .csv files in our processed-data folder.
Main steps to cover:
Aggregate n-previous xt values for both teams from every match to get our pre-match team strength metrics.
Calculate rolling xT difference per minute (game flow proxy).
Get remaining minutes and time intervals.
Derive the target - probability of scoring in the remaining minutes.
In this notebook, we will use the processed csv file for each match and build the remaining features. For our team strength indicators, we will take an average of the xt accumulated by the teams in their last 4 matches. To do this, an efficient (if not the most elegant) way is to first loop over all matches, calculate the xt accumulated by both teams and then save the result in a JSON file.
This is what was done in the following code cell.
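The original cell is not reproduced here, so what follows is only a sketch of the idea in pandas rather than the JSON-based loop described above. It assumes one row per team per match with the total xT accumulated in that match (column names are hypothetical), and it shifts by one match so the average only uses matches played before the current one. Using fewer than 4 matches early in the season (`min_periods=1`) is also my assumption.

```python
import pandas as pd

# Toy data: total xT per team per match, in playing order.
match_xt = pd.DataFrame({
    "team": ["A"] * 6 + ["B"] * 2,
    "match_no": [1, 2, 3, 4, 5, 6, 1, 2],
    "xt": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 20.0],
})

match_xt = match_xt.sort_values(["team", "match_no"]).reset_index(drop=True)

# shift(1) drops the current match from its own average, then the rolling
# window averages up to the 4 previous matches of the same team.
match_xt["avg_team_xt"] = match_xt.groupby("team")["xt"].transform(
    lambda s: s.shift(1).rolling(4, min_periods=1).mean()
)

print(match_xt)
```

A team's first match has no history, so its value is NaN and needs either a fill value or to be dropped before modelling.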
After this, we again loop over all the matches, and build the remaining features. Note that the time intervals column here comes from the blog post itself where Tyler has divided each match into 100 total intervals - this helps to deal with the variable gametime for each match.
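One simple way to get the 100 intervals is sketched below. It is not necessarily identical to the ASA implementation: it assumes a hypothetical `match_sec` column holding seconds elapsed since kick-off across both halves, and scales it by the total match length so that every match ends at interval 100.

```python
import numpy as np
import pandas as pd

def add_time_intervals(events, n_intervals=100):
    """Map elapsed seconds to intervals 1..n_intervals, whatever the match length."""
    total_sec = events["match_sec"].max()
    fraction_played = events["match_sec"] / total_sec
    # Round up to the interval the event falls in; clip so kick-off is interval 1.
    interval = np.ceil(fraction_played * n_intervals).clip(1, n_intervals).astype(int)
    return events.assign(
        time_interval=interval,
        intervals_remaining=n_intervals - interval,
    )

# Toy match lasting 5700 seconds (95 minutes including stoppage time).
toy_events = pd.DataFrame({"match_sec": [0.0, 2850.0, 5700.0]})
toy_events = add_time_intervals(toy_events)
print(toy_events)
```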
Our target variable is the probability of scoring in each of the remaining time intervals. I had initially used a binary outcome variable (whether a goal was scored or not after this stage). However, after asking Tyler, I realized that his way of generating the targets was slightly different but much better: the target is simply the number of goals still to be scored divided by the number of intervals left.
To illustrate, if we are at time interval 60/100 and we know FC Tucson are going to score another 2 goals in the remaining intervals, the probability at that point would be 2/40 or 0.05. If they score one at interval 65, then at 70/100 it would be 1/30 or about 0.033.
(Per my earlier understanding, the binary target variable at time interval 60 would be 1 - indicating that “yes, they do score a goal after this stage” and would continue to be 1 until they have scored both their remaining goals, at which stage it would become 0 - indicating that no further goals were scored after that interval)
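The two target definitions can be compared with a small sketch. The function name and columns are hypothetical, and the second goal at interval 80 is an invented detail added to complete the example.

```python
import pandas as pd

def scoring_targets(intervals, goal_intervals, n_intervals=100):
    """Compare the rate target with the earlier binary target at each interval."""
    rows = []
    for t in intervals:
        # Goals the team still scores strictly after interval t.
        goals_left = sum(g > t for g in goal_intervals)
        intervals_left = n_intervals - t
        rate = goals_left / intervals_left if intervals_left > 0 else 0.0
        rows.append({
            "time_interval": t,
            "goals_left": goals_left,
            "target": rate,                 # goals remaining / intervals left
            "binary_target": int(goals_left > 0),
        })
    return pd.DataFrame(rows)

# Goals scored at intervals 65 and 80 (the second one is made up).
targets = scoring_targets(intervals=[60, 70, 85], goal_intervals=[65, 80])
print(targets)
```

At interval 60 the target is 2/40 = 0.05 and at interval 70 it is 1/30, while the binary version is 1 in both cases and only drops to 0 once both goals have gone in.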
After finishing all the steps for all our matches, we save the data as one large .csv file and proceed to the exciting stuff - modelling.
Main steps to cover:
Group the data from the previous notebook so that we have a single observation per minute.
Split the data and fit the model.
Some evaluation graphs and a sanity check plot.
The blog uses a single observation per minute of the match. Our final data from the previous notebook has multiple records per minute so our first step is to perform a pandas groupby and condense the data. After performing that, we have around 60k observations in total.
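A sketch of that condensing step is below, assuming hypothetical column names and assuming that keeping the last record within each minute is an acceptable way to summarise it (the notebook may aggregate differently).

```python
import pandas as pd

# Toy event-level data: two records in each of the first two minutes.
features = pd.DataFrame({
    "match_id": [1, 1, 1, 1],
    "team_id": [10, 10, 10, 10],
    "minute": [1, 1, 2, 2],
    "event_sec": [5.0, 40.0, 70.0, 110.0],
    "score_differential": [0, 0, 0, 1],
    "running_xt_differential": [0.01, 0.03, 0.02, 0.05],
})

# Sort by time so that .last() picks the final state reached in each minute.
per_minute = (
    features.sort_values(["match_id", "team_id", "event_sec"])
    .groupby(["match_id", "team_id", "minute"], as_index=False)
    .last()
)

print(per_minute)
```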
At this point we are ready to fit a first model. We batch out our train and test splits and fit a regression model.
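One option for the split, sketched below, is to hold out whole matches rather than individual rows, so that minutes from the same match never appear in both train and test. This is my assumption about a sensible split, not a description of how the notebook does it; the `match_id` column name and the 20% test share are placeholders.

```python
import numpy as np
import pandas as pd

def split_by_match(df, test_frac=0.2, seed=42):
    """Hold out a random share of matches (not rows) as the test set."""
    rng = np.random.default_rng(seed)
    match_ids = df["match_id"].unique()
    rng.shuffle(match_ids)
    n_test = max(1, int(round(len(match_ids) * test_frac)))
    test_ids = set(match_ids[:n_test])
    is_test = df["match_id"].isin(test_ids)
    return df[~is_test], df[is_test]

# Toy data: 10 matches with 3 per-minute rows each.
toy = pd.DataFrame({
    "match_id": np.repeat(np.arange(10), 3),
    "minute": np.tile([1, 2, 3], 10),
})

train_df, test_df = split_by_match(toy)
print(train_df["match_id"].nunique(), test_df["match_id"].nunique())
```

The resulting frames can then be separated into feature and target columns and passed to whichever regressor is being fitted.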
Notice that the time interval column is not part of the feature set. The model predictions seem to be more consistent without it; I suspect that including it introduces target leakage, which hinders the model’s ability to learn useful relationships. It wasn’t present in ASA’s model either. We will still need the column right afterwards, though, to simulate the final outcomes.
For a given match, once we have a goal-scoring probability for each time interval, we can “flip a weighted coin and tally the results”. Once we get our tallies for all the time intervals, we can draw a stack plot to visually get an idea of the model’s performance.
The methodology has some limitations at the moment. I’ve listed them below, along with ideas for potential improvements.
Figuring out a better validation and evaluation strategy
One of the things I glossed over was the evaluation metric used in the blog - Ranked Probability Score (RPS). It is ideal for evaluating models predicting match results where the possible outcomes can only be win, draw, or loss. While I was successful in implementing the loss function, I couldn’t manage to use it as a custom function while training the model. It is important to note that without this, we have no real way of knowing if the model we just trained is good or bad besides simply plotting the graphs from above and visually checking. LightGBM uses l2 loss by default for regression but it is not ideal for our case.
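For reference, here is a minimal standalone sketch of RPS for a single match, written as a plain evaluation function rather than a training loss. It assumes the outcomes are given in a fixed order (win, draw, loss) and uses the usual definition: the mean of the squared differences between the cumulative predicted and cumulative observed distributions.

```python
import numpy as np

def ranked_probability_score(probs, outcome_index):
    """RPS for one match; 0 is a perfect forecast, 1 is the worst possible."""
    probs = np.asarray(probs, dtype=float)
    observed = np.zeros_like(probs)
    observed[outcome_index] = 1.0
    # Cumulative differences; the last one is always 0, so it is dropped.
    cumulative_diff = np.cumsum(probs - observed)[:-1]
    return float(np.sum(cumulative_diff ** 2) / (len(probs) - 1))

# Forecast of 50% win, 30% draw, 20% loss, and the team actually wins.
rps = ranked_probability_score([0.5, 0.3, 0.2], outcome_index=0)
print(rps)
```

Averaging this score over all matches in a held-out set gives a single number for comparing models, even if the model itself is still trained with the default L2 loss.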
Training on more data
Also, we only trained the model on one season of the Premier League. Training the model on more seasons, and more leagues, is likely to be the best way of improving it. The wyscout dataset contains more leagues, and seasons, so this is doable. Tyler told me that they used a total of 7500 matches to train their final model - so it is very likely there’s some model performance that we’re leaving on the table by training on only 340 matches!
Getting xT for carries or using an action-valuing model like VAEP or g+
One final way of improving the model performance is to use g+ or some other action-valuing model instead of a ball progression model like xT. The team strength indicators are very important to the model’s performance, as shown by the feature importance plot above, so using a more granular action-valuing model might further improve the model predictions.
Huge thanks to Tyler Richardett himself for his post which inspired this entirely as well as helping me with all my questions. Huge thanks also to Sushruta Nandy (again!) and Will Savage (first time!) for reading the first draft and providing some really helpful feedback!