# eTOI: Estimated Time On Ice

### Introduction

From the time I began this site, the question I've gotten most often is "How do you calculate estimated time on ice?" Up until now I unfortunately haven't had much of an answer. When I first started the site three years ago, Stephen Burtch reached out to help by providing a formula that could be used for time on ice estimates, and until now that's what I've used. That formula has worked well, but I've always wanted to make one of my own so that I can speak to it better. Here I'll walk through my process of coming up with a new way of calculating estimated time on ice. As a companion to this, the code that I used to perform all steps of this analysis is available on GitHub.

### Simple Linear Regression

The formula I was given uses the percent of total goals that a player was on the ice for (we will call this Goals_On%) to approximate the percent of time that the player was on the ice for. This takes the form of the following equation, where y is TOI% and x is Goals_On%.

$$y = mx + b$$

Once y is found, you simply multiply that TOI% by the total amount of time that the team played, and you have an approximation of the total amount of time that the given player was on the ice.

Now you may have noticed that we know what x in this equation is, but we'll need values for m and b in order to solve for y. Also, the formula is recognizable as the equation of a straight line, so what we're looking to do is find the equation that describes a linear relationship between our x, Goals_On%, and the y, TOI%. To find this equation, we can use a linear regression where we use Goals_On% as our predictor and TOI% as our target value.
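As a small illustration of how the straight line turns into an ice time estimate, here is a minimal sketch. The function name and the slope and intercept values are placeholders chosen for the example, not the fitted coefficients, and the team total assumes 82 games of 60 minutes with no overtime.

```python
def estimate_toi(goals_on_pct, team_toi_minutes, m, b):
    """Estimate a player's time on ice from the share of goals they were on for.

    goals_on_pct and the intermediate TOI% are fractions (0.35 means 35%).
    m and b are the slope and intercept of the fitted line.
    """
    toi_pct = m * goals_on_pct + b  # y = m*x + b
    return toi_pct * team_toi_minutes


# Placeholder coefficients for illustration only (not the fitted values).
example_m, example_b = 0.8, 0.05
# Assumed team total: 82 games of 60 minutes, ignoring overtime.
team_minutes = 82 * 60
etoi = estimate_toi(0.35, team_minutes, example_m, example_b)
print(round(etoi, 1))
```

With real coefficients from the regression, the same two steps (evaluate the line, then scale by team minutes) give the player's estimated TOI.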

In order to run this linear regression, we will need data where both Goals_On% and TOI% are known. Unfortunately, the leagues tracked on this site do not provide TOI data, so we can't use their data for our regression and will instead need to use NHL data. In doing this we are assuming that the ice time distribution in the NHL is comparable to the distribution in the leagues that we will eventually apply this formula to. The NHL and all leagues on this site allow the same number of players on the ice at one time and the same number of players to dress for a game, so this is a mostly safe assumption to make. There are likely some differences in ice time distribution, but I don't think they're significant enough to make this a worthless endeavor.

The NHL data I'll be using is regular season player data I've gathered from Corsica.hockey on all players from 2012-13 through 2017-18. Goals_On% is not directly listed on the site, so I had to do some calculations to find a value for it. Additionally, I chose to only include records for players with at least 10 games played in that season, since players with small samples are likely to yield extreme results. This dataset was randomly split so that 80% of the data is used for training the models and the remaining 20% is held aside and only used in the final step to validate the selected model.
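The data preparation described above can be sketched roughly as follows. This is not the code from the GitHub repository: the field names are hypothetical, and the Goals_On% definition used here (on-ice goals for plus against, divided by the team's goals for plus against) is my reading of "percent of total goals", so treat it as an assumption.

```python
import random


def goals_on_pct(row):
    """Share of the team's total goals (for and against) the player was on the ice for."""
    on_ice = row["on_ice_gf"] + row["on_ice_ga"]
    total = row["team_gf"] + row["team_ga"]
    return on_ice / total


def prepare_split(rows, min_gp=10, train_share=0.8, seed=0):
    """Drop small samples, add Goals_On%, and randomly split into train/test lists."""
    kept = [dict(r, goals_on_pct=goals_on_pct(r)) for r in rows if r["gp"] >= min_gp]
    rng = random.Random(seed)
    rng.shuffle(kept)
    cut = int(len(kept) * train_share)
    return kept[:cut], kept[cut:]


# Tiny made-up player seasons, purely to exercise the functions.
rows = [
    {"player": f"P{i}", "gp": gp, "on_ice_gf": 20 + i, "on_ice_ga": 18 + i,
     "team_gf": 240, "team_ga": 230}
    for i, gp in enumerate([82, 5, 60, 41, 9, 77, 12, 30, 70, 66, 10, 55])
]
train, test = prepare_split(rows)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, which matters because the held-out 20% should stay untouched until the final validation step.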

First, I use a simple linear regression between Goals_On% and TOI%. The result is a fitted line of the form TOI% = m × Goals_On% + b. With this fit, if a player is on for 35% of their team's goals, we would estimate their TOI% to be 32.12%. We can then take that TOI% and multiply it by the total amount of time their team played to get the player's estimated TOI. So how good of an estimate is this? To check how well this model works, I use five-fold cross validation and look at the mean root mean squared error (RMSE) across the five folds. For this approach the mean RMSE is 75.07, meaning that predicted TOI values from a linear regression of this form are typically about 75 minutes off the actual TOI value.
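The evaluation step above can be sketched in plain Python as follows. This is an illustrative sketch on synthetic data, not the analysis code: the helper names, the data, and the assumed team total of 82 × 60 minutes are all made up for the example. The error is measured in minutes by scaling both the predicted and actual TOI% by team minutes.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b with one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def cv_rmse_minutes(rows, k=5, seed=0):
    """Mean RMSE (in minutes of TOI) across k folds.

    Each row is (goals_on_pct, toi_pct, team_toi_minutes).
    """
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    fold_rmses = []
    for i, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        m, b = fit_line([r[0] for r in train], [r[1] for r in train])
        sq_errors = [((m * x + b) * team - y * team) ** 2 for x, y, team in held_out]
        fold_rmses.append(math.sqrt(sum(sq_errors) / len(sq_errors)))
    return sum(fold_rmses) / k


# Synthetic player seasons: TOI% follows a line plus noise (made-up numbers).
rng = random.Random(1)
data = []
for _ in range(200):
    x = rng.uniform(0.05, 0.5)
    y = 0.8 * x + 0.04 + rng.gauss(0, 0.02)
    data.append((x, y, 82 * 60))
mean_rmse = cv_rmse_minutes(data)
print(round(mean_rmse, 2))
```

Each fold is fit only on the other four, so the reported number reflects error on seasons the line never saw.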

To put this in perspective, let's look at the old model. The old eTOI model had an RMSE of 73.73, compared to 75.07 for our new model, so overall the old model and this simple linear regression produce pretty similar results. However, there are issues with this approach that we can improve upon in order to get a better result. The first is that forwards and defense are grouped together. This is an obvious issue, since with four forward lines and only three defense pairs there are bound to be differences in their ice time distribution. To fix this, we can split the two and make one model for forwards and one for defense.

This gives us one fitted equation for forwards and another for defense. The mean RMSE for these models, with comparisons to the old model, is as follows.

| Split | RMSE | Old RMSE |
| --- | --- | --- |
| Forwards | 52.06 | 67.25 |
| Defense | 64.68 | 84.62 |

So already we are seeing some notable improvements by accounting for additional factors in our model. Let's see what else can be done to potentially further improve our estimates.
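Splitting by position amounts to fitting the same kind of line separately on each group. A minimal sketch, with made-up rows and hypothetical helper names:

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b with one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def fit_by_position(rows):
    """Fit one line per position group. Each row is (position, goals_on_pct, toi_pct)."""
    groups = defaultdict(list)
    for position, x, y in rows:
        groups[position].append((x, y))
    return {
        position: fit_line([x for x, _ in pts], [y for _, y in pts])
        for position, pts in groups.items()
    }


# Made-up rows: in this toy data, defense get more ice time for the same Goals_On%.
rows = [("F", x / 100, 0.70 * x / 100 + 0.03) for x in range(10, 50, 5)]
rows += [("D", x / 100, 0.85 * x / 100 + 0.05) for x in range(10, 50, 5)]
models = fit_by_position(rows)
for position, (m, b) in models.items():
    print(position, round(m, 3), round(b, 3))
```

At prediction time, the player's position selects which pair of coefficients to apply.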

##### IPP%

The next thing that may be worth looking at is the impact of point production on predicting time on ice. Since we are looking to predict TOI as a percentage and not a total, we will need a point production metric that is represented as a percentage or as a rate. If we used a raw total, our model wouldn't be able to scale to different values of games played. First, we will use IPP% (the percent of the goals for that a player was on the ice for on which they recorded a point) rather than the sum of their points.

Including IPP% in our model gives us a new equation for forwards and another for defense. The results for these models are as follows.

| Split | RMSE |
| --- | --- |
| Forwards | 51.69 |
| Defense | 64.37 |

We can see both in the coefficients and the mean RMSE results for this model that IPP% is not a significant factor in predicting TOI%. It did improve the model a bit, but I'm not even convinced that the slight improvement seen is worth the added complexity of including it in our model.

Just to be sure, I decided to not just look at points in the aggregate but also look at the effects of goals, primary assists, and secondary assists on their own.

##### IGP%

F model:

D model:

##### IA1P%

F model:

D model:

##### IA2P%

F model:

D model:

| Split | IGP% RMSE | IA1P% RMSE | IA2P% RMSE |
| --- | --- | --- | --- |
| Forwards | 52.02 | 51.86 | 52.1 |
| Defense | 64.61 | 64.54 | 64.73 |

Here we see that none of these metrics offered great improvements in predicting TOI%. But since we're predicting TOI% as a percentage of the team's TOI, maybe we shouldn't be looking at the percent of the goals for a player was on the ice for that they had a point on. Maybe we should instead look at the percent of the team's total goals for that the player had a point on. For this we will use TPP%, TGP%, TA1P%, and TA2P%.
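The difference between the individual and team versions of these metrics is only the denominator. A short sketch with made-up numbers (the function and variable names are mine):

```python
def ipp(points, on_ice_gf):
    """Individual point percentage: points per goal for the player was on the ice for."""
    return points / on_ice_gf if on_ice_gf else 0.0


def tpp(points, team_gf):
    """Team point percentage: points per goal for scored by the team overall."""
    return points / team_gf if team_gf else 0.0


# Made-up season: 50 points, on the ice for 80 goals for, team scored 250.
player_points, on_ice_goals_for, team_goals_for = 50, 80, 250
print(ipp(player_points, on_ice_goals_for))  # 0.625
print(tpp(player_points, team_goals_for))    # 0.2
```

Swapping points for goals, primary assists, or secondary assists in the numerator gives the IGP%/TGP%, IA1P%/TA1P%, and IA2P%/TA2P% variants in the same way.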

##### TPP%

F model:

D model:

##### TGP%

F model:

D model:

##### TA1P%

F model:

D model:

##### TA2P%

F model:

D model:

| Split | TPP% RMSE | TGP% RMSE | TA1P% RMSE | TA2P% RMSE |
| --- | --- | --- | --- | --- |
| Forwards | 51.93 | 52.13 | 51.8 | 52.17 |
| Defense | 64.24 | 64.64 | 64.48 | 64.7 |

Here we see that none of these point production metrics offered a significant improvement over the simple linear regression using only Goals_On%. This is somewhat understandable, since point production is a result of both ability and opportunity. TOI%, which we're looking to predict, is more a result of opportunity than ability, so point production muddies the waters: for this task, player ability is a source of noise in the dataset. However, we can look at a different metric that is less impacted by ability than point production is: shot generation. Here we will calculate TShP% as the percent of the team's total shots for that were taken by the player.

##### TShP%

F model:

D model:

| Split | RMSE |
| --- | --- |
| Forwards | 50.68 |
| Defense | 62.71 |

We can see here that TShP% offers a noteworthy improvement in RMSE for both the forward and defense models. This is an improvement on the model using only Goals_On% and on the models using the different point production metrics, so this metric is capturing some information about the player's TOI% that the other approaches were unable to.
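Adding TShP% turns the model into a regression with two predictors. One way to sketch that fit without any libraries is to solve the normal equations directly. The data below is made up and generated from known coefficients, and the function names are hypothetical, so this only illustrates the mechanics rather than the actual fitted models.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def fit_two_predictors(goals_on, shot_share, toi_pct):
    """Least squares fit of toi_pct = b0 + b1*goals_on + b2*shot_share."""
    design = [[1.0, g, s] for g, s in zip(goals_on, shot_share)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(design, toi_pct)) for i in range(3)]
    return solve(xtx, xty)


# Made-up data generated from known coefficients so the fit can be checked by eye.
goals_on = [0.10, 0.20, 0.30, 0.40, 0.25, 0.35, 0.15, 0.45]
shot_share = [0.02, 0.05, 0.04, 0.09, 0.07, 0.03, 0.06, 0.08]
toi_pct = [0.05 + 0.6 * g + 0.9 * s for g, s in zip(goals_on, shot_share)]
b0, b1, b2 = fit_two_predictors(goals_on, shot_share, toi_pct)
print(round(b0, 3), round(b1, 3), round(b2, 3))
```

The prediction for a player is then b0 + b1 × Goals_On% + b2 × TShP%, scaled by team minutes as before.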

In addition to these metrics that look at point production as a percentage value we will look at metrics that look at it as a rate. Since we don't know TOI we obviously can't use per 60 metrics for our model but we can look at per game metrics. For this we will look at G/GP, A1/GP, A2/GP, P/GP, and Sh/GP.

##### P/GP

F model:

D model:

##### G/GP

F model:

D model:

##### A1/GP

F model:

D model:

##### A2/GP

F model:

D model:

##### Sh/GP

F model:

D model:

| Split | P/GP RMSE | G/GP RMSE | A1/GP RMSE | A2/GP RMSE | Sh/GP RMSE |
| --- | --- | --- | --- | --- | --- |
| Forwards | 51.95 | 52.12 | 51.82 | 52.16 | 50.9 |
| Defense | 64.27 | 64.65 | 64.49 | 64.66 | 62.64 |

Here we see comparable results to the TPP%, TGP%, TA1P%, TA2P%, and TShP% metrics. Out of all the models so far, the one I prefer and will use going forward is the model using TShP%. I choose this model over the similarly performing model with Sh/GP because TShP% is a percentage of team involvement, so there are some nice symmetries between it and TOI%. You could opt to use Sh/GP instead, and it should yield similar results.

### Huber Linear Regression

So far, all the models we've been using have been ordinary least squares linear regressions. Ordinary least squares regression can be useful in many contexts, although there are some downsides to it that should be considered. One of these drawbacks is that in datasets containing outliers, the line of best fit can end up being pulled towards the outliers and away from the majority of the data, which is normally distributed. There are several robust linear regression approaches that have been designed to address this problem, and here we'll be using the method created by Peter J. Huber, which is able to account for outliers in TOI%.

Values for TOI% in these graphs that differ greatly from the overall trend of the data could potentially skew the results from an ordinary least squares regression.
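To make the idea concrete, here is a simplified sketch of a Huber fit using iteratively reweighted least squares with a fixed threshold `delta`. This is my own illustration of the technique on made-up data, not the implementation used for the site's models, and library implementations typically also estimate a scale for the residuals rather than using a fixed threshold.

```python
def weighted_line(xs, ys, ws):
    """Weighted least squares for y = m*x + b."""
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def huber_line(xs, ys, delta, iterations=50):
    """Fit y = m*x + b under the Huber loss via iteratively reweighted least squares.

    Residuals within delta keep full weight (squared loss); larger residuals
    are down-weighted by delta/|residual| (linear loss).
    """
    m, b = weighted_line(xs, ys, [1.0] * len(xs))  # start from the OLS fit
    for _ in range(iterations):
        residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
        weights = [1.0 if abs(r) <= delta else delta / abs(r) for r in residuals]
        m, b = weighted_line(xs, ys, weights)
    return m, b


# Made-up data on the line y = 0.8x + 0.05, with one outlier at the high end.
xs = [i / 100 for i in range(10, 50, 2)]
ys = [0.8 * x + 0.05 for x in xs]
ys[-1] += 0.30  # a single season far above the trend
ols_m, ols_b = weighted_line(xs, ys, [1.0] * len(xs))
hub_m, hub_b = huber_line(xs, ys, delta=0.02)
print(round(ols_m, 3), round(hub_m, 3))
```

In this toy example the ordinary least squares slope is pulled well away from 0.8 by the single outlier, while the Huber slope stays close to it. The threshold is a tuning parameter, which is the kind of value a grid search over cross-validated RMSE can select.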

After grid searching to find optimal parameters for our Huber linear regression we arrived at the following models.

F model:

D model:

And for leagues where individual shots are not tracked, and we only have Goals_On%, the Huber linear regression gives us the following models.

F model:

D model:

| Split | RMSE With TShP% | RMSE Without TShP% |
| --- | --- | --- |
| Forwards | 49.71 | 51.38 |
| Defense | 61.85 | 63.44 |

This change to the Huber linear regression doesn't offer a drastic improvement in terms of model performance but considering that there's no added complexity from this approach compared to the simple linear regression I feel that it's worthwhile.

Now that we have decided on the final model we will be using we can test the model using the testing dataset which we have held out until this point.

| Split | Test RMSE With TShP% | Test RMSE Without TShP% |
| --- | --- | --- |
| Forwards | 51.55 | 53.98 |
| Defense | 70.85 | 69.57 |

These are the models that I ultimately decided to use for the site's new time on ice estimates. In this process I also considered some models using higher order polynomials. While these did offer some improvements in RMSE during cross validation, I decided not to use them due to erratic behavior at the upper and lower limits of Goals_On%. The graphs for the higher order models are available in the GitHub repository for this project.

### Conclusion

The goal of this was to improve the site's time on ice estimates as well as provide visibility into how we arrive at the models being used for eTOI on this site. If anything about these models or the provided code is unclear to you, please feel free to reach out to me by tweeting or messaging me on Twitter @3Hayden2 or by sending an email to 3Hayden2@gmail.com.