Stats in junior hockey are extremely limited compared to what's available for the NHL. Many leagues publish nothing beyond simple point production metrics for their players. However, with hockey analytics growing, some junior leagues are expanding their stats to keep up. The QMJHL was the first of the major junior leagues to do so, adding shot location tracking beginning in the 2012-13 season. Then, in 2015-16, the OHL added shot tracking as well, giving those of us interested in hockey analytics some interesting new data to explore.

My site has used this shot data to present stats based on danger zones, which split the offensive zone into high danger, medium danger, and low danger areas based on how close to the net a shot was taken. The concept of danger zones is intuitive to hockey fans, since we generally recognize that shots from the slot are more dangerous than shots from the perimeter or the blue line. However, there are issues with breaking the offensive zone into just three regions. This point has been argued many times before, so I won't go too in depth here, but it suffices to say that this kind of binning doesn't fairly represent the true danger of shots. An improvement is a model that properly accounts for the full range of shot locations and other factors. Tomorrow I'll be releasing an article on my xG model, which does exactly that, but first I wanted to give some background on the shot location data that model uses.
To build a reliable model, it's crucial that your data is consistent and not being distorted by some form of selection bias. In shot location tracking there are several sources of potential bias that must be adjusted for when building a model. We can easily see that selection bias exists in the raw data by looking at the following two plots.
The first is the shot location data for the Mississauga Steelheads from the 2016-17 OHL season. The second is the shot locations from the 2017-18 OHL season. You can see that the distributions are far off from each other in a way that's almost assuredly not representative of the actual shot locations. These are the types of issues in the raw data that we would like to account for.
For the remainder of this article I'm going to walk through my process of cleaning the shot location data, so if you aren't interested in the details of adjusting data for consistency you may want to skip to the summary section.
First, I want to get an overall feel for the distribution of the data. For this, I'll just look at the distribution of X and Y locations by making some simple histograms.
Everything here makes sense. X locations range from 0 to 600, and we would expect them to be highly concentrated before the 300 point, since that marks center ice. Y locations range from 0 to 300, and we would expect shots to occur at all points along that range, with more toward the center than on the perimeter; those assumptions bear out in the data. From this we can see that the shot locations are all occurring in the zone where we would expect them to be.
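For anyone who wants to reproduce this kind of sanity check, here's a minimal sketch using synthetic stand-in coordinates (the real dataset's loading code and column names are omitted, so treat the specifics as placeholders):

```python
import numpy as np

# Synthetic stand-ins for the recorded coordinates; the real dataset's
# loading code and column names are omitted here.
rng = np.random.default_rng(0)
x = np.clip(rng.normal(200, 70, 5000), 0, 600)  # X: concentrated before 300
y = np.clip(rng.normal(150, 60, 5000), 0, 300)  # Y: centered, full range

# Histograms over the full coordinate grid; any mass outside these ranges
# would flag shots recorded off the expected zone.
x_counts, x_edges = np.histogram(x, bins=20, range=(0, 600))
y_counts, y_edges = np.histogram(y, bins=20, range=(0, 300))

print(x_counts.sum() == len(x), y_counts.sum() == len(y))  # True True
```

If either histogram dropped shots (counts summing to less than the number of shots), that would be the first sign of coordinates recorded outside the expected grid.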
Generally, we want to start from a bird's-eye view and then gradually drill down to more specific subsections. The first split we'd want to look at is the difference between the locations recorded in the OHL and those recorded in the QMJHL.
Here we can see there is a difference in the distributions of shot location recordings between the OHL and the QMJHL. Personally, I don't think this is a difference that we should adjust for since it makes sense that shots in the OHL would not follow the same distribution as those in the QMJHL. Those who follow these leagues closely tend to agree that the two play a slightly different style of hockey so some difference in where shots occur is a reasonable expectation.
Next, I'm interested in a seasonal breakdown of the different distributions. For this I'll make a few violin plots showing the distribution of X and Y locations grouped by season.
There are a few noteworthy things in these graphs. First, the X and Y data for the 2017-18 OHL season are inconsistent with the two previous seasons. The X locations follow a similarly shaped distribution but cover a different range. The Y locations cover the same range, but the distribution is more concentrated at the center than in previous seasons. There weren't any rule changes that would cause a dramatic season-to-season shift like this, so I believe it must be caused by a change in the recording process. These differences are significant enough that I think it's worth adjusting for them.
As we saw in the earlier graphs of the Steelheads' 2016-17 and 2017-18 shots, the X location range from the 2017-18 season lines up with what we would expect, while in 2016-17 shots seem to be recorded too far from both the crease and the blue line, compressed towards the center. Because of this, I adjust the X locations from previous seasons to fit the 2017-18 season. For the Y locations, I feel that 2015-16 and 2016-17 have distributions more representative of where shots in hockey tend to occur than 2017-18 does, so I adjust the Y locations to fit the 2015-16 distribution. I use cumulative distribution functions to map the X distributions of earlier OHL seasons onto 2017-18 and the Y distributions onto 2015-16, giving us these plots for our season-adjusted X and Y locations in the OHL.
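The CDF-based adjustment amounts to quantile mapping: push each value through its own distribution's empirical CDF, then through the inverse empirical CDF of the target distribution. Here's a minimal sketch of that idea, using synthetic stand-in distributions rather than the real shot data:

```python
import numpy as np

def cdf_match(values, source, target):
    """Map each value through the source empirical CDF, then through the
    inverse empirical CDF of the target (quantile mapping)."""
    src = np.sort(source)
    # Empirical CDF rank of each value within the source distribution.
    ranks = np.searchsorted(src, values, side="right") / len(src)
    # Invert the target CDF at those ranks.
    return np.quantile(target, np.clip(ranks, 0, 1))

rng = np.random.default_rng(1)
x_2016 = rng.normal(250, 40, 10_000)  # stand-in: compressed 2016-17 X values
x_2017 = rng.normal(250, 70, 10_000)  # stand-in: wider 2017-18 X values

adjusted = cdf_match(x_2016, x_2016, x_2017)
# After matching, the adjusted values should roughly share 2017-18's spread.
print(round(adjusted.std()), round(x_2017.std()))
```

The key property is that a shot at, say, the 80th percentile of the compressed 2016-17 distribution lands at the 80th percentile of the 2017-18 distribution, so relative position is preserved while the overall shape is brought into line.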
With this adjustment, we no longer have the issue of X values being scaled differently or Y locations having widely varying distributions.
In the QMJHL there isn't as much seasonal bias as we saw in the OHL, and the differences we do see are slight enough that I don't feel it's necessary to adjust QMJHL shots for seasonal bias. What I would like to do, though, is make sure that the scale of QMJHL shots aligns with the scale of OHL shots. Since QMJHL and OHL games are played on equally sized rinks and shots are recorded on grids with identical dimensions, the difference in scale is unlikely to be a true representation of where shots occur. The Y locations exist on the same scale in both leagues, but the QMJHL X locations are more compressed, like the 2015-16 and 2016-17 OHL seasons. To adjust for this, I rescale the QMJHL X locations for each season to be in line with the 2017-18 OHL X locations. I do not use cumulative distribution functions here because I want to maintain the distribution that is unique to the QMJHL, giving us these adjusted X and Y locations in the QMJHL.
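Because this step preserves the QMJHL's own distribution shape, it's a straight linear map onto the target range rather than a CDF match. A sketch with stand-in values (using the observed min and max as the endpoints, which is a simplification; the real endpoints would come from the OHL 2017-18 scale):

```python
import numpy as np

def rescale(values, target_min, target_max):
    """Linearly rescale values onto a new range, preserving the shape of
    the distribution (unlike CDF matching, which reshapes it)."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo) * (target_max - target_min) + target_min

rng = np.random.default_rng(2)
q_x = rng.uniform(120, 480, 1000)    # stand-in for compressed QMJHL X values
adjusted = rescale(q_x, 0.0, 600.0)  # align with the OHL 2017-18 X scale

print(round(adjusted.min()), round(adjusted.max()))  # 0 600
```

A linear map keeps every value's relative distance from its neighbors, so whatever is genuinely different about QMJHL shot patterns survives the adjustment; only the units change.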
The next factor we'd like to account for is rink-to-rink variation in how locations are tracked. Since different rinks have different trackers, we can expect each tracker to have its own unique bias that ought to be accounted for. For this we'll look at the distributions of a team's recorded shot locations when playing at home compared to when playing on the road.
The home distributions for both X and Y locations vary quite a bit from the road distributions for the same team, which is an easy way to see that bias exists across rinks. A team's home games are all recorded by the same people, so their bias is accentuated in those games, while road games are recorded by many different people with different biases, and the aggregation of those is more likely to give us a fair representation than the home locations are. This shows that rink bias exists in the data, but I don't want to adjust for it at this level. Teams may have had different trackers in each of the seasons we have data for, so we don't want to rink-adjust all seasons at once. Rather, we want to adjust rink bias one season at a time, to account for home rink biases that change between seasons. The following two graphs show why this matters.
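One simple way to put a number on how far a team's home-recorded distribution drifts from its road-recorded one is the maximum gap between the two empirical CDFs (the Kolmogorov-Smirnov distance). This isn't a statistic used in the article itself, just a sketch of how the comparison could be quantified, with synthetic stand-ins:

```python
import numpy as np

def ks_distance(a, b):
    """Maximum gap between two empirical CDFs: 0 means identical samples,
    larger values mean the two recorded distributions drift further apart."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(3)
home = rng.normal(230, 40, 2000)  # stand-in: one home tracker's biased X values
road = rng.normal(250, 60, 2000)  # stand-in: aggregate of many road trackers

print(ks_distance(home, road) > ks_distance(road, road))  # True
```

Computed per team and per season, a measure like this would flag exactly the rinks whose home recordings need the most correction.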
As an example, you can see above that the home rink bias for the Saginaw Spirit changes from 2016-17 to 2017-18. In 2017-18 the home rink bias appears to overstate shots close to the net while the opposite is true for 2016-17 where the home rink assigned them more shots near the blue line than away rinks did. If we adjusted the rink bias across all seasons at once we would be improperly accounting for the change in rink bias between these seasons.
For each season, I use cumulative distribution functions to adjust the distribution of a team's home games to fit the distribution of their away games in that same season, giving our final season- and rink-adjusted X and Y locations.
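Putting the per-team, per-season adjustment together looks roughly like the following. The schema here (column names, team codes) is hypothetical, and `cdf_match` is a small quantile-mapping helper, not the author's actual code:

```python
import numpy as np
import pandas as pd

def cdf_match(values, target):
    """Quantile mapping: send each value's empirical-CDF rank within its own
    sample through the inverse empirical CDF of the target sample."""
    ranks = np.searchsorted(np.sort(values), values, side="right") / len(values)
    return np.quantile(target, np.clip(ranks, 0, 1))

# Hypothetical schema: one row per shot by a team, flagged home or road.
rng = np.random.default_rng(4)
shots = pd.DataFrame({
    "season": rng.choice(["2016-17", "2017-18"], 4000),
    "team": rng.choice(["SAG", "MISS"], 4000),
    "is_home": rng.choice([True, False], 4000),
    "x": rng.normal(250, 60, 4000),
})

# Adjust each (team, season) group's home-game shots to match the
# distribution of that same team's road shots in that same season.
for (team, season), g in shots.groupby(["team", "season"]):
    home = g.loc[g["is_home"], "x"]
    road = g.loc[~g["is_home"], "x"]
    if len(home) and len(road):
        shots.loc[home.index, "x_adj"] = cdf_match(home.to_numpy(),
                                                   road.to_numpy())

# Road shots are treated as the reference and pass through unchanged.
shots.loc[~shots["is_home"], "x_adj"] = shots.loc[~shots["is_home"], "x"]
print(shots["x_adj"].notna().all())  # True
```

Grouping by both team and season is the key design choice here: it keeps a tracker change between seasons, like the Saginaw example above, from contaminating the correction applied to either season.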
Adjusting for biases in recorded data is crucial if you're hoping to make comparisons based on that data. If you're looking at unadjusted data it's incredibly easy to fall into a trap where you end up paying attention to differences that are simply noise and not representative of the actual differences in the dataset. Because of this, I've updated all stats on my site that are based on shot location to use these new adjusted shot locations. This allows for the data to be both more descriptive of what occurred in a game as well as more predictive of what will likely happen in future games. With adjusted data there's far more value that we can pull from our limited data set than we could otherwise, and in tomorrow's article on my new xG model I hope to demonstrate the type of value that can be gained by using adjusted measurements such as these. If you have any questions or concerns please tweet at/message me @3Hayden2 or reach out through email 3Hayden2@gmail.com.
Credit to Cole Anderson of Crowd Scout Sports for first writing about this method of adjusting for home rink bias.