Cross Metrics: The Analytics Revolution Comes to Cyclocross

Advanced analytics are all the rage in sports right now.

From the NBA abandoning the mid-range jumper for treys and layups to the NFL treating running backs like second-class citizens, a focus on analytics done by pointy-headed Ivy Leaguers is remaking the way sports are played and rosters are constructed.

Advanced analytics have yet to break into cyclocross, but during a recent episode of Cyclocross Radio, our crew of Bill, Micheal and me came up with a metric on the spot that quickly became known as the On Podium Percentage, or OPP for short. All credit is due to Bill because, well, you know, he was old enough to remember when that song was released.

Bill and I are admittedly huge basketball fans, so we have long hinted at the idea of developing “advanced metrics” for cyclocross. The on-the-spot creation of the OPP inspired me to think more about what we can measure by delving deeper into rider results.

This post shares some ideas I came up with after some intense brainstorming. I am not a pointy-headed Ivy Leaguer, just an over-educated state school product, so while I am likely not destined for a spot in the front office of my Milwaukee Bucks, hopefully there are a few interesting nuggets to be gleaned from this exercise.

The Assumptions

As I more or less said on the first edition of the Groadio power rankings last summer, when it comes to dealing with numbers, it is not as much about the numbers as the assumptions you make before you do the calculations. The same is true for the Cross Metrics presented here.

Before we go further, many thanks are due to and for compiling results into easy-to-use databases.

Despite their work, we are still limited to numerical results. In a perfect world, we would have lap-by-lap results and times, but for now, we are limited to results for UCI races domestically and abroad.

For this exercise, I did calculations for international riders and domestic riders, with some caveats.

International events were defined as World Cups and Low Country events that held Elite races—Elite Euros counted, U23 and Junior Euros did not. National Championships also were not included.

Domestic races were defined as Elite American and Canadian events. The World Cups did not count since they were included with the International races. Same story holds for National Championships.

While the UCI, CrossResults and USA Cycling use a rolling 12-month window, the first-ever Cross Metrics were calculated based on only the 2019-20 season. While there is some continuity between seasons for riders, cyclocross has shown itself to be a fickle sport, and a rider’s performance one season is not necessarily the same as it is in another.

Two final assumptions before we move on: a rider has to have 5 results to qualify for inclusion in the CM, and the max score for any one race is 30, whether it be a DNF or a really bad race.
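For anyone who wants to play along at home, here is a minimal sketch of those last two assumptions in Python. The "DNF" label, the function names and the sample results are all made up for illustration; the real results databases may record things differently.

```python
MAX_SCORE = 30    # cap from the text: a DNF or a really bad race both score 30
MIN_RESULTS = 5   # a rider needs at least this many results to be included


def clean_results(raw_results):
    """Convert raw finishes to capped scores.

    Assumes a DNF is recorded as the string "DNF" (an assumption, not
    necessarily how the source databases store it).
    """
    scores = []
    for result in raw_results:
        if result == "DNF":
            scores.append(MAX_SCORE)
        else:
            scores.append(min(int(result), MAX_SCORE))
    return scores


def qualifies(scores):
    """True if the rider has enough results to get Cross Metrics."""
    return len(scores) >= MIN_RESULTS


# Made-up season: two good days, a DNF, a 45th and a 7th
season = clean_results([1, 2, "DNF", 45, 7])
print(season, qualifies(season))
```

Every metric below can then be calculated on the capped scores rather than the raw placings.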

The Metrics

Now that I have made an ass out of you and me, it is time to move on to the metrics. My ostensible forte is in statistical hydrology, so there are likely more holes in these metrics than a 32-spoke rim, so feel free to let me know what I can do better.

OPP – On Podium Percentage

Any way you cut it, a podium finish is the gold standard in cycling, so what better way to kick off the Cross Metrics than with the On Podium Percentage, or OPP. Yes yes, you know me.

Calculating the OPP is simple enough:

OPP = Podium finishes / Races Started
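As a quick sketch in Python, assuming a rider's season is just a list of finishing places (the function name and the example finishes are hypothetical):

```python
def opp(finishes):
    """On Podium Percentage: podium finishes divided by races started."""
    if not finishes:
        raise ValueError("need at least one race started")
    podiums = sum(1 for place in finishes if place <= 3)
    return podiums / len(finishes)


# Made-up season of eight races, six of them on the podium
example = [1, 2, 1, 4, 3, 1, 9, 2]
print(f"OPP = {opp(example):.2f}")  # OPP = 0.75
```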

The numbers that stand out for OPP are Mathieu van der Poel’s 100% OPP and the 92% OPPs of Ceylin Alvarado on the international stage and Maghalie Rochette at home in North America.

Admittedly, also surprising is the 36% OPP for defending Elite Women’s World Champ Sanne Cant.

If you are looking for a counterargument to the final CX Heat Check that put Kerry Werner in the Number 1 slot ahead of Curtis White, you could point to White’s 88% OPP versus Werner’s 65%.

WAPP – Wide Angle Podium Percentage

Cyclocross Radio’s The Media Pit is part of the Wide Angle Podium network, so it only makes sense that there is a Wide Angle Podium (TM) Percentage metric that measures how often riders finish in the Top 5.

WAPP = Top 5 Finishes / Races Started

PSZP – Podium Scrub Zone Percentage

The Scrub Zone of cyclocross has been well established thanks to the work of Colin Reuter and others, but credit for the Podium Scrub Zone goes to my friend Narayan, as far as I know.

For our purposes, the Podium Scrub Zone—or “thereabouts” as they call it in Europe—is defined as 4th through 6th place. Good results no doubt, but still the scrub zone with respect to those who get to stand up on the stage and get the accolades.

PSZP = 4th – 6th Place Finishes / Races Started
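The podium-style percentages are all the same calculation with a different range of places, so one small helper covers them. This is only a sketch; the function names and the sample finishes are invented for illustration.

```python
def finish_range_pct(finishes, best, worst):
    """Share of races started with a finish between best and worst, inclusive."""
    hits = sum(1 for place in finishes if best <= place <= worst)
    return hits / len(finishes)


def wapp(finishes):
    """Wide Angle Podium Percentage: Top 5 finishes / races started."""
    return finish_range_pct(finishes, 1, 5)


def pszp(finishes):
    """Podium Scrub Zone Percentage: 4th-6th place finishes / races started."""
    return finish_range_pct(finishes, 4, 6)


# Made-up season of ten races
example = [1, 2, 1, 4, 3, 1, 9, 2, 5, 6]
print(f"WAPP = {wapp(example):.2f}, PSZP = {pszp(example):.2f}")
```

Since the Top 5 and the Podium Scrub Zone overlap at 4th and 5th place, the two percentages do not have to add up to anything in particular.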

Bad Legs Days

Pro or amateur, we have all had those “bad legs days.” Whether a BLD be a lame excuse or a legit bad day, we are allowed an off day.

Unfortunately, those supposed “bad legs days” happen more often for some than others. The Bad Legs Day (BLD) metric tries to get at those riders who are most prone to off days.

Bad days are obviously relative from rider to rider. In recent years, a second-place finish is considered the end of the world for Mathieu van der Poel, while it might take a 10th or worse finish to be considered an off day for others.

For the Cross Metrics, a finish 5 places or more worse than that rider’s median finish for the year is considered a BLD. The decision to use 5 places worse and the median result are admittedly both arbitrary, so feel free to argue otherwise.

BLDs are reported as both an absolute value and a percentage of races started.
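Here is a rough sketch of the BLD count in Python, using the 5-places-worse-than-median rule described above. The function name, the threshold parameter and the sample finishes are hypothetical.

```python
from statistics import median


def bad_legs_days(finishes, threshold=5):
    """Return (BLD count, BLD share of races started).

    A Bad Legs Day is a finish at least `threshold` places worse than
    the rider's median finish for the season.
    """
    med = median(finishes)
    count = sum(1 for place in finishes if place - med >= threshold)
    return count, count / len(finishes)


# Made-up season: median finish is 3, so 8th or worse counts as a BLD
example = [2, 3, 1, 4, 12, 3, 2, 9]
count, share = bad_legs_days(example)
print(count, share)  # 2 0.25
```

Changing the threshold argument is an easy way to test how much the arbitrary 5-place cutoff matters.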

80% Rule

Consistency can be an underappreciated thing in cyclocross, especially for riders who are not necessarily on the podium week in and week out. The 80% Rule metric seeks to get at a rider’s results consistency.

It is not, as one might expect, named for the number of times a rider got pulled by the 80% rule. What it does is identify the range that 80% of a rider’s results fall between.

Now, if we were doing Science!, it would be the 95% rule, and we would make sure our p-value was less than 0.05. However, since this is cyclocross, it made more sense to calculate the 10th and 90th percentiles for a rider’s results.

Keep in mind, this is a statistical analysis of the range a rider is most likely to finish within.

The most impressive of these are Van der Poel’s 1-1 and Rochette’s domestic 1-1.

Speaking to that measure of consistency, new Tormans CX Team teammates Quinten Hermans and Corne van Kessel have 80% Rule spreads of 2-8.6 and 3-8.5, respectively. So even though they are not necessarily winning races, they are pretty consistently finishing on the edge of the podium and no worse than about 8th or 9th place.
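A sketch of the 80% Rule in Python might look like the following. The 10th and 90th percentiles come from the standard library's `statistics.quantiles`; the choice of the "inclusive" interpolation method is an assumption on my part, and a different method would shift the numbers slightly. The function name and the sample finishes are hypothetical.

```python
from statistics import quantiles


def eighty_pct_rule(finishes):
    """Return (10th percentile, 90th percentile, spread) of a rider's finishes.

    quantiles(..., n=10) returns the nine decile cut points, so the first
    is the 10th percentile and the last is the 90th. The "inclusive"
    method (linear interpolation between results) is an assumed choice.
    """
    cuts = quantiles(finishes, n=10, method="inclusive")
    low, high = cuts[0], cuts[-1]
    return low, high, high - low


# Made-up rider who finished 1st through 11th exactly once each
print(eighty_pct_rule(list(range(1, 12))))  # (2.0, 10.0, 8.0)

# A rider who wins everything gets the 1-1 range
print(eighty_pct_rule([1, 1, 1, 1, 1]))  # (1.0, 1.0, 0.0)
```

The third value is the gap between the two percentiles, which is what the 80% Spread column in the tables reports.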

Average Placing Difference

The Average Placing Difference, or APD, is another metric designed to assess a rider’s consistency. The metric measures the average difference from a rider’s median finish across all races.

APD = Σ│(Finish – Median)│ / Races Started

**The little │ thing means absolute value and sigma ( Σ ) stands for sum
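In Python, the APD formula works out to a couple of lines. The function name and the example seasons are made up; the second season swaps one win for a capped score of 30 to show how much a single DNF moves the number.

```python
from statistics import median


def apd(finishes):
    """Average Placing Difference: mean absolute distance from the median finish."""
    med = median(finishes)
    return sum(abs(place - med) for place in finishes) / len(finishes)


# Made-up season: mostly wins, with a 2nd and a 3rd
steady = [1, 1, 2, 1, 1, 3, 1, 1]
# Same season, but the last win becomes a DNF scored as 30
with_dnf = [1, 1, 2, 1, 1, 3, 1, 30]

print(apd(steady))    # 0.375
print(apd(with_dnf))  # 4.0
```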

For APD, Van der Poel provides an interesting case study because as it stands, he has an APD of 0.1, but if he had taken a DNF or Van der Quit at Ronse like he was prone to do in the past, his APD would have jumped to 1.4.

That would not have necessarily captured how well he was racing, but sometimes, at least theoretically, the math is the math.

Internationally, the APD shows how well the top women have ridden. Alvarado and Worst have APDs of 0.9 and 1.6, which are much better than the 3.6 sported by Sanne Cant.

Lucinda Brand’s value is 3.8, but she falls into that theoretical Van der Poel situation after getting crashed out at Loenhout and taking a DNF. Without that 30 value, her APD would be in line with Alvarado’s.

A Sampling of Cross Metrics

Cross Metrics for what is admittedly an arbitrary number of top domestic and international riders are shown in the tables below.

Domestic Women: 2019 CrossMetrics

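| Rider | OPP | WAPP | PSZP | Bad Legs Days | BLD Percent | 80% Rule | 80% Spread | APD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Maghalie Rochette | 0.92 | 0.92 | 0.00 | 1 | 0.08 | 1.0 - | | |
| Clara Honsinger | 0.87 | 0.87 | 0.00 | 2 | 0.13 | 1.0 - | | |
| Caroline Mani | 0.57 | 1.00 | 0.43 | 0 | 0.00 | 2.0 - | | |
| Rebecca Fahringer | 0.75 | 0.85 | 0.10 | 3 | 0.15 | 1.0 - | | |
| Courtenay McFadden | 0.40 | 0.53 | 0.33 | 3 | 0.20 | 2.0 - | | |
| Jenn Jackson | 0.29 | 0.59 | 0.35 | 3 | 0.18 | 1.6 - 11.6 | 10.0 | 3.2 |
| Katie Clouse | 0.67 | 0.83 | 0.17 | 1 | 0.17 | 1.5 - | | |
| Ruby West | 0.31 | 0.38 | 0.23 | 5 | 0.38 | 3.0 - 13.8 | 10.8 | 3.2 |
| Raylyn Nuss | 0.27 | 0.50 | 0.31 | 7 | 0.27 | 2.0 - | | |
| Sammi Runnels | 0.11 | 0.32 | 0.26 | 2 | 0.11 | 3.8 - | | |
| Caroline Nolan | 0.40 | 0.73 | 0.40 | 2 | 0.13 | 1.0 - | | |
| Ellen Noble | 0.25 | 0.25 | 0.17 | 4 | 0.33 | 1.2 - 18.7 | 17.5 | 5.4 |
| Sunny Gilbert | 0.22 | 0.56 | 0.44 | 5 | 0.28 | 3.0 - 13.3 | 10.3 | 3.1 |
| Madigan Munro | 0.33 | 0.33 | 0.17 | 0 | 0.00 | 2.5 - | | |
| Lizzy Gunsalus | 0.14 | 0.21 | 0.14 | 4 | 0.29 | 2.9 - | | |
| Crystal Anthony | 0.15 | 0.31 | 0.15 | 2 | 0.15 | 3.2 - 14.4 | 11.2 | 3.8 |
| Hannah Arensman | 0.22 | 0.22 | 0.00 | 0 | 0.00 | 2.6 - | | |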

Domestic Men: 2019 CrossMetrics

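| Rider | OPP | WAPP | PSZP | Bad Legs Days | BLD Percent | 80% Rule | 80% Spread | APD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Curtis White | 0.88 | 0.88 | 0.06 | 2 | 0.12 | 1.0 - | | |
| Kerry Werner | 0.65 | 0.85 | 0.25 | 2 | 0.10 | 1.0 - | | |
| Gage Hecht | 0.55 | 0.73 | 0.27 | 2 | 0.18 | 2.0 - | | |
| Michael van den Ham | 0.38 | 0.50 | 0.19 | 4 | 0.25 | 1.0 - | | |
| Stephen Hyde | 0.71 | 0.71 | 0.14 | 2 | 0.14 | 1.3 - | | |
| Lance Haidet | 0.43 | 0.57 | 0.21 | 2 | 0.14 | 1.3 - 11.8 | 10.5 | 3.8 |
| Drew Dillman | 0.24 | 0.41 | 0.24 | 4 | 0.24 | 2.2 - 16.4 | 14.2 | 4.3 |
| Eric Brunner | 0.38 | 0.50 | 0.25 | 2 | 0.25 | 2.4 - 21.6 | 19.2 | 6.5 |
| Lane Maher | 0.31 | 0.50 | 0.31 | 4 | 0.25 | 1.5 - | | |
| Tobin Ortenblad | 0.25 | 0.25 | 0.00 | 3 | 0.19 | 1.5 - 17.5 | 16.0 | 5.6 |
| Cody Kaiser | 0.06 | 0.38 | 0.38 | 3 | 0.19 | 4.0 - | | |
| Jamey Driscoll | 0.40 | 0.40 | 0.10 | 4 | 0.40 | 2.0 - 16.4 | 14.4 | 5.9 |
| Sam Noel | 0.25 | 0.33 | 0.25 | 4 | 0.33 | 2.1 - 14.9 | 12.8 | 5.2 |
| Eric Thompson | 0.21 | 0.43 | 0.21 | 6 | 0.43 | 3.0 - 20.8 | 17.8 | 6.5 |
| Travis Livermon | 0.16 | 0.58 | 0.42 | 8 | 0.42 | 3.0 - | | |
| Cody Cupp | 0.27 | 0.55 | 0.27 | 4 | 0.36 | 3.0 - | | |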

International Women: 2019-2020 CrossMetrics

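| Rider | OPP | WAPP | PSZP | Bad Legs Days | BLD Percent | 80% Rule | 80% Spread | APD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ceylin Alvarado | 0.92 | 0.92 | 0.08 | 0 | 0.00 | 1.0 - | | |
| Annemarie Worst | 0.79 | 0.92 | 0.13 | 2 | 0.08 | 1.0 - | | |
| Sanne Cant | 0.36 | 0.48 | 0.32 | 3 | 0.12 | 2.0 - | | |
| Lucinda Brand | 0.78 | 0.89 | 0.11 | 1 | 0.11 | 1.0 - | | |
| Maghalie Rochette | 0.11 | 0.22 | 0.11 | 2 | 0.22 | 4.2 - | | |
| Yara Kastelijn | 0.57 | 0.83 | 0.26 | 4 | 0.17 | 1.0 - | | |
| Inge van der Heijden | 0.21 | 0.47 | 0.26 | 6 | 0.32 | 3.0 - 18.6 | 15.6 | 5.4 |
| Clara Honsinger | 0.11 | 0.22 | 0.33 | 3 | 0.33 | 3.8 - | | |
| Katerina Nash | 0.25 | 0.38 | 0.13 | 2 | 0.25 | 1.7 - 15.9 | 14.2 | 5.5 |
| Katie Compton | 0.28 | 0.39 | 0.22 | 5 | 0.28 | 2.7 - 15.5 | 12.8 | 4.6 |
| Alice Maria Arzuffi | 0.16 | 0.36 | 0.32 | 4 | 0.16 | 3.0 - 16.8 | 13.8 | 4.8 |
| Lucia Gonzalez Blanco | 0.00 | 0.00 | 0.00 | 2 | 0.22 | 17.6 - | | |
| Ellen Van Loy | 0.03 | 0.26 | 0.26 | 7 | 0.21 | 5.0 - 15.7 | 10.7 | 3.6 |
| Caroline Mani | 0.00 | 0.00 | 0.13 | 3 | 0.38 | 6.7 - 23.5 | 16.8 | 6.5 |
| Kaitie Keough | 0.00 | 0.31 | 0.31 | 5 | 0.38 | 4.2 - 18.6 | 14.4 | 5.5 |
| Rebecca Fahringer | 0.11 | 0.22 | 0.11 | 1 | 0.11 | 3.8 - | | |
| Laura Verdonschot | 0.13 | 0.29 | 0.25 | 5 | 0.21 | 3.3 - | | |
| Eva Lechner | 0.09 | 0.27 | 0.27 | 1 | 0.05 | 4.0 - | | |
| Anna Kay | 0.24 | 0.33 | 0.24 | 9 | 0.43 | 3.0 - | | |
| Loes Sels | 0.00 | 0.10 | 0.14 | 11 | 0.38 | 5.8 - | | |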

International Men: 2019-2020 CrossMetrics

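| Rider | OPP | WAPP | PSZP | Bad Legs Days | BLD Percent | 80% Rule | 80% Spread | APD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mathieu van der Poel | 1.00 | 1.00 | 0.00 | 0 | 0.00 | 1.0 - | | |
| Toon Aerts | 0.56 | 0.76 | 0.28 | 3 | 0.12 | 1.4 - | | |
| Eli Iserbyt | 0.67 | 0.81 | 0.19 | 4 | 0.15 | 1.0 - | | |
| Laurens Sweeck | 0.46 | 0.50 | 0.12 | 6 | 0.23 | 2.0 - | | |
| Michael Vanthourenhout | 0.26 | 0.56 | 0.37 | 4 | 0.15 | 2.6 - | | |
| Lars van der Haar | 0.15 | 0.54 | 0.54 | 5 | 0.19 | 3.0 - | | |
| Quinten Hermans | 0.44 | 0.56 | 0.28 | 3 | 0.12 | 2.0 - | | |
| Corne van Kessel | 0.19 | 0.42 | 0.42 | 2 | 0.08 | 3.0 - | | |
| Gianni Vermeersch | 0.11 | 0.21 | 0.14 | 5 | 0.18 | 3.7 - | | |
| Felipe Orts Lloret | 0.07 | 0.07 | 0.00 | 3 | 0.20 | 8.4 - 27.8 | 19.4 | 6.6 |
| Tom Pidcock | 0.38 | 0.62 | 0.29 | 4 | 0.19 | 2.0 - | | |
| Tim Merlier | 0.27 | 0.55 | 0.27 | 2 | 0.09 | 2.0 - | | |
| Marcel Meisen | 0.00 | 0.00 | 0.00 | 5 | 0.24 | 7.0 - | | |
| Michael Boros | 0.00 | 0.00 | 0.00 | 1 | 0.11 | 17.4 - 25.8 | 8.4 | 21.8 |
| Jens Adams | 0.00 | 0.19 | 0.26 | 6 | 0.22 | 5.0 - 20.8 | 15.8 | 4.1 |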

4 thoughts on “Cross Metrics: The Analytics Revolution Comes to Cyclocross”

  1. This is great! Thanks!

    I’m also a data scientist by day and a CX fan by, uh, day, too, so I can’t help but think about ways to try to expand or extend what you’ve already done here. One idea so far:

    XCxCX. In cross-country running (a.k.a. XC), team scores are computed as the sum of the placings for the first five finishers from each team. So, what if we treated a single athlete’s best results over a rolling window of, say, 10 events like a team score from a single event? The best possible score would now be 5 instead of 15, while the worst possible score would depend on the events used, but in practice it would mostly smooth out over 10 events, and who cares about the comparability of the scores close to that tail of the distribution anyway? In contrast to the various podium-based metrics, this would give more weight to better finishes, and it would discriminate more among athletes further down the results list, too. A little post-computation transformation might be in order to get a distribution we really like, but we’d have to see the raw numbers to figure that out.

  2. How about a good ol’fashioned standard deviation thrown in there? Does someone else already do that?

  3. @Jay … Interesting idea. I think that is kind of similar to what CrossResults does. I am admittedly not a data scientist, so anything past podium finishes divided by races is probably mostly bush-league level mumbo jumbo from me.

    @Davis … I played around with using means and standard deviations. I opted for the median and the made-up deviation values because both the means and standard deviations were skewed by bad results. I wanted a way to eliminate the effect of outliers.

    For example, the mean and standard deviation of Iserbyt’s results are 4.4 and 5.6, which does not, IMO, accurately reflect the results he has gotten.
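For the curious, the XCxCX idea from the first comment could be sketched roughly like this in Python. The function name, the parameter names and the sample finishes are all hypothetical, and this is just one reading of the proposal: sum the five best placings from a rider's ten most recent races.

```python
def xcxcx(finishes, window=10, counted=5):
    """Sum of the `counted` best placings from the last `window` races.

    `finishes` is assumed to be in chronological order, oldest first.
    Lower is better; five wins in the window gives the best score of 5.
    """
    recent = finishes[-window:]
    if len(recent) < counted:
        raise ValueError("not enough results in the window")
    return sum(sorted(recent)[:counted])


# Made-up run of eleven races; only the last ten are in the window
example = [4, 1, 2, 7, 1, 3, 1, 12, 2, 1, 5]
print(xcxcx(example))  # 6
```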
