Submitted by spicer2 t3_11jvqtb in dataisbeautiful
spicer2 OP t1_jb4hznt wrote
Tools used: Python, Excel
Source: IAAF Toplists
Methodology/other bits:
A question I’ve wondered about for a long time is whether there’s a good way to measure how dominant athletics records are between events. I remember reading once years ago that, with the help of some statistical trickery, Paula Radcliffe’s (now-broken) marathon world record was considered to be one of the most impressive human feats, up there with Bob Beamon’s long jump, but I never tracked down the original source or methodology behind it.
I’m sure most people here know what a z-score is, but I’ll show my working for full transparency. It essentially tells you how many standard deviations from the mean a given score is, i.e. how “exceptional” it is. It’s really handy for letting you make comparisons across categories that use different measurements.
To be clear about the sample I used: I took the times of the 100 best competitors, not the 100 best times overall. So Usain Bolt’s score is gauged against the best times of Asafa Powell, Tyson Gay, etc., not against every time they ran plus his own other times. The main reason I did that was because I was most interested in how dominant the individuals were, not the times themselves.
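For anyone who wants to see the calculation spelled out, here is a minimal sketch of a record's z-score against a list of personal bests. It is not necessarily the exact code behind the chart: the function name `record_z_score`, the use of the sample standard deviation, and the choice to include the record holder's own best in the sample are all assumptions, and the times below are made-up illustrative values rather than real Toplists data.

```python
from statistics import mean, stdev


def record_z_score(personal_bests, lower_is_better=True):
    """Z-score of the best mark relative to the sample of personal bests.

    Assumptions (not confirmed by the original post): the record holder's
    own best is part of the sample, and the sample standard deviation
    is used.
    """
    mu = mean(personal_bests)
    sigma = stdev(personal_bests)
    # The record is the minimum for timed events, the maximum for jumps/throws.
    best = min(personal_bests) if lower_is_better else max(personal_bests)
    z = (best - mu) / sigma
    # Flip the sign for timed events so that "more dominant" is always positive.
    return -z if lower_is_better else z


# Hypothetical personal bests in seconds, one per athlete (illustrative only).
hypothetical_bests = [9.58, 9.69, 9.72, 9.74, 9.76]
print(round(record_z_score(hypothetical_bests), 2))
```

Flipping the sign for timed events means a faster-than-average time and a longer-than-average jump both come out positive, which is what lets track and field events sit on the same axis.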
On that note, I really like this chart as it shows just how good Usain Bolt was. I also like that it confirms the 100/110m hurdles is such a tight and unpredictable event where you don’t really get athletes that consistently sweep the board for medals, as you do in others.
PS: I gathered the data for this in January but sat on it for a bit, so some of this may be slightly out-of-date.
kompootor t1_jb507nj wrote
This is just an awesome idea all around, a strong visualization, and I look forward to seeing more expansions on it. Now question/critique/suggestion:
Crucially, over what years is the data taken? On the IAAF Toplists you can get a season cross-section of athletes' PRs (personal records, i.e. the athlete's PB that got officially recorded), or you can get all-time PRs going back apparently to 1899 (although prewar data will be terrible, especially for women). Of the all-time data, perhaps the PRs from the year of the record until now are the better pick per event, or maybe it's better to compare against the cross-section of PRs in the year the record was set (I don't know offhand). Regardless, the data date range(s) should be put on the graph, and I really think the year of each record should be added in parentheses for each event as well, since that also hints at how big a statistical outlier that record may have been.
Were the events selected for having the highest z-scores chosen with respect to men's or women's events exclusively, or a mix of the two? As the chart is ambiguous about it (it doesn't say "Top 20 events with most dominant records" or something), it seems safe to eyeball it as a mixed set, but it would still be nice to note.
I agree with the general sentiment that it would be nice to have the list sorted by z-score, but of course that's impossible to do for both men and women in this visualization while keeping parity/sanity of events and thus neatness. It is possible, however, in another graph format that one may consider playing with in future (I'm not sure how effective it would be by comparison): you take a 2D x-y graph with men's event records on the x-axis and women's records on the y-axis, each axis overloaded for the different record unit types (such that you would have adequate spacing between dots if you just plotted men's records as dots on the x-axis). Then each event gets a corresponding (x,y) point with a label; the z-scores are indicated by the label and by the point having a shape sized correspondingly in x and y (or else simply by two small bars). Then to read men's records in ascending order you follow the dots left to right, and for women's you follow the dots bottom to top. Just one possibility for what someone could do with a dataset like this.
If you do another chart, I'd personally also be interested in seeing some of the most vulnerable athletics records on the same chart, as something of a baseline for comparison. Another idea for comparison, but not as useful and so better for a separate chart, would be an identical visualization using the data between the years when a very famous WR was held, such as one set by Jesse Owens at the 1936 Olympics, or Roger Bannister's 4-minute mile.
kompootor t1_jb54i2q wrote
Sidebar comment from the above in case anyone is interested further: the prewar data that is easily available is terrible, but a lot of it is still out there, poorly summarized in disparate sources. For women's Olympic history, take the 1922 Women's World Games, aka the Women's Olympic Games (which, confusingly for those trying to research it, took place at a very similar time and place to the 1922 Women's Olympiad, with several of the same athletes, and yet with several events having just slightly different lengths). Afaict the sources most likely to hold the data still missing from the medalist table are a bunch of old contemporaneous Russian-language sports magazines that would most likely be in a national museum or archive in Ukraine or Russia or perhaps another state that has had Russian as a major language. Another interesting thing to look at is the athletes. Mary Lines shows up everywhere as a multi-sport international athlete of the time, but she has a very sparse bio on Wikipedia, as what's widely available on her seems to be poorly cited and/or otherwise difficult to plausibly verify. But a lot of these women (pseudo-)Olympians potentially have very interesting stories, especially those who did not start as, or who were not at the time, professional tennis players (tennis was basically the only respectable "get-sweaty" sport for women at the time, but stuff like archery and lawn sports was also big).
The politics and logistics of the WWG and early women's Olympics are fascinating too, since the regular Olympics at the time was not at all the quintessential transnational institution it is today. They all struggled with just the basics of funding at all levels, even just to get the necessary grants to bring all of the world's (aka Europe's) top athletes to a single location, so either the women's games could have been viewed by the IOC as a potential popularity/legitimacy booster for the Olympics, or it could be viewed as a competitor for scarce resources and thus an existential threat.
That's my pitch for some obscure sports history. And if you want to do further reading on this or any such topic, I strongly recommend complementing your learning with cited edits to Wikipedia -- that's how I was able to type almost all of the above (and on much much more) from memory still, even though my edits to these topics come from several years ago. Protip: in most cases don't engage in arguments on the site -- just walk away.
RussGOATWilson t1_jb69ltj wrote
This is interesting, but can you do the same thing for prior records? E.g. Michael Johnson's 400m record z-score compared to van Niekerk's?
GinandTonicandLime t1_jb6f83q wrote
Hopefully in before legions of Australians ask about Don Bradman
Karnex97 t1_jbxltjg wrote
Hey OP, can you do the same thing but only count results set after 1990? The reason is that before the 90s doping was far less strictly policed. Of course there will always be some doped times on the list even after 1990, but the percentage would probably drop significantly.
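A rough sketch of what that filtering step could look like, assuming you have full result lists with a year attached rather than only one all-time PR per athlete. The function name `best_marks_since`, the `(athlete, mark, year)` tuple layout, and the sample rows are all hypothetical, made up for illustration rather than taken from the Toplists.

```python
def best_marks_since(results, first_year, lower_is_better=True):
    """Return each athlete's best mark among results from first_year onward."""
    pick = min if lower_is_better else max
    bests = {}
    for athlete, mark, year in results:
        if year < first_year:
            continue  # drop anything set before the cutoff
        # Keep only the athlete's single best mark, as in the original sample.
        bests[athlete] = pick(mark, bests.get(athlete, mark))
    return bests


# Hypothetical rows: (athlete, time in seconds, year). Not real data.
results = [
    ("Athlete A", 9.90, 1988),
    ("Athlete A", 9.95, 1993),
    ("Athlete B", 10.02, 1989),
    ("Athlete C", 9.85, 1996),
    ("Athlete C", 9.80, 1999),
]

post_1990 = best_marks_since(results, first_year=1991)
print(post_1990)
```

The values of the resulting dictionary could then be used as the sample for the z-score, so each athlete still counts once but only with a mark set after the cutoff.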