Now I’m sure Mr Elo was a very nice man, but me and he are going to fall out. Well, not so much me and him as me and the rating system he left behind. More accurately still, me and how that system is working. Because it ain’t.
At all.
Not where I play.
Here’s a word we didn’t used to have when I was a boy: ‘passionate’. Well, we had it, but not in the way we do now. People didn’t used to bang on about it all the time. How they are passionate about this, that or the other.
Passion is everywhere now. Having it is virtually an obligation of citizenship. Or employment, at least.
A year ago this week several papers ran a story on the handful of jobs available at a new branch of a chain coffee shop in Nottingham attracting 1,701 applicants. “... my passion for coffee just shone through”, one of the successful eight felt obliged to say. “I have so much passion for coffee it’s unbelievable”, claimed another. To be fair, they'd hardly have got the gig had they said otherwise, would they? Twelve months on, I wonder if they, their employer or anybody else has given any thought to the question of what the fuck somebody who actually was genuinely passionate about coffee would be doing working at Costa.
Passionate? I’m passionate about not starving to death and not being permanently excluded from a consumer economy merely by accident of coming of working age just as a recession hit. Infinitely reasonable and entirely honest, but the interview technique of somebody destined to find themselves in the group of 1,693, I feel.
It really moved me
Fit for purpose. Or, more frequently, unfit for purpose. That’s another one. Nobody used to say that either. In my day we made do with ‘doesn’t work properly’ and - call me reluctant to move with the times if you will - I still find that particular form of words to be perfectly serviceable.
So let’s just come out and say it: I don’t think the Elo system works properly. Not where I play, amongst the opponents that I face, at least. And it’s not broken on some temporary basis that will be sorted out by a self-correcting mechanism; it’s permanently and irretrievably frigged.
I shall expound on this at some length in due course. For now, I’m kind of curious: does anybody agree with this thesis at all?
Aye
19 comments:
You haven't presented a thesis yet, other than "it doesn't work". Actually the ELO system works very well at determining the relative strength of players and estimating win likelihoods. Perhaps you're expecting too much?
Feel free to substitute ‘conclusion’ for ‘thesis’ if you prefer, anonymous.
"Actually the ELO system works very well at determining the relative strength of players”
Depends on which group of players we are comparing I suppose, but I disagree.
You may be correct in your final statement, however.
Well, the only reason I can see why Elo might not work for certain groups of players is that they don't play enough rated games and the k-factor is too low to account for relatively quick changes in playing strength.
I think it would be desirable if all national ratings were replaced by Elo. My personal beef with this situation is that it seems entirely realistic that I might reach the playing strength of a FIDE Master without getting the title, as my Elo lags further and further behind my national rating.
Phille
Well, the only reason I can see why Elo might not work for certain groups of players is that they don't play enough rated games
Well, do pop back over the next week or two, Phille. I’ll show you a couple of things that may lead you to change your mind.
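Since the k-factor is doing the work in Phille's argument, here is a minimal sketch in Python of the standard Elo update, just to make the point concrete. The expected-score and update formulas are the usual published ones; the ratings and K values below are invented for illustration. K is simply the multiplier on the gap between actual and expected score, so with a small K a genuinely improving player's published rating trails their real strength for a long time.

```python
# Minimal sketch of the standard Elo update, illustrating what the k-factor does.
# Expected score E = 1 / (1 + 10^((Rb - Ra) / 400)); new rating R' = R + K * (S - E).
# The example numbers below are invented.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, score: float, k: float) -> float:
    """New rating for A after scoring `score` (1 = win, 0.5 = draw, 0 = loss) against B."""
    return r_a + k * (score - expected_score(r_a, r_b))

if __name__ == "__main__":
    # An under-rated, fast-improving player beats a 2000-rated opponent.
    for k in (10, 20, 40):
        print(k, round(update(1800, 2000, 1.0, k), 1))
    # Prints roughly 1807.6, 1815.2 and 1830.4: quadrupling K quadruples the speed
    # at which the published rating chases the player's real strength.
```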
Presumably by Elo you really mean the international ratings computed by FIDE. For at least some players, these are now badly out of line with the English domestic grades, which are computed on different principles. Many countries, Scotland, Ireland and Wales included, compute domestic Elo ratings. These use the same principles as Elo's original formulation, but with some additional hacks and modifications. Scotland in particular will deal with the improving-junior problem by resetting the rating of anyone playing 200 Elo points better than their previous rating.
So Elo rating is really just a method, and the FIDE international list is just one example. In the UK, England anyway, we do have a particular problem: a few years ago the ECF flattered our strength by adding around 25 ECF points to players at the average level, moving them from around 115 to around 140. Also, the method now used for junior players consistently overstates the grades of the most active ones, although not in a way that boosts the grades of adult players.
The main problem is that only a subset of the games played by English players and graded by the ECF features in the international list. That's unlikely to change, for any number of reasons.
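The Scottish reset rule mentioned above isn't spelled out in any detail here, so the following Python sketch is purely hypothetical: it assumes the rule simply replaces the published rating with the recent performance figure whenever the gap reaches 200 points, which is the only number taken from the comment. The real mechanism may well differ.

```python
# Hypothetical sketch of the kind of junior reset rule described above: if a player's
# recent performance is 200+ Elo points above their published rating, replace the
# rating outright instead of letting the normal K-factor update crawl towards it.
# Only the 200-point threshold comes from the comment; the rest is illustration.

RESET_THRESHOLD = 200

def maybe_reset(published_rating: float, recent_performance: float) -> float:
    """Return the rating to carry forward to the next list."""
    if recent_performance - published_rating >= RESET_THRESHOLD:
        return recent_performance   # hard reset for a fast improver
    return published_rating         # otherwise leave it to the ordinary update

# e.g. a junior listed at 1500 who has just performed at 1750 would be relisted at 1750.
print(maybe_reset(1500.0, 1750.0))
```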
The problem, as I see it, with the FIDE rating system (national Elo systems may have additional rules that avoid this problem) is that it has an inbuilt zero-sum assumption. That is to say, it assumes the players in the system are, as a group, not gaining or losing total strength.
This wasn't a terrible assumption when the rating floor was 2200 - most players of this strength or higher are changing in strength relatively slowly. As the rating floor has gone down, so has the validity of this assumption.
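The zero-sum point is easy to verify from the update formula itself: when both players are on the same K, whatever one gains the other loses, so the total number of points in the pool is fixed and any genuine improvement has to be paid for out of somebody else's rating. A quick check in Python, with invented ratings:

```python
# Quick check of the zero-sum property: with a common K, the points one player gains
# are exactly the points the other loses (bar floating-point rounding), so the pool
# total never changes. The ratings are invented.

def expected(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def game(r_a: float, r_b: float, score_a: float, k: float = 20.0):
    new_a = r_a + k * (score_a - expected(r_a, r_b))
    new_b = r_b + k * ((1.0 - score_a) - expected(r_b, r_a))
    return new_a, new_b

a, b = 1850.0, 2050.0
new_a, new_b = game(a, b, 1.0)   # the lower-rated player wins
print(a + b, new_a + new_b)      # both totals come to 3900: points are only redistributed
```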
In the UK, England anyway, we do have a particular problem: a few years ago the ECF flattered our strength by adding around 25 ECF points to players at the average level
This is one of the things people say to explain the breakdown in the ‘conversion’ link between ECF grades and Elo ratings - i.e. it’s not that Elo is too low but that ECF is too high. This is something I intend to come back to in a while ... but suffice it to say I’m not convinced by this argument.
Jonathan - based on looking at the available data, the ECF to FIDE conversion (or a slightly modified version) still seems to work OK on average - is your contention that there is a specific subgroup of players for which it works significantly less well?
Also in corporate-coffee-bollocks news
@Matt,
the ECF to FIDE conversion (or a slightly modified version) still seems to work OK on average - is your contention that there is a specific subgroup of players for which it works significantly less well?
Yes. (Although I’m not 100% convinced that it’s working that well in general either).
The ECF ratings are a joke too. There's one person within a handful of points of me. If I played him in a match and won 8-2 I would consider that a terrible result. And no, I'm not improving.
@Jonathan
Any particular analysis you've seen / would like to see on whether it's working or not?
@Matt Fletcher: I realise your offer was made to Jonathan and this might be a bit hard to do, but how about a comparison of players' ratings from some years ago (before FIDE dropped the rating floor to whatever it currently is) with their ratings now? I'm wondering whether there's a large group of players who were previously rated up to, say, 2200, and who now have a much lower rating.
@AngusF I can have a go and see what I can do - do you know off the top of your head what date the lower limit changed?
The rating floor dropped from 2200 to 2000 in the early 1990s. At the same time this harmonised men and women: previously the women's cut-off was 1900, and it was raised to 2000 by adding 100 points to every woman except Susan Polgar.
This was before the 4NCL had even started, so when the adult players in the 4NCL were 160-plus, they almost all got ratings, and ratings above 2000. The plan to extend ratings much lower came several years later, probably in 1998 or 2000. They extended the list downwards with caution, so it's only recently that it has reached the ultimate floor of 1000. The difference now, as compared to when the cut-off was 2000, is that losses to lower-rated players are now likely to count. This hits both directly and indirectly, as you no longer get the odd potentially easy points (from a 180's perspective) against a career 160 with a 2100 FIDE rating.
RdC
Jack Rudd wrote: "[Elo] assumes the players in the system are, as a group, not gaining or losing total strength."
This is correct. Or, to express it differently, the problem is that chessplayers WANT the rating system to reflect strength on some cardinal scale, instead of the ordinal scale which it is.
No amount of statistical tinkering (k-factor or whatnot) will ever fix the fact that young players come in with low initial rating, improve rapidly as they gain experience, and suck the rating points from non-improving adult players.
The most practical fix would be to give ratings to a pool of fixed software on fixed hardware, then periodically enter these known checkpoints into rated tournaments for ratings correction purposes.
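That deflation mechanism is simple enough to demonstrate with a toy simulation. The Python below uses invented numbers, a single improving junior, a closed pool of adults whose true strength never changes, and no draws; it is an illustration of the argument above, not anybody's actual rating pool.

```python
# Toy simulation of the deflation mechanism described above: one genuinely improving
# junior in a closed pool of adults of constant true strength. Results are decided by
# true strength; ratings are updated with the usual zero-sum Elo formula. All numbers
# are invented.

import random

random.seed(0)

def expected(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

K = 20.0
adults = [1900.0] * 10      # published ratings; true strength stays at 1900
junior_rating = 1400.0      # published rating
junior_true = 1400.0        # true strength, rising every season

for season in range(20):
    junior_true += 50                                          # the junior really does improve
    for i in range(len(adults)):
        win = random.random() < expected(junior_true, 1900.0)  # result decided by true strength
        delta = K * ((1.0 if win else 0.0) - expected(junior_rating, adults[i]))
        junior_rating += delta
        adults[i] -= delta                                     # zero-sum: the adult pays for it

print(round(junior_rating), round(sum(adults) / len(adults)))
# Typically prints a junior rating well short of their true 2400, and an average adult
# rating noticeably below the 1900 the adults are still actually worth.
```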
Oh, there are statistical tinkerings you can do which will fix that: you could, for example, put in a rule that juniors count as unrated for the purposes of adults playing them. (This might or might not be a good rule, mind you, but it's certainly possible to do this.)
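Mechanically that is trivial to do, though it means deliberately abandoning the zero-sum property, since the points the junior wins no longer come out of an adult's rating. Here is a sketch of one possible reading of the suggestion, with invented numbers; the real rule could of course be specified in other ways.

```python
# Sketch of one reading of the rule suggested above: for the adult's rating the junior
# is treated as unrated, i.e. the adult's number is left untouched, while the junior's
# rating is updated as normal. Note this deliberately breaks the zero-sum property:
# the junior's gains are no longer taken from the adult pool.

def expected(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_pair(adult: float, junior: float, adult_score: float,
                k: float = 20.0, junior_exempt: bool = True):
    """Return (new_adult, new_junior); with junior_exempt the adult's rating is frozen."""
    new_junior = junior + k * ((1.0 - adult_score) - expected(junior, adult))
    if junior_exempt:
        new_adult = adult   # as if the adult had played an unrated opponent
    else:
        new_adult = adult + k * (adult_score - expected(adult, junior))
    return new_adult, new_junior

# e.g. a 2000-rated adult loses to a 1600-rated junior: the junior gains, the adult is untouched.
print(update_pair(2000.0, 1600.0, 0.0))
```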
Funny the conversation should take this turn. Do pop back tomorrow.
My friend Mr Hogg, unaccustomedly gruntled, made several comments:
1. Elo schmelo! Sydney chess expert and math guru Roger Cook operated a rating system suspiciously similar to, and prior to, the good prof Elo's efforts. It was reported in days of yore in sumpin' called CHESS, of Sutton Coldfield, he opined.
Stigler's Law in operation?
Dare one use the plagiarism word in the absence of Mr Keene?
2. ELO is a heuristic, i.e. a rule of thumb, with a very feeble connection, if any, to maths or statistics. Glicko has better credentials: it is currently used by the Australian Chess Federation, I believe.
3. Prof Elo's magnum opus was published in that august peer-reviewed journal, The Journal of Gerontology. Stop laughing, I am serious! And of course you will ask why not in a serious math or stats journal? See point 2 above.
4. Those redoubtable Kaggle boys in Melbourne, now relocated to Silicon Valley, ran two contests to find a good rating system for chess, and offered nice prizes too. ELO methods, IIRC, finished waaaaaay down the list as a useful method.
5. ELOs are published without any indication of their statistical accuracy: what is a measurable difference? Does a, say, 2700 player have a measurably different performance from a 2710 player? A 2720 player? If not, then how can "ratings" be allocated down to one point?
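For what it's worth, the Elo model itself puts a number on that last question: a 10-point gap corresponds to only about half a percentage point of expected score, which gives a feel for how little a single rating point is claiming to measure. A quick check in Python:

```python
# Expected score of the higher-rated player under the Elo model, for small rating gaps.
def expected(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

for gap in (10, 20):
    print(gap, round(expected(2700 + gap, 2700), 3))
# Prints 10 0.514 and 20 0.529: a 2710 player is expected to score barely more than
# 51% against a 2700, a difference you could never detect from a handful of games.
```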