Reports | May 04, 2009 3:02

On the increase of the K-factor (UPDATE)

Since FIDE announced its own concerns about the planned increase of the K-factor, there is suddenly a hot debate around the subject. It seems as if supporters like GM Bartlomiej Macieja and opponents like GM John Nunn both have strong arguments. This might be confusing for anyone who has never really given the subject any thought, but even if you have, it is not surprising. In a guest post, statistician Daan Zult sheds some light on the discussion. Updated with another Macieja piece.

By Daan Zult

Last week FIDE expressed its concerns about the effects of the new K-factor, which will be in effect as of July 1st, and decided to publish two parallel rating lists for a year and then review the results of the different K-factors. GM Bartlomiej Macieja then made a strong appeal not to delay the decision but to increase the K-factor immediately. Subsequently FIDE sent a reaction, which was followed by another reply by Macieja, all to be found in this article. Meanwhile, GM John Nunn expressed a different point of view, saying "there seems no real evidence that K=20 will result in a more accurate rating system, while there are a number of risks and disadvantages."

I have worked with Elo’s model as a statistician for psychological purposes, and I have been confronted with choosing a correct K-factor many times. The choice is always one between accuracy and adjustment speed: when you give the K-factor a low value, ratings change slowly and fluctuate less, and when you give it a high value, ratings adjust faster and fluctuate more. A low value can be a good thing if a chess player stops making progress in real chess strength. For instance, every chess player probably knows a player who was quite talented as a junior, but for some reason stopped making progress at a certain point. A stop in progress should result in a rating that stabilizes around a certain point, not in a rating that fluctuates wildly. The choice of the K-factor is therefore both a choice between two evils and between two goods, like a glass that is half full or half empty. To shed some light on the current discussion, I will discuss some arguments that I encountered recently.


GM John Nunn

The most recent contribution to the discussion was provided by GM Dr. John Nunn on the ChessBase website. Nunn wonders where the proof is that a K-factor of 24, compared to the current value of 10, will result in ratings that more accurately predict players' future results. To me, his question comes as a bit of a surprise, because the original proposer of a K-factor of 24, Jeff Sonas, supports his claim with empirical evidence in an article that also appeared on ChessBase, in 2002. Sonas looked at the results of 266,000 games and compared the real results with the results predicted by Elo ratings. He shows that under the current value of the K-factor the predictions do not exactly match the real game outcomes. For example, according to Elo’s formula, a player who has 150 rating points more than his opponent has an expected score of 70%. Sonas shows that whenever a player has 150 points more than his opponent, he only scores 68% instead of 70%. He also shows that with a K-factor of 24, the 266,000 game results would have been predicted better than they actually were. In general Sonas's argument is strong, because it is based not only on theoretical but also on empirical grounds.
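Elo's expected-score formula is easy to check in code. The following sketch (plain Python, function name mine) computes the expected score for a given rating difference; Sonas's 150-point example comes out at roughly 70%.

```python
def expected_score(rating_diff):
    """Elo expected score for a player who outrates
    the opponent by rating_diff points."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# A 150-point favourite is expected to score about 70% per game,
# while Sonas's data shows such players actually scoring about 68%.
print(round(expected_score(150), 3))  # 0.703
print(round(expected_score(0), 3))    # 0.5 for equal ratings
```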

A second objection given by Nunn concerns an argument introduced by GM Bartlomiej Macieja, who states that an increase in the K-factor is a logical consequence of the fact that FIDE now produces rating lists more frequently. Nunn, on the other hand, states that the K-factor and the frequency of rating lists are unrelated to each other. This statement is more or less correct, but he supports it with a wrong argument. Nunn states that when you play 40 games in six months, there should be no difference in ratings whether FIDE publishes one rating list every six months or one every day. He would be right if ratings were updated after every game, but in the current setup of the FIDE rating system his statement is incorrect. In article 14.4 of the FIDE regulations we can read that FIDE does not update ratings on a game-by-game basis, but after a series of games played within a rating period.

FIDE states that it realizes that updating after every game would make ratings more accurate, but does not do so because it is more labor intensive. Another good reason not to update ratings after every game without also publishing a new rating list after every game is that rating changes would then be based on virtual ratings. This is a problem, because players could then no longer calculate their new rating after a tournament: their opponents might have a different virtual rating than the one on the published rating list.

Another problem with updating ratings after every game concerns closed tournaments. When organisers want to organise a tournament of a certain category, or a tournament where an IM or GM norm can be scored, it is important that the ratings of the invited players are fixed and known. Thus, there are enough reasons not to update ratings after every game, even though this would make them more accurate. But to get back on topic, let us illustrate the fallacy in Nunn's argument with a simple example:

Let’s say we have two players: player A with a rating of 2000 and player B with a rating of 2100. Let’s say they play 40 games in 6 months, and suppose every game ends in a draw. Now let us compare what happens when we recalculate ratings after every game versus after all 40 games. First, when ratings are updated after every game, player A will gain points and player B will lose points, and after a sufficient number of games both their ratings will approach 2050. However, when ratings are calculated only after all 40 games (using a K-factor of 20), the new rating of player A would be:

new rating = 2000 + 40 * 20 * (0.5 - 0.36) = 2112

where 2000 is the current rating, 40 the number of games, 20 the K-factor, 0.5 the actual score per game, and 0.36 the expected score per game according to Elo’s formula.

We see that something strange has happened. A player with a rating of 2000, who played 40 draws against someone rated 2100, now has a rating of 2112!? Updating after every game clearly results in different (and more sensible) ratings than a single update every six months.
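The two updating schemes are easy to simulate. Below is a minimal sketch under the article's assumptions (K = 20, a 2000- and a 2100-rated player drawing 40 consecutive games); the helper function is mine.

```python
def expected_score(diff):
    # Elo expected score for a rating advantage of `diff` points
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

K = 20
ra, rb = 2000.0, 2100.0

# Scheme 1: update ratings after every single game.
a, b = ra, rb
for _ in range(40):
    ea = expected_score(a - b)
    a += K * (0.5 - ea)        # A drew, but was expected to score less
    b += K * (0.5 - (1 - ea))  # B drew, but was expected to score more
print(round(a), round(b))  # ratings have moved most of the way towards 2050

# Scheme 2: one batch update after all 40 games, with the expected
# score frozen at the pre-series ratings (as FIDE does per period).
ea = expected_score(ra - rb)        # about 0.36
a_batch = ra + 40 * K * (0.5 - ea)
print(round(a_batch))               # 2112, the article's odd result
```

The batch update overshoots because it charges player B the full 100-point difference for all 40 games, even though the information in the first draws should already have narrowed the gap.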


GM Bartlomiej Macieja

However, this peculiarity will not disappear when we change the value of the K-factor, so in that respect Nunn is right when he says that the K-factor is unrelated to the number of rating lists. What Macieja refers to, however, is the fact that when rating lists appear more frequently, it is harder to gain rating points. This is also true: when your rating is increasing and it is updated more frequently, you reach a higher rating sooner, so you need to perform even better to gain further rating points. What Macieja does not mention, though, is that more frequent rating lists also make ratings more volatile. When ratings are updated every six months, a player has the opportunity to compensate for a bad tournament with a normal one later on, so his rating might not change much, while under frequent updating a bad tournament will immediately result in a low rating that might not be representative of his true skill. This means that more frequent rating lists can also be used as a theoretical argument to support a decrease in the K-factor.

As we can see, the theoretical discussion does not give a clear-cut answer. In such situations it is common to let the empirical evidence speak. So far the only empirical work on the issue has been provided by Jeff Sonas, and it provides a solid base for the discussion. He clearly showed that a K-factor of 24 would result in ratings with more accurate predictive power. However, his calculations concern a large population of chess players. It seems to me that the ratings of top players deserve special attention, because they have the largest personal interest in accurate ratings and they represent the chess world as a sport. Since it is quite likely that at the top level chess skill changes more slowly, an increase in the K-factor will probably lead to less accurate predictions for the results of top players. Whether this is true cannot be answered without empirical investigation, which so far has not been conducted.

On a personal note I would like to add that a different solution to this issue is to apply a completely new rating model. Elo’s model was developed in 1960, at a time when calculations had to be performed without a computer. Elo’s model is therefore both elegant and primitive. Nowadays more complex and accurate models are available. An important characteristic of these more advanced models is that they determine K-factors at the individual level. Since the availability of computers is no longer a problem and the accuracy of ratings has become more important, the use of more advanced rating models might well be worth considering.
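To illustrate the individual-K idea in its simplest form (this is my own toy sketch, not any specific published model such as Glicko): let each player's K-factor decay with the number of rated games played, so newcomers adjust quickly while established players keep stable ratings.

```python
import math

def individual_k(games_played, k_max=40.0, k_min=10.0, tau=30.0):
    """Toy per-player K-factor that decays smoothly from k_max for a
    brand-new player towards k_min for a veteran. All constants here
    are illustrative, not taken from any official rating system."""
    return k_min + (k_max - k_min) * math.exp(-games_played / tau)

print(round(individual_k(0)))    # 40 for a brand-new player
print(round(individual_k(150)))  # about 10 for an established player
```

More sophisticated models such as Glicko go a step further and track an explicit uncertainty per player, but the underlying intuition is the same.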

Daan Zult is a PhD student at the University of Amsterdam, currently pursuing research into the thinking of chess players.


Meanwhile we also received another piece by Macieja:

The K-factor - here comes the proof!

I couldn't believe my eyes when I read GM John Nunn's opinion: "The K-factor and the frequency of rating lists are unrelated to one another. Rating change depends on the number of games you have played. If you have played 40 games in 6 months, it doesn't make any difference whether FIDE publishes one rating list at the end of six months or one every day; you've still played the same number of games and the change in your rating should be the same."

It does make a significant difference how often rating lists are published. To understand this effect, it is enough to imagine a player rated 2500 playing one tournament a month. With 2 rating lists published yearly, if he wins 10 points in every tournament, his rating after half a year will be 2500+6*10=2560. If rating lists are published 4 times a year, after 3 months his rating becomes 2500+3*10=2530, so it gets more difficult for him to gain rating points in further tournaments. After 3 more tournaments the player reaches a final rating of only about 2500+3*10+3*6=2548. With 6 rating lists published yearly, the final rating of the player (after half a year) is only about 2500+2*10+2*7+2*5=2544. Obviously this is only an approximation and the exact values may differ slightly, but the effect is clear. The rating change, contrary to GM John Nunn's opinion, is not the same. And that is what I meant by: "The higher frequency of publishing rating lists reduces the effective value of the K-factor, thus the value of the K-factor needs to be increased in order not to make significant changes in the whole rating system."
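The three schedules are pure arithmetic and easy to verify. The snippet below simply replays Macieja's numbers; the per-tournament gains of 10, 6, 7 and 5 points are his approximations, not derived from first principles here.

```python
# Final rating of a 2500 player after six tournaments, under
# Macieja's approximate per-tournament gains for each schedule.
two_lists  = 2500 + 6 * 10                  # rating frozen at 2500 for six months
four_lists = 2500 + 3 * 10 + 3 * 6          # gains shrink after the quarterly update
six_lists  = 2500 + 2 * 10 + 2 * 7 + 2 * 5  # gains shrink after every second tournament

print(two_lists, four_lists, six_lists)  # 2560 2548 2544
```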

There are many possible ways to establish the correct value of the K-factor. The following approach certainly deserves attention:
Let's imagine 2 players with different initial ratings, say 2500 and 2600, achieving exactly the same results against exactly the same opponents for a year. The main idea of the Elo system is that if two players participate in tournaments and show exactly the same results, their ratings should become the same. You can also think of this as "forgetting about very old results". Please note that this is by no means the same approach as used in many other sports, for instance tennis. In the Elo system, if a player doesn't participate in tournaments, his rating doesn't change (I don't want to discuss now whether that is correct or not). But if he does, there is no reason why his rating should differ from the rating of another player achieving exactly the same results against exactly the same opponents.

With one rating list published yearly, as FIDE initially did, a value of at least K=700/N was needed to reach this goal, where N is the number of rated games played per year. As the majority of professional players play more than 70 rated games per year, the value K=10 would do the job. However, with more rating lists published yearly, the initially higher rated player will always keep a higher rating than his initially lower rated colleague (if both achieve exactly the same results), unless the K-factor is extremely high. For this reason it is better to ask: which value of the K-factor will reduce the initial rating difference by a factor of 100 (for instance from 100 points to only 1 point) in a year?
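The K = 700/N rule can be verified numerically. In the sketch below I assume, for illustration, that both players face a common opponent rated midway between them at 2550; the rating gap then shrinks by roughly K times 0.143 points per game, which gives K ≈ 700/N to close a 100-point gap with one annual update.

```python
def expected_score(diff):
    # Elo expected score for a rating advantage of `diff` points
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# Per game against a common 2550-rated opponent, the 2600 player is
# expected to score this much more than the 2500 player:
gap_closed_per_game = expected_score(2600 - 2550) - expected_score(2500 - 2550)
print(round(gap_closed_per_game, 3))  # 0.143

# K needed to wipe out the 100-point gap in N games with one list per year:
for n in (70, 80, 100):
    print(n, round(100 / (n * gap_closed_per_game), 1))  # close to 700/n
```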

In a good approximation, the answer is K = (m*700/N)*[1 - 0.01^(1/m)], where m is the number of lists published per year. For N=80 (the suggestion of GM John Nunn), we get: if m=2 -> K should be 16, if m=4 -> K should be 24, if m=6 -> K should be 28. Otherwise, an initially higher rated player may still have a higher rating a year later even if he has achieved worse results than an initially lower rated player. That would not only be strange but also unfair, as for many competitions, including the World Championship cycle, participants qualify by rating.
Please note that if N is lower, the K-factor should be even bigger.
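Macieja's formula is straightforward to evaluate. The sketch below reproduces his table for N=80; the two-year figures quoted further on are matched if one assumes 6 lists per year (my inference, since the article does not state this explicitly).

```python
def k_needed(lists_per_year, games_per_year, years=1, reduction=0.01):
    """K-factor that shrinks an initial rating gap by `reduction`
    (default 100-fold, e.g. 100 points down to 1) over the period,
    per Macieja's approximation K = (m*700/N) * (1 - 0.01**(1/m))."""
    m = lists_per_year * years  # total number of rating lists
    n = games_per_year * years  # total number of rated games
    return (m * 700.0 / n) * (1.0 - reduction ** (1.0 / m))

for m in (2, 4, 6):
    print(m, round(k_needed(m, 80)))    # 16, 24, 28 as in the article

# Two-year variants, assuming 6 lists per year:
print(round(k_needed(6, 80, years=2)))  # 17
print(round(k_needed(6, 70, years=2)))  # 19
```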

Some people suggest that 12 months of identical results may still not be enough to consider 2 players equally strong (or, to be more precise, to have their initial rating difference reduced by a factor of 100). Let's calculate which value of the K-factor will reduce the initial rating difference by a factor of 100 in 2 years. For N=80 (160 games in 2 years) we get K=17, for N=70 (140 games in 2 years) we get K=19.

I believe that a sound judgement can be made from the last 100 games (which is even more than Professor Elo recommended). That means that the value of the K-factor accepted in Dresden during the General Assembly (K=20) was a wise choice.

Best regards
Bartlomiej Macieja
3rd of May 2009

Author: Editors
Chess.com

Comments

antonio:

I have not studied the Elo system in any detail, but after reading this nice article it seems to me that the K-factor required to match the predictive power of a player's actual Elo depends on the periodicity of the rating-list updates, and consequently there is some relation between the optimal K-factor and this periodicity. Also, I don't know, but it seems crucial to know how the updating of the rating lists was assumed to be done in Sonas's empirical study. A different updating periodicity would, I suppose, give a different optimal K-factor. It was certainly taken into consideration in the study, but some differences between the actual results and the expectation could be attributed to fluctuations, in which case no change of K-factor would be meaningful.

Congratulations to the Chessvibes editors for bringing us all the most important events and news on the chess scene.

half-dinosaur:

The last paragraph is noteworthy: it is indeed high time that someone at FIDE sat down, did some real research, and developed a new rating model. Unfortunately, with the kings and dinosaurs there, this is not going to happen any time soon (a bit like FIFA refusing to introduce video technology)... but we can dream, right?

Ricardo:

Jeff Sonas updated his rating lists monthly, i.e. 12 updates per year, covering 1850 until 2004 (as far as I know he discontinued his work once Kasparov retired).

In any case, the method he used to calculate his rating lists is far more complex than just adjusting the K-factor and deals with other problems like ELO inflation, for instance. You can read about it at http://db.chessmetrics.com/CM2/Formulas.asp

Thomas:

Totoy, sorry for not replying earlier - I got deeply involved in another thread :) Also, it is difficult to say what you mean by "huge". But maybe some statistics will help: the Wikipedia page cited by Kashmir has the formula to calculate the "expected score" for a given rating difference.

If my calculations are correct ... : Taking Karjakin-So (110 points rating difference), the stronger player should - as a long-term average - score 65%. In other words, in a single game a draw is the most likely result. But Karjakin would underperform (and/or So would overperform) if he fails to win a 4-game match where he should score 2.5 (theoretically 2.6) points.

Between two players, both are possible: a bad day (or week) for one or a good one for the other. If So scored more than 35% against ten different players with an overall average Elo of 2720, there are three possibilities:
1) All (or at least several) opponents show poor form
2) He is in excellent form
3) He is worth more than his present rating

Only at a rating difference of 200 points does a win become slightly more probable than a draw in a single game; the stronger player should then score 76%. Taking this as a definition, presently only Topalov has a 'hugely' higher rating than So, but even he should, statistically speaking, draw two games or lose one in a four-game match.

Totoy Bato:

Will somebody enlighten me as to what a huge rating disparity is? Thomas failed to answer my query.

kashmir:

Can someone please explain the statistical mechanisms behind inflation and deflation? Are there any numerical methods to determine how big they are, or which subgroups are more sensitive to them? How are they defined?

There's not much on the web regarding this. I read http://en.wikipedia.org/wiki/Elo_rating_system and some other articles, but I found it rather confusing... Thanks!

Aleksander:

Kashmir: on Macieja's site you can find an article on the subject.

EJ:

Nice article!

kashmir:

Aleksander, I cannot find that article on Macieja’s site, can you please post a link here?

Bert de Bruut:

Never mingle football/soccer with chess, dino! Unlike FIDE, FIFA would be silly to introduce new technologies, for what would we have to complain about after the match, when all the arbiter's decisions are perfect?
