Reports | May 14, 2009 22:38

On the increase of the K-factor - Part II

The hot debate around the K-factor last week attracted many interesting responses. Well-known rating experts such as Jeff Sonas, Ken Thompson and Hans Arild Runde all contributed insightful points. The discussion ended with what ChessBase called Dr. John Nunn’s ‘final installment’, which is unfortunate, because the discussion is just getting interesting!

By Daan Zult

In science, of course, there are no ‘final installments’, but more importantly, neither side in this debate has yet provided decisive arguments. I was happy to see that Nunn expressed his ideas more clearly this time, because his first contribution was somewhat confusing. He reformulated his earlier, inaccurate argument concerning the frequency of rating lists, and he also explained more clearly why he has problems with an increase in the K-factor and with Sonas’ analysis.

A strong argument from Nunn concerning Sonas’ analysis is that Sonas uses a different model to calculate the expected score: a linear function instead of one based on the normal probability distribution. Sonas’ optimal K-value of 24 therefore belongs to a different statistical model altogether. Even though Nunn is completely correct in this respect, I think it’s harsh to conclude that Sonas’ analysis therefore has no relevance. The reason is a bit technical, so bear with me (this is the hardest part), but here’s why.

On a small scale Elo’s formula is almost linear. This is important, because most players who play against each other have a small rating disparity. Therefore, for most games, the expected score based on either Elo’s formula or a linear function will hardly differ. And so, the linearity of Sonas’ formula does not explain why he finds that a K-factor of 24 predicts results so much better than a K-factor of 10.
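To make this concrete, here is a minimal sketch in Python. It is not Sonas’ actual formula (I am not reproducing his coefficients here); it simply compares Elo’s normal-distribution expectation with a straight-line approximation around a zero rating difference, using Elo’s assumed 200-point standard deviation per player.

```python
# Minimal sketch, not Sonas' actual formula: compare Elo's normal-distribution
# expected score with a straight-line approximation around a zero rating
# difference. SPREAD = 200*sqrt(2) is the standard deviation of the difference
# of two performances under Elo's 200-point-per-player assumption.
from math import erf, sqrt, pi

SPREAD = 200 * sqrt(2)

def elo_expected(diff):
    """Expected score for the player who is `diff` points higher rated."""
    return 0.5 * (1 + erf(diff / (SPREAD * sqrt(2))))

def linear_expected(diff):
    """Tangent-line approximation of the same curve at diff = 0."""
    slope = 1 / (SPREAD * sqrt(2 * pi))  # density of the normal curve at 0
    return min(1.0, max(0.0, 0.5 + slope * diff))

for d in (0, 25, 50, 100, 200, 400):
    print(f"diff {d:3d}: normal {elo_expected(d):.3f}  linear {linear_expected(d):.3f}")
```

For rating differences up to about 100 points the two expectations agree to within a few thousandths of a point; only for the large gaps that are comparatively rare in practice do they diverge noticeably.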

White scores better
A second important difference between the Sonas and Elo formulas (not explicitly mentioned by Nunn) is that in Sonas’ model the expected outcome of a game is different for Black and White. This can be understood fairly easily. In the Sonas model, when two players with the same rating compete, the white player has an expected score of 0.541767, instead of the 0.5 in the current model. Sonas’ expected score simply follows from the data of real chess games, where White on average scores 54%. In this case it makes perfect sense to say that Sonas’ model provides better predictions, whether the K-factor is 10 or 24, since his model simply uses more information than the current Elo rating model does. Personally, I think the question of whether we want to rate Black and White games differently is more of a political than a statistical choice.
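By way of illustration (and not necessarily the way Sonas constructs it), one could encode the White advantage as a fixed rating bonus added to White’s rating before the usual expectation is computed. Under the normal model sketched above, a bonus of roughly 30 points already gives White an expectation of about 0.54 against an equally rated opponent; the bonus value here is an assumption chosen for the example.

```python
# Hypothetical illustration, not Sonas' actual construction: model White's
# first-move advantage as a fixed rating bonus added to White's rating
# before applying the usual Elo expectation.
from math import erf, sqrt

SPREAD = 200 * sqrt(2)   # as in the previous sketch
WHITE_BONUS = 30         # assumed value, chosen to give White roughly 54%

def elo_expected(diff):
    return 0.5 * (1 + erf(diff / (SPREAD * sqrt(2))))

def expected_for_white(rating_white, rating_black):
    return elo_expected(rating_white + WHITE_BONUS - rating_black)

print(round(expected_for_white(2500, 2500), 3))  # ~0.542 instead of 0.500
```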


Stats of the MegaBase + TWIC (4,171,030 games)

The two points I addressed above, concerning the linearity and the difference in expected score between Black and White, are basically about the expected outcome of a single game between two players with a certain rating difference. Both Elo’s and Sonas’ formulas provide an expected score for one game. However, we should bear in mind that the K-factor is not related to the expected outcome of a game, but to the underlying dynamics of chess skill. In that respect there isn’t a big difference between Sonas’ and Elo’s formulas: both supply a number for the expected outcome of a game, and neither directly affects the dynamics. The difference between the two formulas therefore cannot fully explain why, under Sonas’ formula, the (much more dynamic) K=24 predicts results so much better than K=10. I consider it very likely that if we used Elo’s formula and searched for the optimal K-factor, it too would turn out to be larger than 10.
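To make that separation concrete, here is a minimal sketch of the standard rating update step. Whichever formula supplies the expected score, it is the K-factor alone that determines how far a single result moves a rating.

```python
# Minimal sketch of the standard rating update: the expectation formula
# (Elo's or Sonas') only supplies `expected`; the K-factor scales how strongly
# a single result moves the rating, i.e. the dynamics.
def update(rating, expected, score, k):
    return rating + k * (score - expected)

# A 2500-rated player beats an equally rated opponent (expected score 0.5):
for k in (10, 24):
    print(f"K = {k:2d}: new rating {update(2500, 0.5, 1.0, k):.1f}")
# K = 10 gives 2505.0, K = 24 gives 2512.0: same expectation, different dynamics.
```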


GM John Nunn

Cheating argument
Wrapping up Nunn’s arguments: he further writes that he considers the ‘cheating argument’ his most important contribution to the discussion about raising the K-factor. Nunn states that cheating becomes easier and more attractive with an increased K-factor. This is true, of course, because with a higher K-factor you can simply win more points by cheating! At the same time, I can’t help wondering: if this is such a strong argument against raising the K-factor, then why not decrease it? Moreover, it does not counter Macieja’s argument that increasing the number of rating lists makes it harder to gain rating, thereby already making cheating less attractive.

Also, a legitimate question is whether we should use ratings to fight cheaters at all. Originally, cheating played no part in rating considerations, and I personally think it should stay that way. Since no rating or ranking system in the world is able to prevent cheating, it is strange to let it affect the accuracy of the rating model. In my opinion this is the job of the respective governing bodies, not of the rating system.

Ken Thompson
This brings us to the other reactions on the ChessBase website. Another famous contributor to the debate, computer expert Ken Thompson, also opposes an increase of the K-factor. He states that the only reason to increase the K-factor is to allow the ratings of rising stars to increase faster. This makes sense under the presumption that a chess player has some sort of ‘true chess skill’ that does not differ much from one week to the next, but that can suddenly change within a short period, only to become stable again.

However, there are two problems with this argument. First of all, there is no conclusive scientific proof that chess players actually develop in jumps of skill (it might well be true, but maybe it isn’t). In fact, developmental psychologists are still gathering evidence that can be interpreted in various ways. Secondly, this argument ignores the possibility of ‘temporary shape’, and the question of whether we want ratings to express that temporary shape.

Next, Thompson criticizes Sonas’ analysis by stating that the successful model predictions for the results of grandmaster Bu Xiangzhi are the result of “cherry picking”: that is, an attempt to find an example for whom the model works and then use it as proof, while it may simply be a coincidence. I don’t agree with Thompson here, because Sonas does not present the Bu Xiangzhi case as his main proof; Bu Xiangzhi is just an illustration of the quality of the model. The real proof Sonas provides concerns an analysis of the full population of chess players.

Thompson also states that an increase in the K-factor introduces the risk of inflation or deflation. Well, it’s obvious that rating systems can be vulnerable to inflation or deflation, but I don’t see why this risk is particularly pressing for a higher K-factor. To me, it seems that for any value of the K-factor (except zero) there is a risk of inflation or deflation; a higher K-factor only magnifies the effect when it occurs. According to the FIDE handbook (article 12), this is carefully monitored.

All in all, Nunn and Thompson give some decent arguments against a rise of the K-factor, but in my opinion none of them closes the debate. Their arguments show that we should be cautious in interpreting Sonas’ analysis, but they do not show that the current K-factor value of 10 is better than any other value. I’d say the only argument that stands is that we have been using this value for a long time and that, so far, it has served us well.


Hans Arild Runde

The real problem
In conclusion, I would like to point out the very insightful contribution by Hans Arild Runde, who runs the live ratings website. In my opinion Runde managed to pinpoint the real problem of choosing the right K-factor, which is at heart a philosophical one. What do we want ratings to be, anyway? Do we want ratings to predict “immediate” or “future” results? Do we want to know who is the best player at this very moment, or who will perform best in the coming year?

This choice has implications for the K-factor. A high K-factor will produce rating lists that indicate who is in good shape right now, while a low K-factor produces more conservative rating lists, whose rankings are likely to hold for longer periods. This is more of a political than an empirical question. Sonas’ analysis focuses on predicting immediate results, so it seems that if we want to predict immediate results better, we need to increase the K-factor. However, Sonas’ predictions do not consider results that lie further in the future, which might lead to a different optimal value of the K-factor.

In the end, it seems that before we can decide (with the use of empirical research) what the optimal K-factor is, we first need to decide what we want ratings to be.



Daan Zult is a PhD student at the University of Amsterdam, currently pursuing research into the thinking of chess players. ChessVibes thanks Kung-Ming Tiong, Assistant Professor of Mathematics at the University of Nottingham, Malaysia Campus for providing insights and feedback.

Comments

Peter Doggers:

do explain, hendrik

frogbert:

"If I understand everything correctly, that way we have a simple empirical answer, avoiding philosophical and political debates."

You might have an answer - the question remains, though, did you ask the right question?

When choosing a "method" for a scientific question, the first and most important question to ask, IMO, is: what do I really want to know - what am I trying to find out? And when changing something, it's good to have a clear idea of what's "wrong" (or sub-optimal) in the first place, and what one wants to achieve by making a change. Often it's good to consider what one would like to remain "unchanged" (or unaffected by the change) afterwards, too - i.e. one should consider side-effects.

Regarding the question of choosing K, you could for instance ask some 1000 "randomly" chosen members of FIDE (players like you and me) what K they would like to have. It would be a quantitative method, it would provide a simple answer and would come with as much reasoning for it being the "right" thing to do as the exercise suggested above. That is, none.

R.Mutt:

Can't somebody just take some big database, and see which k-factor best predicts performances over the two months after a rating list, using the Elo formula? If I understand everything correctly, that way we have a simple empirical answer, avoiding philosophical and political debates.

hendrik:

not to good article

Arne Moll:

Apart from the K-factor issue, one thing I don't understand about the inflation argument is what's so bad about inflation or deflation anyway, except that perhaps it just 'looks ugly'?

For instance, on ICC, both deflation and inflation can be seen in the two types of blitz ratings that you get: for the games where you do not choose your own opponent, colour or time control (and where a disconnect results in a loss), ratings are on average lower (deflated) than one's FIDE or official national rating. For the games where you can choose opponent, colour and clock settings, ratings are much higher (inflated). But despite how they 'look', these ratings are still pretty accurate in the sense that the best players have the highest ratings.
So what's the problem with this inflation/deflation anyway?

henk:

@Arne: The problem might be that ratings are not only used for predicting or measuring results, but also to award FIDE-titles.

Arne Moll:

Good point, henk, and one I haven't seen (if I'm correct) in the discussion yet. FIDE wants to keep the ratings on the same level because of their dependency on norms and titles, but isn't that an awful reason? It sounds extremely artificial to me.

Daan Zult:

@frogbert

Thanks frogbert, you express what I try to make clear.

@R.Mutt

I think the first question should be: What do we want ratings to express?

When the answer is:
"we want ratings to predict game results in the coming two months best"
Then your suggestion is correct.

But what if the answer is:
"we want ratings to predict the next game result best"

or
"we want ratings to predict who will be the best on average for the coming two years"

or
"we want ratings to be random numbers"

or
"we want ratings to stimulate agressive chess and distimulate cheating"

or
"we want ratings to make as many players happy as possible"

etc....

Some of these answers are of course a bit strange, but I don't think there is consensus on what the answer to this question should be. All these different answers lead to different conclusions about what rating model we should use, and therefore also to a different value of K.

Thorn:

@Arne: I think you misunderstand the term 'inflation'. Inflation does not mean that one rating system is higher on average (like USCF is higher than FIDE, or FIDE is higher than DWZ), but that one rating system's average rises over time. So players gain rating points without actually playing stronger than before.

That might not be a problem if it's happening slowly, but if a rating average rises at a higher rate it leads to uncertainty.

By the way: I believe that ICC blitz ratings are not only higher on average than other rating pools, but that they are also inflating, thus getting even higher on average.

Cheers,
Thorn

Elz:

My opinion is that ratings should better express the current level of a player. If a player is in great form, it means he also performs better, and his rating should express that. If you look at the rating of a player, what you want to see is how strongly he performed the last time he played in a tournament, so his ELO should converge faster to his performance rating. But with a higher K-factor comes a bigger rating lag. That should be compensated for with more published rating lists per year.
So in conclusion: if ratings were updated after every tournament (or at least every month), the accuracy of ratings would be higher. This would be the first step. Next we can choose a higher K-factor in order to make ratings better reflect the current strength of a player.

4i4mitko:

why don't they pay some mathematician to solve the problem? this is quite boring

Elz:

And rating inflation/deflation is not influenced by the ELO system itself. What one player gains, the opponent loses, so no points are added to or removed from the system.
When an unrated player enters a tournament, he should get as entry rating the average rating of all rated players. If his entry rating is lower than that average you get deflation; if it's higher you get inflation.

I might be wrong of course :-)

Arne Moll:

@Thorn, I do understand that inflation is not only about higher ratings but about average ratings becoming higher over time (this is in fact also the case on ICC, where the best players' peak blitz ratings were around 2900 ten years ago and are now almost 1000 points higher). You say it leads to 'uncertainty', but what do you mean by this? Surely it doesn't mean one can't predict one's rating anymore?

R.Mutt:

@frogbert, daan.

Sorry fellas, but you guys just think too much. You might sprain something. ;-)

It strikes me as rather obvious that if you publish a rating list, you want those ratings to predict as accurately as possible how a player will perform until the next list is published (in our case in two months). That leaves you with a relatively simple empirical problem.

frogbert:

@Elz
"My oppinion is that ratings should express better the current level of a player. "

Why? And what do you mean by "current"?

"If you look at the rating of a player what you want to see is how strongly he performed last time he played in a tournament therefore his ELO should converge faster to his performance rating."

If you REALLY want the rating to say how strongly he performed (only) in his/her last event, why do you even need the ELO to converge to the performance rating over time? You could just say that the player's rating IS the performance rating of his/her previous event - boom, just like that.

Or more generally, regarding "current form/level", let's consider Sonas' measure of "the best" rating formula - which he argues is the one with the best prediction strength. Chessmetrics has been optimized to have maximum prediction strength for ONE month into the "future":

I propose the following thought experiment: Imagine that you have some hypothetical measuring device, with the capability of measuring "rating strength" so that it predicts the outcome PERFECTLY for the next 14 days, very well for the next month, decently for a period of 3 months, somewhat sloppily for a period of 6 months and rather unreliably for a period of 1 year.

Imagine that all GMs were measured with this device (let's say it did the reading by making an electromagnetic brain scan, just for the sake of the argument) once every 14 days - or at the first day of every tournament the GMs started in - and that this reading was published twice every month as the official FIDE rating/ranking list. Let's call the new system Brainmetrics.

With rating periods of 2 weeks, the hypothetical Brainmetrics system would be PERFECT according to Sonas' metric of having the lowest discrepancy between predicted score and actual score, and it would "perform much, much better" than Chessmetrics - in fact, the difference in prediction strength between the current ELO system and Chessmetrics would be ignorable compared to the huge difference between ELO/Chessmetrics and Brainmetrics.

What would Brainmetrics rating lists look like? Well, let's look at the rating list of say October 14th 2004 - it would have had the following entries (amongst others):

Jobava Baadur Georgia 2842
Anand Viswanathan India 2824
Kaidanov Gregory S United States 2763
Degraeve Jean-Marc France 2711
Nguyen Anh Dung Vietnam 2692
Rodriguez Andres Uruguay 2672
Grischuk Alexander Russia 2662
Shirov Alexei Spain 2646
Ponomariov Ruslan Ukraine 2626
Nisipeanu Liviu-Dieter Romania 2545
Short Nigel D England 2412

Degraeve of France and Nguyen of Vietnam would probably be top 50 in that list, while Ponomariov would be out of top 100 for sure, and Short would hardly be among the top 1500 players in the world.

And in the list of April 14th 2009, we would have for instance

Akopian Vladimir Armenia 2781
Bacrot Etienne France 2751
Ivanchuk Vassily Ukraine 2670

just to give a more recent example.

[The observant reader obviously sees that if performance ratings could've been produced without the need for any history at all - Chessmetrics uses 12 months history iirc - then Brainmetrics would have actually been easily implementable, as 2 weeks performance ratings...]

I hope that this thought experiment can shed some light on why "best possible prediction" probably shouldn't be the ONLY goal of a rating system. As I said above, Chessmetrics has been tuned for best predictional strength one (1) month into the future (based on a history of 12 months) - while Brainmetrics did this job perfectly for 14 days into the "future".

Hence, to support the claim that Chessmetrics works better than the FIDE system, I think that Sonas needs to broaden his argument. And Chessmetrics DOES have other features than simply being tuned for prediction strength.

In general, leaving how a chess rating system should work to any mathematician or statistician without a thorough prior discussion about what the goals of such a system should be doesn't really make too much sense to me. For example, one effect it might have is the creation of conceptually MEANINGLESS notions such as "correcting for inflation", with the seductive implication that this somehow makes ratings of players from different eras comparable - it clearly doesn't. :)

frogbert:

@ Mutt

"Sorry fellas, but you guys just think too much. "

I'm not sure about that, but I do try to think. :)

Ludo Tolhuizen:

Like any mathematician (and statistician) probably would, I agree with Frogbert:
it only makes sense to try to determine the "best" K-factor if " best" is properly defined.

Clearly, a larger K-factor implies bigger rating changes. Hence, it gets easier to satisfy, if only once, the rating requirements for obtaining titles => extra FIDE income. Or am I too cynical here?

jussu:

Elz,

Isn't an unrated player highly likely to be among the weakest in the field (in about one hour I am going to attempt to prove otherwise but that does not matter)? This way, your proposal would automatically introduce huge rating inflation. One way to assign a rating to a newcomer is to ignore games played against the unrated player when adjusting his opponents' rating, and to assign the newcomer a new rating equal to his performance rating - _after_ the tournament. I think this or a similar system is used.
