Preliminary note:
This article was originally written in German. The English translation is also by the author.
The quality of the translation cannot be guaranteed.
Update from December 2023:
New input field: 'minimum evaluation > 0 for White win = 100%
for calculating the WDL analyses #(.##)':
With the Stockfish engine this value should empirically be around 2.35; it is described in detail in the notes on the form below.
Notes on the form:
Not all 13 input fields have to be filled in. If the programme is missing information, various error messages are
displayed in red and flashing.
Alternative move evaluation(s) for ⊙ White ⊚ Black:
Important for analysis of the high and suboptimal move evaluation.
High 'move evaluation (-##)#(.##)':
Between -999.99 and 999.99.
Suboptimal 'move evaluation (-##)#(.##)':
Between -999.99 and 999.99. If 'White' is selected, this value must be smaller than the high move
evaluation; if 'Black' is selected, it must be numerically higher, i.e. further to the right on the horizontal x-coordinate axis.
2 x 3 win/draw/loss percent values, 'move number for calculating the Stockfish WDL statistic':
The percent values between 0 and 100 that follow each of the two move evaluations are not required if
the 'move number' field contains a number of 1 or higher. If the programme then does not find valid percent
values, it calculates the 3 win/draw/loss percent values automatically. If no move number is
entered, 2 percent values are sufficient and the programme calculates the third. Percent values that
exceed 100 in total or lead to an evaluation relevance outside the range from 0 to 1 are not accepted
by the programme.
'minimum evaluation > 0 for White win = 100% for calculating the WDL analyses #(.##)':
With the Stockfish engine this value should empirically be around 2.35. From this
evaluation upwards, positions are statistically considered 100% won, and from the negative of this
evaluation downwards 100% lost. Almost all WDL analyses in the programme are based on this value and gain
significantly in precision from it.
'evaluation at 0.75-game-res.-probab. (e=0.75) (##)#(.##)' (abbreviated
'e=0.75'):
This is the evaluation from White's point of view on the horizontal x-coordinate axis ('x') at which
the average probabilistic game result is 0.75(:0.25) in favour of White. In the initial version of this
article it was referred to as 'win draw balance'. At this evaluation, the evaluation relevance on the vertical
y-coordinate axis is 0.5.
'evaluat. at 0.75+-game-res.-prob. (e>0.75) > e=0.75 (##)#(.##)' (abbreviated
'e>0.75'):
This is an evaluation from White's point of view on the horizontal x-coordinate axis ('x') at which
the average probabilistic game result is higher than 0.75(:0.25) in favour of White. This evaluation is
higher than the previous one. Compared to the initial form, it is a new parameter that helps to
achieve additional precision. It belongs together with the last parameter below. The evaluation relevance
on the vertical y-coordinate axis is less than 0.5 there.
'1.00 > 0.75-plus-game-result-probability > 0.75 0.#(####)' = 1 - (r>0.75 / 2):
This is the average probabilistic game result in favour of White on the vertical y-coordinate
axis ('y') at the previously entered evaluation 'e>0.75'. This result
lies above 0.75(:0.25) and is likewise a new parameter compared to the initial form, which helps
to achieve additional precision.
Displaying the results requires permission to execute JavaScript code in the browser.
[Result tables of the interactive form; their cells are filled in by JavaScript and are empty here.]
Four tables of position evaluation symbols and thresholds are displayed. Each lists the symbols by colour for the high and the suboptimal evaluation (User ERR and Engine WDL ERR in each case) and the thresholds for User ERR and Engine WDL ERR:
User/Engine WDL ERR with threshold adjustment to identical position evaluation sectors, 9 sectors of 1/9 each;
the same with 7 sectors of 1/7 each;
User ERR with threshold adjustment to probabilistic game results and Engine WDL ERR with threshold adjustment to both evaluations, 9 sectors;
the same with 7 sectors.
The 9-sector variant lists the thresholds from clear/extreme advantage for White (+- ⇒ ++-) through balance/slight advantage for White (= ⇒ ⩲) and balance/slight advantage for Black (⩱ ⇐ =) to clear/extreme advantage for Black (--+ ⇐ -+); the 7-sector variant omits the two clear/extreme rows.
Move and position evaluations together with NAG and Informator symbols
Chess players usually assess moves and positions on the board using symbols such as the following:
brilliant move (‼) - NAG $3,
impressive move (!) - NAG $1,
attractive move (!?) - NAG $5,
questionable move (?!) - NAG $6,
weak move (?) - NAG $2,
miserable move (??) - NAG $4,
balanced position or draw (=) - NAG $10,
slight advantage for White (⩲ or +/=) - NAG $14,
slight advantage for Black (⩱ or =/+) - NAG $15,
moderate advantage for White (± or +/-) - NAG $16,
moderate advantage for Black (∓ or -/+) - NAG $17,
clear advantage for White (+-) - NAG $18,
clear advantage for Black (-+) - NAG $19,
extreme advantage for White (++-) - NAG $20,
extreme advantage for Black (--+) - NAG $21.
In addition, the unclear position (∝) - NAG $13 should be mentioned. Strictly speaking it does not
belong here, because it merely states that a position evaluation is (supposedly) not possible.
Reservation: the above descriptions of all these symbols are the author's own and of course not
binding. More information can be found
here.
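The symbol list above can be kept as a small lookup table when annotating games automatically. The following is a minimal sketch in Python; the names `MOVE_NAGS`, `POSITION_NAGS` and `annotate` are hypothetical and not part of any programme mentioned in this article, and '!!' is written with two ASCII characters instead of '‼'.

```python
# Symbol-to-NAG lookup, copied from the list above.
MOVE_NAGS = {"!!": 3, "!": 1, "!?": 5, "?!": 6, "?": 2, "??": 4}
POSITION_NAGS = {
    "=": 10, "∝": 13,
    "⩲": 14, "⩱": 15,
    "±": 16, "∓": 17,
    "+-": 18, "-+": 19,
    "++-": 20, "--+": 21,
}


def annotate(san_move: str, symbol: str) -> str:
    """Append the PGN NAG token for `symbol` to a move given in SAN."""
    table = {**MOVE_NAGS, **POSITION_NAGS}
    return f"{san_move} ${table[symbol]}"
```

For example, `annotate("Nf3", "!")` yields `Nf3 $1`, the form in which a PGN file stores the annotation.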
Such move and position assessments are quite practical: they take up little space and reveal an
evaluation range at a glance. The only question is how such evaluations come about. By rule of
thumb? Or is there a more accurate way? It would be a real advance if they were defined by
chess program evaluations in pawn units, with which chess engines numerically express
positional imbalances, that is, positional advantages or disadvantages. But where are such
definitions to come from? From which engine evaluation can one speak, for example,
of a slight advantage for White: from 0.10 pawn units or from 0.20, quite apart from the tendency of individual
engines to over- or understate their evaluations? And how can an
objective scale be found here?
In the further course of this article, several mathematically derived proposals with
corresponding formulas are presented. Before that, however, various statistical and
mathematical foundations have to be worked out.
Blunder relevance or, more politely, evaluation relevance
Chess programs usually rate positions in hundredths of a pawn unit, and comparing the
variations the number cruncher outputs for a position reveals the evaluation difference, or margin of
error, between the best and an inferior variation.
But how relevant are faulty moves and their evaluation actually? Example: in a lost position,
after losing the queen without compensation, one additionally gives away another
piece for no reason. The chess program will acknowledge this mishap with a much higher evaluation in favour of
the opponent. But how relevant is such a difference between the new and the previous position
evaluation in a game that is practically already lost? Objectively, in other words leaving aside subjective
mistakes by the opponent, not at all! In all probability the bungler would not have been
able to save the game with best play on both sides even without the latest faulty move.
To take it to the extreme: what is the threshold at which a game can objectively be considered won
or lost? It depends. One could say ironically: the bigger the bungler, the higher the threshold. The higher the evaluation, the sooner
one can count on the advantage no longer being squandered, and one may place more
confidence in today's chess programs than in Homo sapiens. And if you are dealing with a
potential patzer, you should not throw in the towel prematurely in an apparently lost position,
as Kasparov did in the second match against Deep Blue in 1997.
Computer chess statistics
So what to do? One takes leave of human bungler chess and turns to the strongest chess engines. In principle,
two paths are now open:
Engine WDL ERR:
Since mid-2020, Stockfish has provided win/draw/loss ("WDL" for win-draw-loss) assessment ratios alongside the actual
evaluations. In the words of the Stockfish development team:
'UCI_ShowWDL
If enabled, show approximate WDL statistics as part of the engine output. These WDL numbers model expected game outcomes for a
given evaluation and game ply for engine self-play at fishtest LTC conditions (60+0.6s per game).'
These WDL statistics or probabilities take the course of the game into account insofar as they depend on the half-move (ply) being evaluated.
The formulas on which they are based can be found in the Stockfish program code
('win rate model'). The
use of these statistics need not be limited to game analyses made with Stockfish, because this engine is the
ultimate in positional analysis and therefore sets the standard for evaluation.
The engine WDL statistics allow the evaluation relevancies and differences, optimum
rates, probabilistic game results, and move and position evaluation symbols including thresholds discussed in this article to be derived in a
similar way as in the User ERR presented below. Their real highlight, however, is that the average values resulting from them (cf. 'evaluation comparison user/engine
ERR', line 3, columns 3 and 6 in the programme) can provide valuable clues for adjusting the parameters of the User ERR.
By the way, here is a little programme trick: entering '0' (zero) in the two move evaluation fields and in the 'half-move' field
clears the programme's internal memory for these two average values, which are otherwise retained when the parameters are loaded and
saved.
Experiments after the Stockfish update of 22 June 2023 suggest that the win/loss probability reaches 100% at
Stockfish evaluations of approximately ±2.35. This value was used as the absolute relevance limit when calculating the WDL
values in the form above. Values beyond it play no role for relevant WDL evaluation differences, WDL move evaluation
symbols, WDL position evaluation symbols, etc. Furthermore, it was found experimentally that at evaluations of about 0.98 the
probabilistic game result for White is about 0.75, and at evaluations of about 1.18 it is
about 0.875.
One small limitation should not go unmentioned. The automatic determination of the win, draw and loss percent values for an
evaluation by entering the half-move leads to results that differ slightly from those calculated by Stockfish itself. The
Stockfish code in the function 'win_rate_model' produces implausible results when used outside of Stockfish. The variable 'v' there,
which is related to the evaluation, must be multiplied by a factor that is not stated there. A few comparisons of the percentages coming from
the 'win_rate_model' code with the values produced directly by Stockfish suggest that this factor should be around 328. This
number 328 appears explicitly as 'NormalizeToPawnValue' elsewhere in the Stockfish code. The calculation parameters contained in
the latest update of the 'win_rate_model' of 22nd June 2023 produce results that can be reasonably reconciled with
the values produced by the Stockfish engine if the factor is raised to 330.3.
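How an evaluation in pawn units is turned into win, draw and loss shares can be sketched as follows. This is not the Stockfish code: it only assumes a logistic curve of the general kind used by a win rate model, and the parameters `a` and `b` are freely chosen placeholders, whereas Stockfish derives its own values from the half-move number. Only the scaling factor of about 328 is taken from the text above; the clamp to ±4000 internal units is an assumption that keeps the exponential from overflowing.

```python
import math


def wdl_from_eval(eval_pawns: float, a: float, b: float, scale: float = 328.0):
    """Return (win, draw, loss) shares for an evaluation in pawn units.

    a, b  -- placeholder curve parameters (NOT Stockfish's coefficients)
    scale -- factor from pawn units to internal units, about 328 per the text
    """
    x = eval_pawns * scale
    x = max(-4000.0, min(4000.0, x))  # assumed clamp, avoids overflow in exp()
    win = 1.0 / (1.0 + math.exp((a - x) / b))
    loss = 1.0 / (1.0 + math.exp((a + x) / b))
    draw = 1.0 - win - loss
    return win, draw, loss


def expected_result(win: float, draw: float, loss: float) -> float:
    """Probabilistic game result from White's point of view."""
    return win + draw / 2.0
```

With the placeholder values `a=328.0, b=60.0`, an evaluation of 0.00 gives equal win and loss shares and therefore an expected result of 0.5, and an evaluation of 1.00 gives a win share of exactly 0.5.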
User ERR:
The traditional variant presented in this article is to analyse engine games by asking at what evaluation these programmes
won their games, or did not. The most meaningful games can probably be found on the Internet under
'TCEC' ('Top Chess Engine Championship') in the
'Superfinals'. Reasons: long thinking times, the opponents were in each case the two apparently best chess engines, and all position
evaluations can be followed move by move.
From a statistical point of view, is there a kind of 'point of no return', an evaluation (apart from a
concrete mate announcement, of course) from which the win is beyond doubt and simplification into a draw
can no longer occur? Theoretically no. The following table shows that in various TCEC Superfinals chess engines
failed to convert evaluations of up to 5.01 into wins. And nobody can
say where the absolute limit for such evaluation errors lies (best play assumed in the following
moves), since nobody can determine this limit with an infinite number of test games.
Even if such outliers happen very rarely, they rule out equating any evaluation (even one of 5.01, as
shown) with a win or loss. In other words, there is no 'point of no return' generated by an evaluation.
Now one must turn to the question of at which evaluations particular average game results are located. Of
particular interest are evaluations at which, once reached, the average result of all games concerned
amounts to 0.75 (from White's point of view). Such a value can result from an equal number of wins and
draws, or from a number of losses and three times as many wins. For the sake of completeness, losses are also
mentioned here, although they rarely occur once this special balance evaluation is reached.
For clarification, first the results of the Superfinals from Season 9 onwards in tabular form, as well as
those of the FIDE Candidates' Tournament 2018 with the evaluations of Stockfish 8 at 30 seconds thinking time, to be
found on 'www.chessbomb.com'.
Superfinals of Seasons 9 to 14 (season | analysis engine | wins | two evaluation figures in the order given; their column assignment is not stated):
9 | Stockfish | 16 | 1.75 | 0.62
10 | Houdini | 15 | 2.00 | 0.66
12 | Stockfish | 29 | 1.48 | 0.52
13 | Stockfish | 16 | 2.79 | 1.14
14 | Stockfish | 10 | 2.42 | 1.45

tournament | analysis engine | wins | evaluation e=0.75 at average game result 0.75(:0.25) | maximum evaluation e>0.75 without win | average game result at maximum evaluation e>0.75 without win | alternative pair of values: evaluation e>0.75 / average game result ≅ 0.875
FIDE Candidates' Tournament 2018 | Stockfish 8 | 20 | 0.67 | 16.68 | 0.9762 | 2.39 / 0.8800
Superfinal TCEC Nr. 16 | Stockfish 19092522 | 14 | 1.24 | 3.33 | 0.9667 | 1.65 / 0.8684
Superfinal TCEC Nr. 16 | AllieStein v0.5-dev_7b41f8c-n11 | 5 | 3.96 | 8.18 | 0.9167 | 8.03 / 0.8571
Superfinal TCEC Nr. 17 | LCZero v0.24-sv-t60-3010 | 17 | 1.34 | 5.01 | 0.9722 | 1.89 / 0.8810
Superfinal TCEC Nr. 17 | Stockfish 20200407DC | 12 | 1.49 | 2.76 | 0.9615 | 1.89 / 0.8750
Superfinal TCEC Nr. 18 | Stockfish 202006170741 | 23 | 0.87 | 3.74 | 0.9792 | 1.41 / 0.8710
Superfinal TCEC Nr. 18 | LCZero v0.25.1-svjio-t60-3972-mlh | 16 | 0.69 | 2.12 | 0.9706 | 1.57 / 0.8636
The above analysis, explained using the example of Superfinal No. 17 and the winning engine LCZero
v0.24-sv-t60-3010:
LCZero won 17 games, so 83 games ended in draws or losses for LCZero. Among these 83 games one looks for the 17th-highest
evaluation that LCZero displayed in its own favour, that is, a positive evaluation that could
not be converted into a win. Counting down the 17 highest such evaluations, the lowest of them is 1.26. There are therefore 17
draws or losses in each of which an evaluation of at least 1.26 was reached. In other words:
LCZero reached an evaluation of at least 1.26 in 34 games, of which 17 were won and 17 were drawn or lost.
But there is a small complication in these numbers: LCZero had to concede defeat in the 16th game
although it had already displayed an evaluation of 1.89, which lies above the previously determined
evaluation threshold of 1.26. Because of this '0' result, the actual results do not give an average game
result of 0.75, because it amounts to
$$ \frac{17 \cdot 1 + 16 \cdot 0.5 + 1 \cdot 0}{34} \approx 0.7353 $$
instead of 0.75. If the real numbers are stubborn, mathematics must intervene. The formula for the average
game results y between 0.5 and 0.75 on the y-coordinate axis is a linear function of the evaluation x and is
$$ y = 0.5 + \frac{x}{4 \cdot e_{=0.75}} $$
What is sought is the ominous 0.75 game result evaluation (abbreviated 'e=0.75'). So the formula must be
rearranged:
$$ e_{=0.75} = \frac{x}{4 \cdot (y - 0.5)} $$
In the present LCZero case the calculation is therefore:
$$ e_{=0.75} = \frac{1.26}{4 \cdot (0.7353 - 0.5)} \approx 1.34 $$
The result lies slightly above the actual threshold of 1.26, which was to be expected.
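The counting procedure and the subsequent correction can be sketched in a few lines of Python. The sketch assumes the linear relation y = 0.5 + x / (4 · e=0.75), which follows from a relevance that falls linearly from 1 at the evaluation 0.00 to 0.5 at e=0.75; the function names are hypothetical.

```python
def nth_highest(peak_evals_not_won, wins):
    """Threshold: the `wins`-th highest evaluation among the games not won."""
    return sorted(peak_evals_not_won, reverse=True)[wins - 1]


def average_result(wins, draws, losses):
    """Average game result of all games in which the threshold was reached."""
    return (wins + 0.5 * draws) / (wins + draws + losses)


def corrected_e075(threshold_eval, avg_result):
    """Invert y = 0.5 + x / (4 * e075) for e075 (assumed linear relation)."""
    return threshold_eval / (4.0 * (avg_result - 0.5))


# LCZero in Superfinal 17: 17 wins, 16 draws and 1 loss at a threshold of 1.26
avg = average_result(17, 16, 1)
e075 = corrected_e075(1.26, avg)
print(round(avg, 4), round(e075, 2))  # 0.7353 1.34
```

The value of 1.34 agrees with the e=0.75 listed for LCZero in the table above.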
In September 2017 the engine Houdini 6 was released. You can read the following on this
website:
'The evaluations have again been calibrated to correlate directly with the win expectancy in the
position. A +1.00 pawn advantage gives a 75% chance of winning the game against an equal opponent
at blitz time control. At +1.50 the engine will win 90% of the time, and at +2.50 about 99% of
the time. To win nearly 50% of the time, you need and advantage of about +0.60 pawn.'
Houdini kept its word. In the TCEC Superfinal of Season 10 against Komodo, Houdini scored 15 wins, and in the
15 draws or losses with Houdini's highest evaluations, the minimum evaluation was 0.57. An almost perfect
landing.
The above table allows the cautious conclusion that the Stockfish versions used since TCEC
Superfinal 13 give significantly higher evaluations than their predecessors. One thing should not be
overlooked when interpreting these results: Stockfish 10 was given a 'contempt' of 0.24 (Stockfish 9: 0.20),
which should raise the respective evaluation. It therefore seems reasonable to subtract this contempt margin from
the evaluation thresholds listed in the table for one's own analysis purposes. One tip, however:
analyses with Stockfish should only be carried out with 'contempt' switched off so as not to
drive up the evaluations artificially.
Finally, it should be noted that the TCEC website has recently been extended with win and draw probabilities and
places the e=0.75 for the engine Stockfish at around 1.56 (Superfinal 17) or even 1.91 (Superfinal 18). In view
of the previous table these are quite plausible values. It is problematic, however, that only percentages for 'W'
(win?) and 'D' (draw?, i.e. 100% minus the 'W' percentage) are given, while the loss probability is swept under the
carpet. The TCEC e=0.75 of 1.56 assumed above therefore necessarily rests on the assumption that 'D' also includes the
probability of loss.
Mathematical evaluation relevance reduction
Let's note: as the evaluation moves from 0.00 towards infinity (∞), its relevance decreases continuously.
Starting at 100% for an evaluation of 0.00, it passes 50% at the e=0.75 evaluation (the TCEC value of 1.56 is
assumed below as an example) and ends at 0% at infinity.
One example: the evaluation for the best move is 2.00. Now a mishap happens: a faulty move that loses a piece
and is evaluated at -3.00. The absolute evaluation difference is -5.00. How relevant is this loss of a
piece? Obviously less than -5.00.
In detail:
between the evaluations 2.00 and 1.56 the relevance, still below 50%, grows continuously;
at 1.56 it should amount to 50%, for this is the mean value between 100% and 0%; furthermore, the
probabilistic game result of 0.75 at the evaluation 1.56 is the mean value between 0.5 at the 0.00 evaluation
and 1 at a maximum engine evaluation;
at 0.00 the relevance reaches its maximum value of 100%;
at -1.56 it is back at 50% and
at -3.00 it ends with a value well below 50%.
The sum of these percentages would now be of interest. It can be calculated, but this is somewhat complicated.
The mathematically adept will long since have recognized that this up and down has to be expressed by a
mathematical function for which the following applies: the further one moves away from the y-axis on either side,
the smaller the ordinates become, i.e. the evaluation relevance values at these points on the x-axis, until
they finally approach the x-axis asymptotically at infinity on both sides. The x-axis thus represents the
evaluations (given by an engine), the y-axis the evaluation relevance values.
At this point the first version of this article proposed an exponential function of the general form f(x) = a^(x*b).
Such exponential functions have the advantage that they always pass through the point P(0;1) and
approach the x-axis at (positive) infinity. Their disadvantage, however, is
that they can only be fitted to 2 points, the already mentioned point P(0;1) and the point P(e=0.75;0.5). A
further defining point P(e>0.75;r>0.75) is urgently needed for better precision, for example in order to
capture the highest TCEC engine evaluations without a win and the corresponding game results, which are far above
0.75.
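Solution: 3 equations for 3 negative and 3 positive sectors along the x-axis (x stands for the engine evaluation, y for the evaluation relevance):
1st positive and negative sector:
linear equation with 𝔻 {x | -e=0.75 ≤ x ≤ e=0.75}
$$ y = 1 - \frac{|x|}{2 \cdot e_{=0.75}} $$
2nd positive and negative sector:
linear equation with 𝔻 {x | -e>0.75 ≤ x ≤ -e=0.75 or e=0.75 ≤ x ≤ e>0.75}
$$ y = 0.5 - (0.5 - r_{>0.75}) \cdot \frac{|x| - e_{=0.75}}{e_{>0.75} - e_{=0.75}} $$
3rd positive and negative sector:
exponential equation with 𝔻: {x | -∞ < x ≤ -e>0.75 or e>0.75 ≤ x < ∞}
$$ y = r_{>0.75}^{\,|x| / e_{>0.75}} $$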
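The three sectors can be written as one piecewise function. The sketch below is a reconstruction in Python and not the code of the interactive form: it assumes a linear fall from 1 to 0.5 in the 1st sector, a linear fall from 0.5 to r>0.75 in the 2nd sector and the exponential r>0.75^(|x| / e>0.75) in the 3rd sector, which is consistent with the numerical example given further below. The parameter names `e075`, `e_hi` and `r_hi` stand for e=0.75, e>0.75 and r>0.75.

```python
def relevance(x: float, e075: float, e_hi: float, r_hi: float) -> float:
    """Evaluation relevance (0..1) of an engine evaluation x in pawn units."""
    ax = abs(x)  # the curve is symmetric about the y-axis
    if ax <= e075:
        # 1st sector: from 1.0 at x = 0 down to 0.5 at e=0.75
        return 1.0 - ax / (2.0 * e075)
    if ax <= e_hi:
        # 2nd sector: from 0.5 at e=0.75 down to r>0.75 at e>0.75
        return 0.5 - (0.5 - r_hi) * (ax - e075) / (e_hi - e075)
    # 3rd sector: exponential decay towards 0
    return r_hi ** (ax / e_hi)
```

With e=0.75 = 2, e>0.75 = 3 and r>0.75 = 0.3 the function passes through the three defining points P(0;1), P(2;0.5) and P(3;0.3).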
The evaluation relevance functions are now set. But how is the truly relevant evaluation difference calculated
over a certain distance on the x-axis, for example between 2.00 and -3.00? The evaluation relevance function
only returns the y-value at a single point on the x-axis. The answer is as ingenious as it is simple: by
integration. The sum of all values between the x-axis and the function curve, i.e. the area there, between
the best evaluation (for example 2.00) and the inferior evaluation (for example -3.00) is the definite
integral of this function, and this is the relevant evaluation difference.
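To calculate the integral, the antiderivatives of the evaluation relevance functions are required. An integral that spans several sectors is the sum of the definite integrals over the individual sectors. The antiderivatives are
as follows:
1st positive and negative sector:
quadratic equation with 𝔻 {x | -e=0.75 ≤ x ≤ e=0.75}
$$ F(x) = x - \frac{x \cdot |x|}{4 \cdot e_{=0.75}} $$
2nd positive and negative sector:
quadratic equation with 𝔻 {x | -e>0.75 ≤ x ≤ -e=0.75 or e=0.75 ≤ x ≤ e>0.75}
$$ F(x) = 0.5 \cdot x - \operatorname{sgn}(x) \cdot \frac{(0.5 - r_{>0.75}) \cdot (|x| - e_{=0.75})^2}{2 \cdot (e_{>0.75} - e_{=0.75})} $$
3rd positive sector:
exponential equation with 𝔻 {x | e>0.75 ≤ x < ∞}
$$ F(x) = \frac{e_{>0.75}}{\ln(r_{>0.75})} \cdot r_{>0.75}^{\,x / e_{>0.75}} $$
3rd negative sector:
exponential equation with 𝔻 {x | -∞ < x ≤ -e>0.75}
$$ F(x) = -\frac{e_{>0.75}}{\ln(r_{>0.75})} \cdot r_{>0.75}^{\,-x / e_{>0.75}} $$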
Note regarding the equations above that the computer program Maxima uses the notation log(x) instead of the usual
notation ln(x) for the natural logarithm, as does JavaScript ('Math.log()'). If the above equations
with 'ln' are to be used in such programs, 'ln' has to be replaced by 'log'.
If you experiment with the interactive form above, you will soon notice that with extreme evaluations the
relevant evaluation difference hardly changes when even more extreme evaluations are entered. Example
for White:
high evaluation = 15
suboptimal evaluation = 0
e=0.75 = 2
e>0.75 = 3
probabilistic game result at e>0.75 = 0.85 (corresponds to a r>0.75 = 0.3)
result of the relevant evaluation difference = 2.64
If the high evaluation is increased to 18, the relevant evaluation difference amounts to 2.65. And a high
evaluation of 1000 results in a relevant evaluation difference of 2.65. The same results occur if the suboptimal
evaluation is -15, -18 or -1000 and the high evaluation amounts to 0.
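The numbers of this example can be checked with a short Python sketch that integrates the relevance curve in closed form. It is a reconstruction, not the code of the form: it assumes a relevance of 1 - |x| / (2 · e=0.75) in the 1st sector, a linear fall from 0.5 to r>0.75 in the 2nd sector and r>0.75^(|x| / e>0.75) in the 3rd sector. The function and parameter names are hypothetical.

```python
import math


def area_from_zero(x, e075, e_hi, r_hi):
    """Signed area under the relevance curve between 0 and x."""
    ax = abs(x)
    a1 = min(ax, e075)
    area = a1 - a1 * a1 / (4.0 * e075)            # 1st sector
    if ax > e075:
        a2 = min(ax, e_hi) - e075
        slope = (0.5 - r_hi) / (e_hi - e075)
        area += 0.5 * a2 - slope * a2 * a2 / 2.0  # 2nd sector
    if ax > e_hi:
        k = e_hi / math.log(r_hi)                 # negative, as 0 < r_hi < 1
        area += k * (r_hi ** (ax / e_hi) - r_hi)  # 3rd sector
    return math.copysign(area, x)


def relevant_difference(high, low, e075, e_hi, r_hi):
    """Definite integral of the relevance curve between low and high."""
    return (area_from_zero(high, e075, e_hi, r_hi)
            - area_from_zero(low, e075, e_hi, r_hi))


for high in (15, 18, 1000):
    print(round(relevant_difference(high, 0, 2, 3, 0.3), 2))
```

The loop prints 2.64, 2.65 and 2.65, and `relevant_difference(0, -15, 2, 3, 0.3)` gives the same 2.64 for the mirrored case.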
The relevant evaluation differences are rounded to 2 decimal places in the form. If you now want to
calculate the high or suboptimal evaluation (called 'irrelevance start evaluation' from here on, x_irr in the formulas) beyond which any
further increase or reduction, all the way to infinity, adds no more than 0.005 to the relevant evaluation difference, so that the value rounded to
2 decimal places increases by 0.01 with a probability of at most 50%, you need the following formula:
$$ \frac{e_{>0.75}}{-\ln(r_{>0.75})} \cdot r_{>0.75}^{\,x_{irr} / e_{>0.75}} = 0.005 $$
solved for the irrelevance start evaluation and taking into account high and low results ("±"):
$$ x_{irr} = \pm\, e_{>0.75} \cdot \frac{\ln\left(\frac{-0.005 \cdot \ln(r_{>0.75})}{e_{>0.75}}\right)}{\ln(r_{>0.75})} $$
The result with the above parameters is ±15.477.
The formula shows that the irrelevance start evaluation is independent of the 2nd evaluation (0 in the above
case) and of e=0.75 (2 in the above case).
This formula normally applies, because the irrelevance start evaluation is usually located in the 3rd positive and
negative sector. With unusual values of e>0.75 and r>0.75, the irrelevance start
evaluation slips into the 2nd positive and negative sector, so that far more complicated formulas are
required. This happens when the following applies:
$$ \frac{e_{>0.75} \cdot r_{>0.75}}{-\ln(r_{>0.75})} < 0.005 $$
For example, if e>0.75 < 0.978 and the probabilistic game result at e>0.75 = 0.99 (r>0.75 = 0.02). Or if e>0.75 < 0.0277 and the
probabilistic game result at e>0.75 = 0.875 (r>0.75 = 0.25). Highly unrealistic!
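The irrelevance start evaluation and the condition for the unusual case can be reproduced with the following Python sketch. It assumes that the relevance in the 3rd sector is r>0.75^(x / e>0.75) and that the remaining area up to infinity must not exceed 0.005 (half of the rounding step of 0.01); the function names are hypothetical.

```python
import math


def irrelevance_start(e_hi: float, r_hi: float, half_step: float = 0.005) -> float:
    """Positive evaluation from which the remaining area to infinity is half_step."""
    tail_factor = e_hi / -math.log(r_hi)  # area from x to infinity = tail_factor * r_hi**(x / e_hi)
    return e_hi * math.log(half_step / tail_factor) / math.log(r_hi)


def slips_into_second_sector(e_hi: float, r_hi: float, half_step: float = 0.005) -> bool:
    """True if the whole 3rd sector already contributes less than half_step."""
    return e_hi * r_hi / -math.log(r_hi) < half_step


print(round(irrelevance_start(3, 0.3), 3))  # 15.477
```

As the formula suggests, neither e=0.75 nor the second evaluation appears among the arguments.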
The evaluation relevance reduction and all the features mentioned in this article (automatic move and
position evaluation symbols as well as the probabilistic game results) have been implemented
in the ScpcPGN program, available free of charge on this
website
and in the AquaPGN program (latest update 12th August 2020), available free of charge on this
website.
Probabilistic game results
Why is there talk of 'probabilistic' game results? Because they are derived from an engine evaluation and
other parameters and therefore contain a stochastic statement about the presumed average game outcome. The
situation was different in the discussion of the TCEC results, where only the 'average' game results were
mentioned, because there was game material available with which the factual average game results could be
calculated.
The probabilistic game result is always presented here from White's point of view. If White wins, the
result is 1-0; if Black wins, 0-1; and in the case of a draw ½-½. The first number in each case, i.e. White's score, is
the probabilistic game result used here.
It can be derived directly from the evaluation relevance:
for positive evaluations:
probabilistic game result = 1 - (evaluation relevance / 2);
for negative evaluations:
probabilistic game result = evaluation relevance / 2.
An engine evaluation of exactly 0.00 with an evaluation relevance of 1.00 results in a probabilistic game
result of 0.50, i.e. a presumed draw. A probabilistic game result of approximately 1.00 would be an almost
certain win for White, and one of approximately 0.00 an almost certain win for Black. 1.00 and 0.00
are never reached exactly in mathematical terms. And an engine evaluation of exactly e=0.75 leads to a result of 0.75, i.e. a
value that lies exactly halfway between a win for White and a draw. The results are therefore easiest to interpret from
White's point of view.
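The two conversion rules stated above translate directly into code. A minimal Python sketch (the function name is hypothetical):

```python
def probabilistic_game_result(evaluation: float, relevance: float) -> float:
    """Probabilistic game result from White's point of view.

    evaluation -- engine evaluation in pawn units (its sign decides the branch)
    relevance  -- evaluation relevance of that evaluation, between 0 and 1
    """
    if evaluation >= 0:
        return 1.0 - relevance / 2.0
    return relevance / 2.0
```

An evaluation of 0.00 with relevance 1.00 gives 0.5, an evaluation of e=0.75 with relevance 0.5 gives 0.75, and the mirrored evaluation -e=0.75 gives 0.25.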
Clarification: the probabilistic game result is in no way equivalent to a win probability.
Many people make this mistake. For example, the programme Nibbler labels what is in reality the
probabilistic game result as 'Winrate': in the position after 1. e4, for example,
this 'Winrate' exceeds 50%, while the actual win probability in the 'WDL' display is only a modest 15%. But
the programme author apparently does not notice this.
Put succinctly:
probabilistic game result = win probability + (draw probability / 2).
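For the sake of completeness, here are also the game result equations (x stands for the engine evaluation, g for the probabilistic game result):
1st positive and negative sector:
linear equation with 𝔻 {x | -e=0.75 ≤ x ≤ e=0.75}
$$ g = 0.5 + \frac{x}{4 \cdot e_{=0.75}} $$
2nd positive sector:
linear equation with 𝔻 {x | e=0.75 ≤ x ≤ e>0.75}
$$ g = 0.75 + \frac{(0.5 - r_{>0.75}) \cdot (x - e_{=0.75})}{2 \cdot (e_{>0.75} - e_{=0.75})} $$
2nd negative sector:
linear equation with 𝔻 {x | -e>0.75 ≤ x ≤ -e=0.75}
$$ g = 0.25 - \frac{(0.5 - r_{>0.75}) \cdot (-x - e_{=0.75})}{2 \cdot (e_{>0.75} - e_{=0.75})} $$
3rd positive sector:
exponential equation with 𝔻 {x | e>0.75 ≤ x < ∞}
$$ g = 1 - \frac{r_{>0.75}^{\,x / e_{>0.75}}}{2} $$
3rd negative sector:
exponential equation with 𝔻 {x | -∞ < x ≤ -e>0.75}
$$ g = \frac{r_{>0.75}^{\,-x / e_{>0.75}}}{2} $$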
Of course, the probabilistic game results can also be found in the interactive form.
How you should not do it, though:
Sune Fischer and Pradu Kannan examined the mathematical relation between 'winning probability W and the
pawn advantage P' in the article
'Pawn
Advantage, Win Percentage, and Elo'.
Whether 'winning probability' really means the true (lower) 'winning probability' or perhaps only the (higher,
since draws are taken into account) probabilistic game result can be deduced from another passage of the article:
'When applying the condition that the win probability is 0.5 if there is no pawn advantage …'
If 'the win probability is 0.5' and the 'pawn advantage' is zero, the loss probability would necessarily also
have to be 0.5 for the position to count as balanced. But where, then, are the draws, whose share should
itself be close to this mark when the win probability is 50% and the loss probability is low?! It seems that the authors'
knowledge of the game of chess is quite limited. This nonsense must therefore be corrected to the effect that the
authors are not referring to the 'win probability' but to the probabilistic game result discussed in this
article, which includes draws and losses. This is how the calculation works out: a probabilistic game result of
0.5 corresponds to an evaluation, or if you like a 'pawn advantage', of 0.00.
'Data was taken from a collection of 405,460 computer games in PGN format. Whenever exactly 5 plys in a game
had gone by without captures, the game result was accumulated twice in a table indexed by the material
configuration. … Only data pertaining to the material configuration was taken. This was considered reasonable
because the material configuration is the most important quantity that affects the result of a game.'
That 'material configuration' means the material balance, i.e. the difference between the two sides' piece values,
can be assumed because it is stated elsewhere:
'For each material configuration, a pawn value was computed using conventional pawn-normalized material ratios
that are close to those used in strong chess programs (P=1, N=4, B=4.1, R=6, Q=12).'
Apart from the fact that these piece values seem quite generous, the material balance is very coarse
compared to the evaluations of chess engines, which are based on much more subtle criteria and, last but not
least, on considerable search depths. But all this would still be bearable if the relation between win
probability and material balance presented by the authors were rigorous. However, an ominous parameter 'K'
appears in their final formula:
$$ W = \frac{1}{1 + 10^{-P/K}} $$
And they estimate this parameter 'K' at '4', roughly.
If you solve this formula for K, you get:
$$ K = \frac{P}{\log_{10}\left(\frac{W}{1 - W}\right)} $$
And if you insert into this formula, for example, the Ps and Ws determined above for the winning engines of
TCEC 17 (LCZero) and 18 (Stockfish), you get very different Ks between 1.7 and 3.2.
Conversely, a K of 4 with a probabilistic game result of 0.75 would result in an evaluation of 1.91, which is not very realistic
according to the above table values. This assessment is confirmed by the following test: Determine within the Stockfish WDL
calculation the evaluations for different half-moves, each with a probabilistic game result of 0.75. One obtains
in half-move 1 an evaluation of 1.50,
in half-move 10 an evaluation of 1.40,
in half-move 100 an evaluation of 1.15
and never an evaluation of 1.91.
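The spread of K can be recomputed with a few lines of Python. The sketch assumes the sigmoid W = 1 / (1 + 10^(-P/K)) from the cited article and uses, as (P, W) pairs, the three value pairs listed in the table above for LCZero (Superfinal 17) and Stockfish (Superfinal 18); the function names are hypothetical.

```python
import math


def k_from(pawn_advantage: float, result: float) -> float:
    """Solve W = 1 / (1 + 10**(-P / K)) for K."""
    return pawn_advantage / math.log10(result / (1.0 - result))


def pawn_advantage_from(result: float, k: float = 4.0) -> float:
    """Solve the same formula for P at a given K."""
    return k * math.log10(result / (1.0 - result))


# (evaluation, game result) pairs taken from the table above
lczero_17 = [(1.34, 0.75), (5.01, 0.9722), (1.89, 0.8810)]
stockfish_18 = [(0.87, 0.75), (3.74, 0.9792), (1.41, 0.8710)]

ks = [k_from(p, w) for p, w in lczero_17 + stockfish_18]
print(round(min(ks), 1), round(max(ks), 1))   # 1.7 3.2
print(round(pawn_advantage_from(0.75), 2))    # 1.91
```

The smallest K comes from Stockfish's pair 1.41 / 0.8710 and the largest from LCZero's pair 5.01 / 0.9722.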
Evidently, it is illusory to try to force the desired relation mathematically
into a single sigmoid function with only one parameter ('K'). By contrast, the form 'Interactive
Evaluation Relevance Reduction' presented at the beginning of this article calculates the probabilistic game results in
the user ERR with a total of 5 formulas and 3 parameters, and in the Stockfish WDL with very accurate win, draw and loss
probabilities. Precision instead of simplification!
Concretisation of the move evaluation sectors
It may seem tasteless to derive the move evaluation symbols below more or less automatically from
engine evaluations, since they are often chosen on the basis of a deeper understanding of the position and
are not oriented towards engine evaluations. Example: In a position there is quite clearly only
one reasonable move that every child can find; all other moves would be miserable. It would be
more than stupid to award this one move the quality mark '‼'. Or a little more subtle: In
a lost position, a move that is objectively weak, i.e. theoretically refutable, sets a trap
that holds the chance of revival. A typical "interesting move (!?) - NAG $5", which should
perhaps not be marked with "?" or the like. Nevertheless, in many cases it can well make sense
to determine such move evaluation symbols from a comparison of the engine
evaluations for two alternative moves, especially if there is no opportunity to examine a
position more carefully, for example in automatic game analyses.
Grandmaster Robert Hübner's intention cannot be implemented in this way. The
English-language Wikipedia quotes him as follows:
'German grandmaster Robert Hübner prefers an even more specific and restrained use of move evaluation symbols:
'I have attached question marks to the moves which change a winning position into a drawn game, or a drawn
position into a losing one, according to my judgment; a move which changes a winning game into a losing one
deserves two question marks ...''
Uncertain assessments such as 'winning position', 'drawn game', 'drawn position' or 'losing one' do not become
more suitable for programming by the addition 'according to my judgment'.
The starting point for classifying the move evaluation symbol is, of course, the move actually played. It is
compared with the best alternative move in the case of bad moves and with the second-best alternative move in
the case of good moves. For these two moves - as explained above - the relevance-reduced evaluation difference
has to be determined, and this in turn has to be translated into the move evaluation symbol. For this purpose,
the definite integral over the entire evaluation range from -∞ to +∞ is divided not into 6, but into 7 or 8
sectors of equal size. There are not only the 6 sectors to which a move evaluation symbol is assigned, but also
the neutral sector of a move that is approximately equal to the best or second-best move. Half of this neutral
sector lies in the positive evaluation direction and half in the negative. One can either use a neutral sector
with the same integral size as the remaining sectors or a neutral sector twice as large, consisting of 2
sectors of the usual integral size, one for each evaluation direction. This gives a total of either 7 or 8
equal integral areas (in the latter variant, 2 integral sectors for the neutral sector).
Mind you: We are talking here about integral sectors and sizes respectively in the sense of
definite integrals, i.e. the relevant evaluation differences, not to be confused with the
absolute differences between 2 move evaluations on the x-axis. For a given relevant evaluation
difference, the latter are quite different, depending on where the move evaluations are located
on the x-axis. The further they move away from the y-axis, i.e. from the move evaluation 0.00,
the more their distance to each other increases with a given relevant evaluation difference.
Mathematically it is even possible - based on e=0.75, e>0.75, r>0.75 as well as a given move evaluation - to
calculate the limit value that a new move evaluation would have to reach for a move to earn any given move
evaluation symbol. This is hard to grasp, so here is an example: Given is a faulty move by White with an
evaluation of -0.30, an e=0.75 of 1.50, an e>0.75 of 3.00 and a probabilistic game result at e>0.75 of 0.875. From
which evaluation onwards would an alternative good move by White deserve the move evaluation symbol '‼'
compared to this weak move, which is at the same time the next best move? Depending on the scheme used, the
answer will be, for example, 1.52 or 1.62.
Of course, such move evaluation symbols only come into effect if correspondingly high definite integrals -
pardon: relevant evaluation differences - are available at all. A correct move by White with an engine
evaluation of 100.00 will hardly earn a '!?', '!' or '‼', even if the second-best move is only 10.00. This
positive evaluation difference is simply irrelevant and is therefore reflected in a relevant evaluation
difference of almost 0.00. A won position can usually not be spoiled by second-best moves. That is precisely
the effect of the evaluation relevance reduction.
How large should these relevant evaluation differences for the move evaluation symbols be?
Possibly with the exception of the neutral sector, the entire integral area could be divided
into equal parts, or the subdivision could be aligned in such a way that a brilliant move can
already be declared if it exceeds the win draw balance while the next best move has to make do with
an evaluation of 0.00. The first alternative is more economical with the move evaluation symbols,
the second is more generous.
Here the relevant evaluation difference between the initial evaluation and the threshold for reaching the move
evaluation symbol is
brilliant move (‼) - 1/14 + 1/7 + 1/7 = 5/14 of the total integral towards a better evaluation,
impressive move (!) - 1/14 + 1/7 = 3/14 of the total integral towards a better evaluation,
attractive move (!?) - 1/14 of the total integral towards a better evaluation,
questionable move (?!) - 1/14 of the total integral towards a suboptimal evaluation,
weak move (?) - 1/14 + 1/7 = 3/14 of the total integral towards a suboptimal evaluation and
miserable move (??) - 1/14 + 1/7 + 1/7 = 5/14 of the total integral towards a suboptimal evaluation.
From this, the thresholds of the move evaluations for White and Black can now be calculated with formulas
which are not shown here, but are available in a browser inspector via Javascript code.
Generally, move evaluation symbols are assigned more generously here than in the following scheme '1/8 1/8 1/8
1/8 1/8 1/8 1/8 1/8'.
Here the relevant evaluation difference between the initial evaluation and the threshold for reaching the move
evaluation symbol is
brilliant move (‼) - 1/8 + 1/8 + 1/8 = 3/8 of the total integral towards a better evaluation,
impressive move (!) - 1/8 + 1/8 = 1/4 of the total integral towards a better evaluation,
attractive move (!?) - 1/8 of the total integral towards a better evaluation,
questionable move (?!) - 1/8 of the total integral towards a suboptimal evaluation,
weak move (?) - 1/8 + 1/8 = 1/4 of the total integral towards a suboptimal evaluation and
miserable move (??) - 1/8 + 1/8 + 1/8 = 3/8 of the total integral towards a suboptimal evaluation.
Generally, move evaluation symbols are assigned less generously here than in the previous scheme '1/7 1/7 1/7
1/14 1/14 1/7 1/7 1/7'.
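The two threshold lists above can be sketched as a small lookup. This is only an illustration: the relevant evaluation difference is passed in as a signed share of the total integral, because the formulas that produce it are not shown here. The names `SCHEMES` and `move_symbol` are hypothetical, and treating a threshold as reached when the share equals it exactly is an assumption.

```python
from fractions import Fraction

# Thresholds as shares of the total integral, copied from the two lists above.
SCHEMES = {
    "sevenths": (Fraction(1, 14), Fraction(3, 14), Fraction(5, 14)),
    "eighths": (Fraction(1, 8), Fraction(1, 4), Fraction(3, 8)),
}
GOOD_SYMBOLS = ("!?", "!", "‼")
BAD_SYMBOLS = ("?!", "?", "??")


def move_symbol(share, scheme="sevenths"):
    """Map a signed share of the total integral to a move evaluation symbol.

    share > 0: the played move is better than the second-best move.
    share < 0: the played move is worse than the best move.
    Returns "" for the neutral sector.
    Assumption: a threshold counts as reached when the share equals it.
    """
    symbols = GOOD_SYMBOLS if share > 0 else BAD_SYMBOLS
    result = ""
    for threshold, symbol in zip(SCHEMES[scheme], symbols):
        if abs(share) >= threshold:
            result = symbol
    return result


if __name__ == "__main__":
    for share in (Fraction(1, 10), Fraction(-3, 10), Fraction(-36, 100)):
        print(share, repr(move_symbol(share, "sevenths")), repr(move_symbol(share, "eighths")))
```

A share of 1/10 already earns '!?' in the first scheme but stays neutral in the second, which illustrates why the first scheme is the more generous one.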
In the interactive form, the thresholds between the symbols in both scheme tables are listed, as far as
algebra allows it, i.e. as far as the margin of relevant evaluation difference remaining after the initial
evaluation allows an award. If not, the character string '-----' is output.
Optimum rate:
In the results under the form 'Interactive Evaluation Relevance Reduction' you will also find the 'optimum
rate of the suboptimal move evaluation'. This contains the precise numerical expression for the move evaluation
symbol of the suboptimal move evaluation (nothing, ?!, ?, ??).
It is calculated as follows:
1 - (relevant evaluation difference / total integral)
The total integral is the definite integral over the entire x-axis with the evaluations from -∞ to +∞.
So the optimum rate is usually below 100% and reaches the optimum of 100% only in the exceptional case of 2 move
evaluations without relevant evaluation difference.
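The calculation above can be written as a one-line function. In this sketch both inputs are assumed to be already known, since the formulas for the relevant evaluation difference and the total integral are not shown here. The function name and the example value for the total integral are hypothetical, and taking the difference as a magnitude is an assumption.

```python
def optimum_rate(relevant_evaluation_difference: float, total_integral: float) -> float:
    """Optimum rate = 1 - (relevant evaluation difference / total integral).

    Assumption: the relevant evaluation difference is used as a magnitude.
    """
    if total_integral <= 0:
        raise ValueError("total_integral must be positive")
    return 1.0 - abs(relevant_evaluation_difference) / total_integral


if __name__ == "__main__":
    total = 4.0  # hypothetical total integral, for illustration only
    # Differences sitting exactly on the thresholds of the 1/14-based scheme.
    for share in (0.0, 1 / 14, 3 / 14, 5 / 14):
        print(f"{optimum_rate(share * total, total):.2%}")
```

In the scheme with the thresholds 1/14, 3/14 and 5/14, the symbols '?!', '?' and '??' therefore begin at optimum rates of about 92.9%, 78.6% and 64.3%, whatever the total integral is.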
Concretisation of the position evaluation sectors
The 9 evaluation sectors listed at the beginning of the article can now be described in more detail using the
mathematical foundations developed above. Four evaluation sectors each are positive and negative. The
balanced position covers minimal advantages for White and Black around the value zero. The sector of
the minimal advantage for White or for Black is 50% of the total balanced sector.
9 position evaluation sectors with threshold adjustment at the probabilistic game results:
Here an assumption is made that is not mandatory, but very plausible: The end of the sector 'moderate
advantage for White' and the beginning of the sector 'clear advantage for White' should coincide exactly with
the evaluation e=0.75 for which the probabilistic game result amounts to 0.75. Conversely for Black: The end of
the sector 'moderate advantage for Black' and the beginning of the sector 'clear advantage for Black' should
coincide exactly with the evaluation -e=0.75, for which the probabilistic game result amounts to 0.25 from White's
point of view. This basic assumption implies that a slight or moderate advantage probabilistically
represents a tendency to draw and a clear or extreme advantage probabilistically represents a tendency to
win.
Further assumption: The end of the sector 'clear advantage for White' and the beginning of the sector 'extreme
advantage for White' should exactly coincide with the evaluation e>0.75. Conversely for Black: The end of the
sector 'clear advantage for Black' and the beginning of the sector 'extreme advantage for Black' should
coincide exactly with the evaluation -e>0.75.
When using this scheme, it would probably be useful to adjust the probabilistic game result at e>0.75 to 0.875,
so that it lies exactly in the middle between 0.75 and 1.00.
Now some mathematics again:
The task now is to quantify these individual advantage sectors. If, for example, one compared a white move
with an overwhelming advantage of 100.00 to a patzer move leading to a draw (0.00), the absolute evaluation
difference would be 100.00, but the relevant evaluation difference would be only the practically complete
definite integral of all functions in the exclusively positive range of the x-axis (which in turn is identical
to the definite integral in the exclusively negative range of the x-axis).
By the way, the mathematical formula for the complete integral from -∞ to +∞ is:
Next thought experiment: If one now compared a white move with an advantage of e=0.75, exactly at the border
between moderate and clear advantage, with a patzer move that leads to a draw (0.00), the absolute evaluation
difference would be e=0.75, but the relevant evaluation difference would only be the complete definite integral in
the 1st positive sector of the x-axis. As a mathematical formula: 0.75 * e=0.75.
If one now sets out to quantify the definite integrals between x = 0 and the beginning of the slight
advantage, between the latter and the beginning of the moderate advantage, and again between the latter
and the beginning of the clear advantage, in each case for White/Black, the integral value of 0.75 * e=0.75 would have
to be divided into 3 sectors:
20% = 0.15 * e=0.75 for the sector balanced position from 0.00,
40% = 0.30 * e=0.75 for the sector slight advantage for White/Black and
40% = 0.30 * e=0.75 for the sector moderate advantage for White/Black.
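The split above can be checked with a few lines of arithmetic. The sketch below returns the cumulative relevant evaluation difference, measured from 0.00, at the outer border of each sector. These are integral values, not evaluations: converting them into evaluation thresholds requires the formulas that are not shown here. The function name and the example value e=0.75 = 1.50 are chosen for illustration.

```python
def sector_borders(e_075, shares):
    """Cumulative relevant evaluation difference at the end of each sector.

    e_075:  the evaluation whose probabilistic game result is 0.75 (e=0.75 in the text).
    shares: sector shares of the integral 0.75 * e_075, listed from 0.00 outwards.
    """
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    integral = 0.75 * e_075
    borders = []
    running = 0.0
    for share in shares:
        running += share * integral
        borders.append(running)
    return borders


if __name__ == "__main__":
    # Balanced (from 0.00), slight advantage, moderate advantage.
    print(sector_borders(1.50, (0.20, 0.40, 0.40)))
```

With e=0.75 = 1.50 the integral to be divided is 1.125, and the three borders lie at integral values of about 0.225, 0.675 and 1.125. A different split, such as 1/3 and 2/3, can be passed in the same way.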
From this, the thresholds of the position evaluations for White and Black can now be calculated with formulas
which are not shown here, but are available in a browser inspector via Javascript code.
7 position evaluation sectors with threshold adjustment at the probabilistic game results:
'Extreme advantage for White (+--) or Black (-++) - NAG $20/$21' may not be everyone’s cup of tea. For these
contemporaries, here is a repetition of the previous proposal, but this time with only 7 evaluation sectors
without extremes.
Here, the end of the sector 'slight advantage for White' and the beginning of the sector 'moderate advantage
for White' coincide exactly with e=0.75, for which the probabilistic game result amounts to 0.75, and the end of
the sector 'moderate advantage for White' and the beginning of the sector 'clear advantage for White' coincide
exactly with e>0.75. Conversely for Black: The end of the sector 'slight advantage for Black' and the beginning
of the sector 'moderate advantage for Black' coincide exactly with -e=0.75, for which the probabilistic game
result amounts to 0.25 from the white point of view, and the end of the sector 'moderate advantage for Black'
and the beginning of the sector 'clear advantage for Black' coincide exactly with the evaluation -e>0.75. This
basic assumption is accompanied by the fact that slight or moderate advantage probabilistically represents a
tendency to draw and clear advantage probabilistically represents a tendency to win.
If one sets out here to quantify the definite integrals between x = 0 and the beginning of the slight advantage
and between the latter and the beginning of the moderate advantage, in each case for White/Black, the integral
value of 0.75 * e=0.75 would have to be divided into 2 sectors:
1/3 = 0.25 * e=0.75 for the sector balanced position from 0.00 and
2/3 = 0.50 * e=0.75 for the sector slight advantage for White/Black.
9 position evaluation sectors with threshold adjustment at identical evaluation sectors 1/9 1/9 1/9
1/9 1/18 1/18 1/9 1/9 1/9 1/9 of the total integral:
If one discards the above guideline of threshold adjustment at the probabilistic game results and again
prefers 4.5 positive and 4.5 negative position evaluation sectors, this time, however, of equal size, the
evaluation areas would turn out as shares of the total integral as follows:
1/18 for the sector balanced position from 0.00,
1/9 for the sector slight advantage for White/Black,
1/9 for the sector moderate advantage for White/Black,
1/9 for the sector clear advantage for White/Black and
1/9 for the sector extreme advantage for White/Black.
7 position evaluation sectors with threshold adjustment at identical evaluation sectors 1/7 1/7 1/7
1/14 1/14 1/7 1/7 1/7 of the total integral:
If one discards the above guideline of threshold adjustment at the probabilistic game results and is also not
a friend of 4.5 positive and 4.5 negative position evaluation sectors with extremes, this scheme with sectors
of equal size remains:
1/14 for the sector balanced position from 0.00,
1/7 for the sector slight advantage for White/Black,
1/7 for the sector moderate advantage for White/Black and
1/7 for the sector clear advantage for White/Black.
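As a quick consistency check of the two equal-share schemes, the sector shares on one side of 0.00 can be accumulated. The outermost border must then come out at exactly half of the total integral. The dictionary and function names below are hypothetical.

```python
from fractions import Fraction
from itertools import accumulate

# One-sided sector shares of the total integral, copied from the two lists above.
NINE_SECTORS = {
    "balanced (one half)": Fraction(1, 18),
    "slight": Fraction(1, 9),
    "moderate": Fraction(1, 9),
    "clear": Fraction(1, 9),
    "extreme": Fraction(1, 9),
}
SEVEN_SECTORS = {
    "balanced (one half)": Fraction(1, 14),
    "slight": Fraction(1, 7),
    "moderate": Fraction(1, 7),
    "clear": Fraction(1, 7),
}


def cumulative_shares(scheme):
    """Share of the total integral between 0.00 and the outer border of each sector."""
    return dict(zip(scheme, accumulate(scheme.values())))


if __name__ == "__main__":
    for name, scheme in (("9 sectors", NINE_SECTORS), ("7 sectors", SEVEN_SECTORS)):
        print(name, {k: str(v) for k, v in cumulative_shares(scheme).items()})
```

In both schemes the last cumulative share is 1/2, so the positive and negative halves together cover the whole integral.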
The interactive form lists the position evaluation symbols and the limit values between the symbols, the
latter in a separate line for each of the 4 schemes.
A tip by the way: If the well-disposed reader would like to use the position evaluation symbols
but cannot get hold of them, the AqChessUnicode font could be helpful. Incidentally, it also comes
with the chess GUI Aquarium.
An evaluation of around 1.50 pawn units for achieving an average game result of 0.75 applies to largely optimal
chess play, as practised by the best chess engines nowadays, but not necessarily to human chess players, not
even to grandmasters, who also far too often play bullshit and should therefore theoretically have to make do
with a clearly higher e=0.75. The reason for this would be their tendency to make mistakes, which lets them draw
or even lose games that were believed to be already won. One objection to this, however, is that this
measured value would be pushed down again by the mistakes of their opponents of the genus Homo sapiens,
because those mistakes often lead to wins that were by no means inevitable; good chess engines
might have been able to defend such positions under pressure. In this way, many games that would actually be
draws with temporarily high evaluations could be statistically counted among the wins without pushing the
e=0.75 up or, on the contrary, even lowering it, since with every additional win a lower evaluation on the
waiting list moves up to become the new e=0.75. In this respect, suboptimal chess play would be upgraded by the
opponent's suboptimal play. Which of the two effects of the chess-playing Homo sapiens on the e=0.75 will
prevail is uncertain.
Even if chess grandmasters still had the guts to face the best chess engines, their true e=0.75 could hardly be
determined. After all, when would they have a clear advantage in such games or even carry off wins? Maybe in
extreme handicap games? These could be used to test how many pawns would have to be taken away from the
computer opponent in the starting position in order to obtain wins and draws on a significant scale for the
master who thus gets off scot-free. Or how a given opening would have to be constructed to release the chess
engine into a questionable position. In this way the grandmasterly e=0.75 could be determined after all. But
since contemporary chess luminaries have long been avoiding such comparisons more and more in order to escape
disgrace, such a question hardly arises any more.
Since such game material from matches between man and machine is hardly available, the only
option left at present, and presumably for all time, is the half-baked one of evaluating
games between humans. One should always keep in mind that the resulting scores are
diluted by the dubious playing style of the opponent. Forget it.
No sooner said than done, by analysing the 144 world championship games between Karpov and
Kasparov in the years 1984 to 1990. The very last game is left out of consideration, since Kasparov
agreed to a draw with Karpov there despite a clear advantage, although the win - as chess
slang has it - was only a question of technique. A draw was enough for him to win the world championship
title. All games were superficially analyzed by Stockfish with a short thinking time and an
average depth of just over 20 half-moves.
To make a long story short: Kasparov won 21 times, Karpov 19 times. The 21 and 19 highest
evaluations respectively in draw games were between 3.67 and 1.00 for Kasparov and between 7.80
and 1.04 for Karpov. If you like, you can read from this a win draw balance of at least 1.00 …
Despite a positive evaluation of at least 1.26, the game was still botched in 5 cases. Kasparov even
messed up the 18th game of the 1986 world championship match despite a clear 3.67!
Excursus: "Draw range"
At this point the term "draw range", which crops up like a ghost from time to time,
should be scrutinized a little. It wrongly suggests that it coincides
with the evaluation range "balanced position or draw (=) - NAG $10". To the reader's chagrin,
however, a rather different understanding of this term comes to light.
"Therefore one believes with Houdini that a (won) endgame is still in the draw range when it
shows +0.80..." (Schachfeld).
This suggests that a statement about a drawn outcome of the game could be made on the basis of a
chess engine's position evaluation in the low range. Clearly, every win starts small,
namely with a minimal advantage, perhaps even after the first move. And if one sets the chess
engine to the first moves of a game won in this way and finds that the
game by no means started with an initial advantage of significantly more than +0.80, one might
start to think long and hard. And the counter-argument of later blunders, which are
said to have caused the disaster, won’t work if the patzer is called Stockfish, for example, and
has an Elo rating of approximately 3500. Take to heart the games Stockfish lost at TCEC. There you will
find a lot of games that ended in disaster for this engine despite a negative "draw range" of
about -0.76 or -0.80, although it is not known for handling its positions negligently
within the alleged "draw range". Who else but Stockfish should be able to hold such positions to
a draw?
Variant 2:
"If during a game no side has a winning advantage, it is also said that "the game is within the
draw range"." (Wikipedia).
'Draw range
Scope for a position evaluation, which will lead in the end with the best possible play on both
sides to a draw. In the example, White is worse, but is still in the draw range, because he can
prevent the pawn from promotion with his king. But if he had the idea to play 1.Kh1, e.g. hoping
for 1...f2 and stalemate, he would have left the draw range and Black could now force the victory
with best play, namely by 1...Kg4 including gain of the opposition. Whether the starting position
of the chess game is in the draw range, or whether perhaps White could force the victory, is too
complex to be answered." (www.schwachspieler.de).
Here, the term "draw range" is associated with an ominous "scope for a position evaluation" in
the case of a draw that can be forced by certain moves with best play, which can apparently be proven. In
connection with a demonstrable draw, however, even to utter the word "range" is a sign of
distorted logic. The draw is 0.00, nothing else. In this case a chess program would have to
deliver not only a position evaluation of 0.00, but also one or more drawing variations that are
compelling according to the rules of logic or according to endgame tablebases. This only works in
special positions, especially in all positions with at most 7 men, which are completely analysed; all
others are simply so complex that one has to be content with a position evaluation between zero
and checkmate without being able to draw any compelling conclusions about the outcome of the
game. And if a chess program showed a rubbish evaluation differing from 0.00 in a truly drawn
position, the program would have a code problem, and this would not justify the illogical term
"draw range".
If, as usual, a draw is not provable, one should certainly not use the term "draw range" to
lead the reader to believe in would-be knowledge that one cannot have in view of the complexity
of a chess game. Then only statistics and probability (the actual topic of this article) govern
all considerations about the outcome of the game, and opening databases with win, draw and loss
rates for one and the same position can tell a tale about it.
Contact: mail@konrod.info
Ende Gelände ♦ Aus die Maus ♦ Schicht im Schacht ♦ Klappe zu - Affe tot
So long ♦ See You Later, Alligator - In A While, Crocodile ♦ Over And Out