[Computer-go] Rating stability study

Brian Sheppard Mon, 29 Aug 2011 07:14:44 -0700

Pebbles' rating on 9x9 CGOS swung wildly last week, from 2310 up to 2400 and
back down to 2340 or so. The magnitude of the swing seemed odd to me, so I
investigated.


It turns out that Pebbles' rating depends heavily on whom it is playing.

Versions of Valkyria and Mogo had the following winning percentages against
Pebbles:

        Valkyria3.5.9_P4Bx (2641)       138/180 = 76.7% (*)
        _Mogo3MC90K,p (2393)            8/20 = 40.0%
        _Mogo3MC30K,p (2274)            21/86 = 24.4%

        (*) Some games against Valkyria were played before last week.

Pebbles' combined performance rating against these programs was 2447, on 286
games.

In contrast, Fuego and Aya won very high percentages against Pebbles:

        Fuego-1502M-1c (2621)   14/16   = 87.5%
        Fuego-1502-1c (2602)    26/28   = 92.9%
        Fuego-0.4.1,p (2492)    21/24   = 87.5%
        Aya727j_10k (2291)      32/53   = 60.4%

The combined performance rating of Pebbles in these games is just 2207, on
121 games.
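
The two performance ratings above can be approximately reproduced with the
standard logistic Elo model. This is a minimal sketch under assumptions not
stated in this post: the 400-point logistic expectancy curve, opponent ratings
taken as exact, and a simple bisection search. CGOS itself may compute ratings
differently, and the function names are illustrative.

```python
def expected_score(rating, opp_rating):
    """Logistic Elo expectancy (assumed model, 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))


def performance_rating(results, lo=0.0, hi=4000.0):
    """Rating at which the expected total score equals the actual score.

    results: list of (opponent_rating, wins, games) from Pebbles' side.
    Solved by bisection; needs a score strictly between 0% and 100%.
    """
    actual = sum(wins for _, wins, _ in results)
    for _ in range(60):
        mid = (lo + hi) / 2.0
        expected = sum(g * expected_score(mid, r) for r, _, g in results)
        if expected < actual:
            lo = mid  # rating too low to explain the score
        else:
            hi = mid
    return (lo + hi) / 2.0


# Pebbles' wins = games minus the opponent wins listed in the tables.
vs_valkyria_mogo = [(2641, 180 - 138, 180), (2393, 20 - 8, 20),
                    (2274, 86 - 21, 86)]
vs_fuego_aya = [(2621, 16 - 14, 16), (2602, 28 - 26, 28),
                (2492, 24 - 21, 24), (2291, 53 - 32, 53)]

print(round(performance_rating(vs_valkyria_mogo)))
print(round(performance_rating(vs_fuego_aya)))
```

Under these assumptions the two results should land within a few points of
the 2447 and 2207 quoted above.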

There is essentially zero probability that a 2207-rated program would play
at a 2447 level for 286 games, or that a 2447-rated program would play at a
2207 level for 121 games. Note, however, that there is a hard-to-quantify
selection bias in the way these ratings are calculated.
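
To put a rough number on that probability, one can ask how many standard
deviations the observed score lies from what a 2207-rated program would be
expected to score against Valkyria and the two Mogos. The sketch below assumes
the logistic Elo model, independent games, exact opponent ratings, and a
normal approximation to the tail (which is crude this far out). It ignores
the selection bias just mentioned.

```python
import math


def expected_score(rating, opp_rating):
    """Logistic Elo expectancy (assumed model, 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - rating) / 400.0))


def z_score(rating, results):
    """Standard deviations between actual and expected total score.

    results: list of (opponent_rating, wins, games). Wins only, no draws.
    """
    actual = sum(w for _, w, _ in results)
    probs = [(expected_score(rating, r), g) for r, _, g in results]
    mean = sum(g * p for p, g in probs)
    var = sum(g * p * (1.0 - p) for p, g in probs)
    return (actual - mean) / math.sqrt(var)


def upper_tail(z):
    """Normal approximation to P(Z >= z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Pebbles' wins and games against Valkyria, Mogo 90K, and Mogo 30K.
vs_valkyria_mogo = [(2641, 42, 180), (2393, 12, 20), (2274, 65, 86)]
z = z_score(2207, vs_valkyria_mogo)
print(f"z = {z:.1f}, tail probability ~ {upper_tail(z):.1e}")
```

With these inputs the gap comes out at roughly ten standard deviations, which
is why the probability can be treated as zero for practical purposes.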

Pebbles' run from 2310 to 2400 coincided with a period when no versions of
Fuego or Aya were running, but two Mogos and one Valkyria were running. Then
four versions of Fuego plus an Aya logged in, and Pebbles' rating dropped
fast.

A few years ago I had the impression that Pebbles did relatively well
against Fuego, but could not win a game against Mogo. So this relationship
has changed over time.

I haven't tuned Pebbles to play well against any specific opponents. I doubt
that anyone is targeting Pebbles.

It is clear that some programs frequently play specific openings. If Pebbles
plays one of those openings badly, it will lose often against that program.
The result could also arise from differences in understanding that push the
game in a specific direction.

One notable non-transitive relationship is that

        _Mogo3MC90K,p won 70% against Aya727j_10k
                (note: only 20 games, but result is consistent with rating)
        Aya727j_10k won 60% against Pebbles
        Pebbles won 60% against _Mogo3MC90K,p (12/20, from the 8/20 above)

My conclusion is that I am glad that I stopped using CGOS ratings as a way
to measure short-term progress. At a minimum I would have to re-weight games
to represent a consistent opposition profile.
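
One way such a re-weighting could look: compute Pebbles' win rate against
each opponent separately, then average those rates with fixed weights, so
the result no longer depends on who happened to be logged in. The choice of
opponents and the equal weights in `profile` are hypothetical, picked only
for illustration.

```python
def reweighted_score(results, profile):
    """Average per-opponent win rates using a fixed opposition profile.

    results: {opponent: (wins, games)} from Pebbles' side.
    profile: {opponent: weight}; weights are normalised to sum to 1.
    An opponent in the profile with zero games raises an error rather
    than being skipped, since skipping would change the profile.
    """
    total = sum(profile.values())
    score = 0.0
    for name, weight in profile.items():
        wins, games = results[name]
        if games == 0:
            raise ValueError(f"no games against {name}")
        score += (weight / total) * (wins / games)
    return score


def pooled_score(results):
    """Plain win rate over all games, which depends on the opponent mix."""
    wins = sum(w for w, _ in results.values())
    games = sum(g for _, g in results.values())
    return wins / games


# Pebbles' results against two opponents from the tables above.
week = {"_Mogo3MC30K,p": (65, 86), "Fuego-1502-1c": (2, 28)}
profile = {"_Mogo3MC30K,p": 1.0, "Fuego-1502-1c": 1.0}  # hypothetical weights

print(f"pooled:     {pooled_score(week):.3f}")
print(f"reweighted: {reweighted_score(week, profile):.3f}")
```

The pooled figure moves whenever the number of games against one opponent
changes, while the reweighted figure moves only when a per-opponent win rate
changes.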

Brian

