Hi all,

I have written a TLA+ specification for the replication protocol. It models
the life of a single ledger and includes two writers, the original writer
that creates the ledger and a second writer that can at any point (or not)
start the recovery process.

It found one problem with the fencing that allowed for a data loss scenario
to occur. You can find the details of that particular problem in issue 2614
(https://github.com/apache/bookkeeper/issues/2614).

The specification is here: https://github.com/Vanlightly/bookkeeper-tlaplus

The state space for a 4 broker deployment with a Qw 3, Qa 2 ledger is
pretty large and takes days to run on a 96 core beast. However, you can run
it with smaller ensemble sizes to run in mere hours.

Check it out if you're interested in formal verification. Also if there are
other aspects of BookKeeper that could benefit from verification let me
know.

Best regards
Jack


-- 
*Jack Vanlightly*
Principal Software Engineer
Splunk Inc.
jvanligh...@splunk.com <kramas...@splunk.com>
Barcelona

Reply via email to