Hi all, I have written a TLA+ specification for the replication protocol. It models the life of a single ledger and includes two writers, the original writer that creates the ledger and a second writer that can at any point (or not) start the recovery process.
It found one problem with the fencing that allowed for a data loss scenario to occur. You can find the details of that particular problem in issue 2614 (https://github.com/apache/bookkeeper/issues/2614). The specification is here: https://github.com/Vanlightly/bookkeeper-tlaplus The state space for a 4 broker deployment with a Qw 3, Qa 2 ledger is pretty large and takes days to run on a 96 core beast. However, you can run it with smaller ensemble sizes to run in mere hours. Check it out if you're interested in formal verification. Also if there are other aspects of BookKeeper that could benefit from verification let me know. Best regards Jack -- *Jack Vanlightly* Principal Software Engineer Splunk Inc. jvanligh...@splunk.com <kramas...@splunk.com> Barcelona