Re: TLA+ specification for the BookKeeper replication protocol

Jack Vanlightly Tue, 23 Feb 2021 00:52:50 -0800

In terms of tools, thinking processes, setup... you really need to start by
learning about TLA+. I couldn't explain how it changes your engineering
mindset/workflow better than Leslie Lamport himself. So if you're
interested in this kind of thing then I highly recommend watching the TLA+
video course by Leslie Lamport:
https://www.youtube.com/channel/UCajiu4Cj_GHOX0if3Up-eRA. I personally had
to watch it more than once to understand some points.

Leslie Lamport also wrote a book which has been helpful to me:
https://lamport.azurewebsites.net/tla/book.html. The book will require a
fair amount of study to understand it all. For me, the liveness stuff came
later, after I'd already written a couple of safety oriented specs.

Next up there are some examples you can learn from here:
https://github.com/tlaplus/CommunityModules

When you start writing your first spec you will hit brick walls along the
way. I highly recommend post your questions here:
https://groups.google.com/g/tlaplus. The community is very responsive and
helpful for all levels.

Regular engineers can pick up enough TLA+ in 2-3 weeks to write a useful
spec, you don't need a PhD in mathematics. Once you can write something
simple, then find something real to model. If you don't have a specific
thing you want to write a spec for then one thing you can do is take a
distributed system that you know and look for a protocol bug that has since
been fixed. Try to implement the most minimum spec that recreates the bug.
An example of this I did was the Apache Kafka KIP 101 which was a data loss
protocol bug.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+Alter+Replication+Protocol+to+use+Leader+Epoch+rather+than+High+Watermark+for+Truncation

Once you can put together a basic spec, find something that you want to
verify, whether it exists already or it is a design proposal. I recently
modelled Pravegas auto-scaling topics which is not too complex and makes
for a good first spec (as well as being related to BookKeeper).

On Mon, Feb 22, 2021 at 7:15 PM Karan Mehta <[email protected]>
wrote:

> [ External sender. Exercise caution. ]
>
> Awesome work Jack!
>
> I haven't worked with formal verification but have always been interested.
> Can you share the tools, thinking processes, setup etc that went into this?
>
>
> On Mon, Feb 22, 2021 at 9:19 AM Jack Vanlightly
> <[email protected]> wrote:
>
> > Hi all,
> >
> > I have written a TLA+ specification for the replication protocol. It
> models
> > the life of a single ledger and includes two writers, the original writer
> > that creates the ledger and a second writer that can at any point (or
> not)
> > start the recovery process.
> >
> > It found one problem with the fencing that allowed for a data loss
> scenario
> > to occur. You can find the details of that particular problem in issue
> 2614
> > (
> >
> https://urldefense.com/v3/__https://github.com/apache/bookkeeper/issues/2614__;!!DCbAVzZNrAf4!VUV5pNTOHyZQOg1ZhwuWT7JIcxobv26MM14ErklQukaMnxsSftnbe-mMrmNe4qfOnA$
> > ).
> >
> > The specification is here:
> >
> https://urldefense.com/v3/__https://github.com/Vanlightly/bookkeeper-tlaplus__;!!DCbAVzZNrAf4!VUV5pNTOHyZQOg1ZhwuWT7JIcxobv26MM14ErklQukaMnxsSftnbe-mMrmOuWn47BA$
> >
> > The state space for a 4 broker deployment with a Qw 3, Qa 2 ledger is
> > pretty large and takes days to run on a 96 core beast. However, you can
> run
> > it with smaller ensemble sizes to run in mere hours.
> >
> > Check it out if you're interested in formal verification. Also if there
> are
> > other aspects of BookKeeper that could benefit from verification let me
> > know.
> >
> > Best regards
> > Jack
> >
> >
> > --
> > *Jack Vanlightly*
> > Principal Software Engineer
> > Splunk Inc.
> > [email protected] <[email protected]>
> > Barcelona
> >
>
>
> --
> Karan Mehta
>
> <https://smart.salesforce.com/sig/k.mehta//us_mb/default/link.html>
>

Re: TLA+ specification for the BookKeeper replication protocol

Reply via email to