In terms of tools, thinking processes, setup... you really need to start by learning about TLA+. I couldn't explain how it changes your engineering mindset/workflow better than Leslie Lamport himself. So if you're interested in this kind of thing then I highly recommend watching the TLA+ video course by Leslie Lamport: https://www.youtube.com/channel/UCajiu4Cj_GHOX0if3Up-eRA. I personally had to watch it more than once to understand some points.
Leslie Lamport also wrote a book which has been helpful to me: https://lamport.azurewebsites.net/tla/book.html. The book will require a fair amount of study to understand it all. For me, the liveness stuff came later, after I'd already written a couple of safety oriented specs. Next up there are some examples you can learn from here: https://github.com/tlaplus/CommunityModules When you start writing your first spec you will hit brick walls along the way. I highly recommend post your questions here: https://groups.google.com/g/tlaplus. The community is very responsive and helpful for all levels. Regular engineers can pick up enough TLA+ in 2-3 weeks to write a useful spec, you don't need a PhD in mathematics. Once you can write something simple, then find something real to model. If you don't have a specific thing you want to write a spec for then one thing you can do is take a distributed system that you know and look for a protocol bug that has since been fixed. Try to implement the most minimum spec that recreates the bug. An example of this I did was the Apache Kafka KIP 101 which was a data loss protocol bug. https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+Alter+Replication+Protocol+to+use+Leader+Epoch+rather+than+High+Watermark+for+Truncation Once you can put together a basic spec, find something that you want to verify, whether it exists already or it is a design proposal. I recently modelled Pravegas auto-scaling topics which is not too complex and makes for a good first spec (as well as being related to BookKeeper). On Mon, Feb 22, 2021 at 7:15 PM Karan Mehta <k.me...@salesforce.com.invalid> wrote: > [ External sender. Exercise caution. ] > > Awesome work Jack! > > I haven't worked with formal verification but have always been interested. > Can you share the tools, thinking processes, setup etc that went into this? > > > On Mon, Feb 22, 2021 at 9:19 AM Jack Vanlightly > <jvanligh...@splunk.com.invalid> wrote: > > > Hi all, > > > > I have written a TLA+ specification for the replication protocol. It > models > > the life of a single ledger and includes two writers, the original writer > > that creates the ledger and a second writer that can at any point (or > not) > > start the recovery process. > > > > It found one problem with the fencing that allowed for a data loss > scenario > > to occur. You can find the details of that particular problem in issue > 2614 > > ( > > > https://urldefense.com/v3/__https://github.com/apache/bookkeeper/issues/2614__;!!DCbAVzZNrAf4!VUV5pNTOHyZQOg1ZhwuWT7JIcxobv26MM14ErklQukaMnxsSftnbe-mMrmNe4qfOnA$ > > ). > > > > The specification is here: > > > https://urldefense.com/v3/__https://github.com/Vanlightly/bookkeeper-tlaplus__;!!DCbAVzZNrAf4!VUV5pNTOHyZQOg1ZhwuWT7JIcxobv26MM14ErklQukaMnxsSftnbe-mMrmOuWn47BA$ > > > > The state space for a 4 broker deployment with a Qw 3, Qa 2 ledger is > > pretty large and takes days to run on a 96 core beast. However, you can > run > > it with smaller ensemble sizes to run in mere hours. > > > > Check it out if you're interested in formal verification. Also if there > are > > other aspects of BookKeeper that could benefit from verification let me > > know. > > > > Best regards > > Jack > > > > > > -- > > *Jack Vanlightly* > > Principal Software Engineer > > Splunk Inc. > > jvanligh...@splunk.com <kramas...@splunk.com> > > Barcelona > > > > > -- > Karan Mehta > > <https://smart.salesforce.com/sig/k.mehta//us_mb/default/link.html> >