Thanks Jack, this is quite helpful. I will check them out. On Tue, Feb 23, 2021 at 12:52 AM Jack Vanlightly <jvanligh...@splunk.com.invalid> wrote:
> In terms of tools, thinking processes, setup... you really need to start by > learning about TLA+. I couldn't explain how it changes your engineering > mindset/workflow better than Leslie Lamport himself. So if you're > interested in this kind of thing then I highly recommend watching the TLA+ > video course by Leslie Lamport: > > https://urldefense.com/v3/__https://www.youtube.com/channel/UCajiu4Cj_GHOX0if3Up-eRA__;!!DCbAVzZNrAf4!T7-W6Mvn3QiJ-yydRe46KPWA8QdglLKQzuRtxcm1p0hOne9XwSQk-OFRo0izO-It_w$ > . I personally had > to watch it more than once to understand some points. > > Leslie Lamport also wrote a book which has been helpful to me: > > https://urldefense.com/v3/__https://lamport.azurewebsites.net/tla/book.html__;!!DCbAVzZNrAf4!T7-W6Mvn3QiJ-yydRe46KPWA8QdglLKQzuRtxcm1p0hOne9XwSQk-OFRo0gS54kMaw$ > . The book will require a > fair amount of study to understand it all. For me, the liveness stuff came > later, after I'd already written a couple of safety oriented specs. > > Next up there are some examples you can learn from here: > > https://urldefense.com/v3/__https://github.com/tlaplus/CommunityModules__;!!DCbAVzZNrAf4!T7-W6Mvn3QiJ-yydRe46KPWA8QdglLKQzuRtxcm1p0hOne9XwSQk-OFRo0ij0OnNJA$ > > When you start writing your first spec you will hit brick walls along the > way. I highly recommend post your questions here: > > https://urldefense.com/v3/__https://groups.google.com/g/tlaplus__;!!DCbAVzZNrAf4!T7-W6Mvn3QiJ-yydRe46KPWA8QdglLKQzuRtxcm1p0hOne9XwSQk-OFRo0i76sUkQw$ > . The community is very responsive and > helpful for all levels. > > Regular engineers can pick up enough TLA+ in 2-3 weeks to write a useful > spec, you don't need a PhD in mathematics. Once you can write something > simple, then find something real to model. If you don't have a specific > thing you want to write a spec for then one thing you can do is take a > distributed system that you know and look for a protocol bug that has since > been fixed. Try to implement the most minimum spec that recreates the bug. > An example of this I did was the Apache Kafka KIP 101 which was a data loss > protocol bug. > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-101*-*Alter*Replication*Protocol*to*use*Leader*Epoch*rather*than*High*Watermark*for*Truncation__;KysrKysrKysrKysrKys!!DCbAVzZNrAf4!T7-W6Mvn3QiJ-yydRe46KPWA8QdglLKQzuRtxcm1p0hOne9XwSQk-OFRo0j6dCm0jg$ > > Once you can put together a basic spec, find something that you want to > verify, whether it exists already or it is a design proposal. I recently > modelled Pravegas auto-scaling topics which is not too complex and makes > for a good first spec (as well as being related to BookKeeper). > > > > > On Mon, Feb 22, 2021 at 7:15 PM Karan Mehta <k.me...@salesforce.com > .invalid> > wrote: > > > [ External sender. Exercise caution. ] > > > > Awesome work Jack! > > > > I haven't worked with formal verification but have always been > interested. > > Can you share the tools, thinking processes, setup etc that went into > this? > > > > > > On Mon, Feb 22, 2021 at 9:19 AM Jack Vanlightly > > <jvanligh...@splunk.com.invalid> wrote: > > > > > Hi all, > > > > > > I have written a TLA+ specification for the replication protocol. It > > models > > > the life of a single ledger and includes two writers, the original > writer > > > that creates the ledger and a second writer that can at any point (or > > not) > > > start the recovery process. > > > > > > It found one problem with the fencing that allowed for a data loss > > scenario > > > to occur. You can find the details of that particular problem in issue > > 2614 > > > ( > > > > > > https://urldefense.com/v3/__https://github.com/apache/bookkeeper/issues/2614__;!!DCbAVzZNrAf4!VUV5pNTOHyZQOg1ZhwuWT7JIcxobv26MM14ErklQukaMnxsSftnbe-mMrmNe4qfOnA$ > > > ). > > > > > > The specification is here: > > > > > > https://urldefense.com/v3/__https://github.com/Vanlightly/bookkeeper-tlaplus__;!!DCbAVzZNrAf4!VUV5pNTOHyZQOg1ZhwuWT7JIcxobv26MM14ErklQukaMnxsSftnbe-mMrmOuWn47BA$ > > > > > > The state space for a 4 broker deployment with a Qw 3, Qa 2 ledger is > > > pretty large and takes days to run on a 96 core beast. However, you can > > run > > > it with smaller ensemble sizes to run in mere hours. > > > > > > Check it out if you're interested in formal verification. Also if there > > are > > > other aspects of BookKeeper that could benefit from verification let me > > > know. > > > > > > Best regards > > > Jack > > > > > > > > > -- > > > *Jack Vanlightly* > > > Principal Software Engineer > > > Splunk Inc. > > > jvanligh...@splunk.com <kramas...@splunk.com> > > > Barcelona > > > > > > > > > -- > > Karan Mehta > > > > <https://smart.salesforce.com/sig/k.mehta//us_mb/default/link.html> > > > -- Karan Mehta <https://smart.salesforce.com/sig/k.mehta//us_mb/default/link.html>