Thanks Sijie, I will do my best.
I can try to separate the work into:
1) protocol changes (protobuf)
2) new client-side API
3) LAC protocol changes, bookie-side changes
4) additional tests
Actually, I already have a private work-in-progress branch with the full stack. I will finish implementing the document and then split it into pieces.

By the way, I left one comment on the doc about the retention of the SyncCounter on the bookie side.

-- Enrico

2017-09-12 10:08 GMT+02:00 Sijie Guo <guosi...@gmail.com>:

> Cool.
>
> I would expect this is a big change. It would be good if you can divide it into smaller tasks, so people can review them more easily.
>
> - Sijie
>
> On Tue, Sep 12, 2017 at 1:05 AM, Enrico Olivelli <eolive...@gmail.com> wrote:
>
> > Thank you all!
> >
> > I will copy the content of the final draft to the wiki and mark the document as "Accepted".
> >
> > I will send a PR soon, but it will depend on BP-15 New CreateLedger API.
> >
> > I hope we can make it for 4.6.
> >
> > Enrico
> >
> > 2017-09-11 18:58 GMT+02:00 Sijie Guo <guosi...@gmail.com>:
> >
> > > Enrico,
> > >
> > > Feel free to close the thread and mark this BP as accepted, if there is no -1.
> > >
> > > - Sijie
> > >
> > > On Mon, Sep 11, 2017 at 2:26 AM, Enrico Olivelli <eolive...@gmail.com> wrote:
> > >
> > > > Ping
> > > >
> > > > 2017-09-07 9:32 GMT+02:00 Enrico Olivelli <eolive...@gmail.com>:
> > > >
> > > > > Hi all,
> > > > >
> > > > > You can find the revised proposal here:
> > > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-14+Relax+durability
> > > > >
> > > > > The link to the document open for comments is this:
> > > > > https://docs.google.com/document/d/1yNi9t2_deOOMXDaGzrnmaHTQeB3B3Fnym82DUERH7LM/edit?usp=sharing
> > > > >
> > > > > Please check it out.
> > > > > We are going to review this proposal at the meeting.
> > > > >
> > > > > -- Enrico
> > > > >
> > > > > 2017-08-30 8:56 GMT+02:00 Enrico Olivelli <eolive...@gmail.com>:
> > > > >
> > > > >> Thank you Sijie for summarizing, and thanks to the community for helping with this important enhancement to BookKeeper.
> > > > >>
> > > > >> I am convinced that, as JV pointed out, we need to declare at ledger creation time that the ledger is going to perform no-sync writes.
> > > > >>
> > > > >> I think we currently need an explicit declaration to make things "clear" to the developer who is using the LedgerHandle API, even at ledger creation time.
> > > > >>
> > > > >> The point is that we are going to forbid "striping" ledgers (ensemble size > quorum size) for no-sync writes in the first implementation:
> > > > >> - one option is to fail at the first no-sync addEntry, but this would be really uncomfortable because the ack/write/ensemble sizes are usually configured by the admin, and there would be configurations in which errors come out only after starting the system.
> > > > >> - the second option is to make the developer explicitly enable no-sync writes at creation time and fail the creation of the ledger if the requested combination of options is not possible
> > > > >>
> > > > >> I am not sure that the changes to the bookie internals are a client-API matter. Maybe we can leverage custom metadata (as JV said) in order to make the bookie handle ledgers in a different manner; this path will always be open, as custom metadata is already there.
> > > > >>
> > > > >> JV preferred the ledger-type approach; the dual solution is to introduce a list of "capabilities" or "ledger options".
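The second option above (reject the combination when the ledger is created, not at the first no-sync addEntry) could be sketched roughly as follows. This is a minimal Python model, not BookKeeper code: `LedgerOptions`, `validate_options`, `create_ledger` and the `allow_no_sync_writes` flag are hypothetical names, and "quorum size" in the striping rule is assumed to mean the write quorum size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerOptions:
    """Hypothetical creation-time options; not the real BookKeeper API."""
    ensemble_size: int
    write_quorum_size: int
    ack_quorum_size: int
    allow_no_sync_writes: bool = False


def validate_options(opts: LedgerOptions) -> None:
    """Raise ValueError when the requested combination is not possible."""
    if not (1 <= opts.ack_quorum_size <= opts.write_quorum_size <= opts.ensemble_size):
        raise ValueError("need 1 <= ack quorum <= write quorum <= ensemble size")
    # Assumption: a ledger is "striped" when ensemble size > write quorum size.
    striped = opts.ensemble_size > opts.write_quorum_size
    if opts.allow_no_sync_writes and striped:
        raise ValueError("no-sync writes are not allowed on striped ledgers")


def create_ledger(opts: LedgerOptions) -> dict:
    """Fail at creation time instead of at the first no-sync addEntry."""
    validate_options(opts)
    return {"options": opts, "entries": []}
```

With this shape, an admin-supplied configuration such as ensemble 3 / write quorum 2 combined with no-sync writes is rejected before any entry is written, which is the behaviour the second option asks for.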
> > > > >> I think that this ability to perform no-sync writes is so important that "custom metadata" is not the right place to declare it; the same goes for "ledger type".
> > > > >>
> > > > >> So I am proposing to add a boolean 'allowNoSyncWrites' at ledger creation time, without writing it to the ledger metadata on ZK. If further improvements need ledger metadata changes, we will make them then.
> > > > >>
> > > > >> I have updated the BP-14 document and added an "Open issues" footer with the open points. Please add comments and I will correct the document as soon as possible.
> > > > >>
> > > > >> Enrico
> > > > >>
> > > > >> 2017-08-30 1:24 GMT+02:00 Sijie Guo <guosi...@gmail.com>:
> > > > >>
> > > > >>> Thank you, Enrico, JV.
> > > > >>>
> > > > >>> These are great discussions.
> > > > >>>
> > > > >>> After reading these two proposals, I have a few very high-level comments, divided into three categories.
> > > > >>>
> > > > >>> *API*
> > > > >>>
> > > > >>> - I think there are no fundamental differences between these two proposals. They are trying to achieve similar goals by exposing durability levels in different ways. So this will be a discussion on what the API/interface should look like from the user/admin perspective. I would suggest focusing on what the API itself would be, putting the implementation design aside when talking about this.
> > > > >>>
> > > > >>> *Core*
> > > > >>>
> > > > >>> - Both proposals need to deal with a core function: what happens to LAC and what semantic BookKeeper provides. JV did a good summary in his proposal. However, I am not a fan of maintaining two different semantics.
> > > > >>> So I am looking for a solution in which BookKeeper maintains only one semantic. The semantic is basically:
> > > > >>>
> > > > >>> 1) LAC is only advanced when the entries before LAC are committed to persistent storage
> > > > >>> 2) All the entries up to LAC are successfully committed to persistent storage
> > > > >>> 3) Entries up to LAC: all the entries must be readable all the time.
> > > > >>>
> > > > >>> If we maintain such a semantic, there is no need to change the auto-recovery protocol in BookKeeper. All we guarantee is that the entries are durably persisted.
> > > > >>>
> > > > >>> In order to maintain such a semantic, I think both JV and I proposed a similar solution in our proposals. I am trying to finalize one here:
> > > > >>>
> > > > >>> * bookie maintains a LAS (Last Add Synced) point for each entry.
> > > > >>> * LAS can be piggybacked on AddResponses
> > > > >>> * Client uses the LAS to advance LAC.
> > > > >>>
> > > > >>> If we can agree on the core semantic we are going to provide, the other things are just logistics.
> > > > >>>
> > > > >>> *Others*
> > > > >>>
> > > > >>> - Regarding separating the journal or bypassing the journal, there is no difference in terms of the core semantic. They are all non-durable writes (acknowledging before fsyncing). We can start with the same-journal approach (but just acknowledge before fsyncing), implement the core, and add other options later on.
> > > > >>>
> > > > >>> From my point of view, I'd be more interested in providing a single consistent durable semantic that applications can rely on for both durable writes and non-durable writes. The other stuff seems to be more about logistics.
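The three LAS bullets above can be illustrated with a small client-side model. This is only a sketch under stated assumptions, not the protocol as implemented: `LacTracker` and `on_add_response` are hypothetical names, the ledger is assumed to be non-striped (every bookie stores every entry), each bookie is assumed to report one LAS value per ledger, and the rule "LAC is the highest entry id synced by at least an ack quorum of bookies" is an assumption chosen to satisfy points 1) to 3).

```python
class LacTracker:
    """Hypothetical client-side tracker for one non-striped ledger."""

    def __init__(self, bookies, ack_quorum_size):
        if not 1 <= ack_quorum_size <= len(bookies):
            raise ValueError("ack quorum must be between 1 and the number of bookies")
        self.ack_quorum_size = ack_quorum_size
        # -1 means "nothing synced yet" for that bookie.
        self.last_add_synced = {bookie: -1 for bookie in bookies}
        self.lac = -1

    def on_add_response(self, bookie, las):
        """Record the LAS piggybacked on an add response and return the LAC."""
        # A late or reordered response must not move a bookie's LAS backwards.
        self.last_add_synced[bookie] = max(self.last_add_synced[bookie], las)
        # Assumed rule: highest entry id synced by at least ack_quorum_size bookies.
        synced = sorted(self.last_add_synced.values(), reverse=True)
        candidate = synced[self.ack_quorum_size - 1]
        # The LAC itself never goes backwards either.
        self.lac = max(self.lac, candidate)
        return self.lac
```

In this model the LAC only moves once enough bookies report that they have fsynced, so a no-sync add that was acknowledged early does not become visible as committed until the LAS values catch up.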
> > > > >>>
> > > > >>> - Sijie
> > > > >>>
> > > > >>> On Mon, Aug 28, 2017 at 11:27 PM, Enrico Olivelli <eolive...@gmail.com> wrote:
> > > > >>>
> > > > >>> > 2017-08-29 8:01 GMT+02:00 Venkateswara Rao Jujjuri <jujj...@gmail.com>:
> > > > >>> >
> > > > >>> > > I don't believe I fully followed your second case. But even in this case, is your major concern the additional 'sync' RPC?
> > > > >>> >
> > > > >>> > Yes. Apart from that I am fine with your proposal too, that is, to have a LedgerType which drives durability, and I think we need to add per-entry durability options.
> > > > >>> >
> > > > >>> > I think that at least for the 'simple' no-sync addEntry we do not need to change many things. I am drafting a prototype and will share it as soon as we all agree on the roadmap.
> > > > >>> >
> > > > >>> > The first implementation can cover the first cases (no-sync addEntry) and change the way the writer advances the LAC in order to support 'relaxed durability writes'. This change will be compatible with future improvements and will open the door to big changes on the bookie side, like bypassing the journal or leveraging multiple journals.
> > > > >>> >
> > > > >>> > -- Enrico
> > > > >>> >
> > > > >>> > or something else that the LedgerType proposal won't work for?
> > > > >>> > >
> > > > >>> > > On Mon, Aug 28, 2017 at 7:35 AM, Enrico Olivelli <eolive...@gmail.com> wrote:
> > > > >>> > >
> > > > >>> > > > I think that having a set of options in the ledger metadata will be a good enhancement, and I am sure we will do it as soon as it is needed; maybe we do not need it now.
> > > > >>> > > >
> > > > >>> > > > Actually, I think we will need to declare this durability level at the entry level to support some use cases in the BP-14 document. Let me explain two of my use cases for which I need it.
> > > > >>> > > >
> > > > >>> > > > At a high level we have two choices:
> > > > >>> > > >
> > > > >>> > > > A) per-ledger durability options (JV's proposal)
> > > > >>> > > > all addEntry operations are durable or non-durable and there is an explicit 'sync' API (+ forced sync at close)
> > > > >>> > > >
> > > > >>> > > > B) per-entry durability options (original BP-14 proposal)
> > > > >>> > > > every addEntry has its own durable/non-durable option (sync/no-sync), with the ability to call 'sync' without addEntry (+ forced sync at close)
> > > > >>> > > >
> > > > >>> > > > I am speaking about the database WAL case: I am using the ledger as a segment of the WAL of a database, and I am writing all data changes in the scope of a 'transaction' with the relaxed-durability flag. Then I am writing the 'transaction committed' entry with the "strict durability" requirement; this will in fact require that all previous entries are
> > > > >>> > > > persisted durably, and so the transaction will never be lost.
> > > > >>> > > >
> > > > >>> > > > In this scenario we would in fact need an addEntry + sync API:
> > > > >>> > > >
> > > > >>> > > > using option A) the WAL will look like:
> > > > >>> > > > - open ledger no-sync = true
> > > > >>> > > > - addEntry (set foo=bar) (this will be no-sync)
> > > > >>> > > > - addEntry (set foo=bar2) (this will be no-sync)
> > > > >>> > > > - addEntry (commit)
> > > > >>> > > > - sync
> > > > >>> > > >
> > > > >>> > > > using option B) the WAL will look like:
> > > > >>> > > > - open ledger
> > > > >>> > > > - addEntry (set foo=bar), no-sync
> > > > >>> > > > - addEntry (set foo=bar2), no-sync
> > > > >>> > > > - addEntry (commit), sync
> > > > >>> > > >
> > > > >>> > > > In case B) we are "saving" one RPC call to every bookie (the 'sync' one). The same holds for single data-change entries, like updating a single record in the database: with BK 4.5 this "costs" only a single RPC to every bookie.
> > > > >>> > > >
> > > > >>> > > > Second case:
> > > > >>> > > > I am using BookKeeper to store binary objects, so I am packing several 'objects' (named sequences of bytes) into a single ledger, like you do when you write many records to a file in a streaming fashion and keep track of the offsets of the beginning of every record (LedgerHandleAdv is perfect for this case).
> > > > >>> > > > I am not using a single ledger per 'file' because creating many ledgers very fast kills ZooKeeper. In my systems I have big bursts of writes, which need to be really "fast", so I am writing multiple 'files' to every single ledger. So close-to-open consistency at the ledger level is not suitable for this case.
> > > > >>> > > > I have to write as fast as possible to this 'ledger-backed' stream, and as with a 'traditional' filesystem I am writing parts of each file and then requiring 'sync' at the end of each file.
> > > > >>> > > > Using BookKeeper you need to split big 'files' into "little" parts; you cannot transmit the contents as a "real" stream on the network.
> > > > >>> > > >
> > > > >>> > > > I am not talking about bookie-level implementation details. I would like to define the high-level API in order to support all the relevant known use cases and leave room for the future.
> > > > >>> > > > At this moment adding a per-entry 'durability option' seems to be very flexible and simple to implement, and it does not prevent us from doing further improvements, such as skipping the journal.
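The two WAL sequences in this message (option A with an explicit sync, option B with a per-entry flag) can be compared with a tiny recording model. This is a hedged sketch, not the BookKeeper client: `RecordingLedger`, `write_tx_per_ledger` and `write_tx_per_entry` are hypothetical names, and each recorded tuple stands for one RPC sent to every bookie of the ledger.

```python
class RecordingLedger:
    """Hypothetical stand-in that only records the RPCs sent to each bookie."""

    def __init__(self):
        self.rpcs = []

    def add_entry(self, payload, sync=True):
        # Option B: durability is chosen per entry through the 'sync' flag.
        self.rpcs.append(("add", payload, sync))

    def sync(self):
        # Option A: an explicit sync is a separate RPC.
        self.rpcs.append(("sync", None, True))


def write_tx_per_ledger(ledger, changes):
    """Option A: every add is no-sync, then an explicit sync follows the commit."""
    for change in changes:
        ledger.add_entry(change, sync=False)
    ledger.add_entry("commit", sync=False)
    ledger.sync()


def write_tx_per_entry(ledger, changes):
    """Option B: only the commit entry asks for strict durability."""
    for change in changes:
        ledger.add_entry(change, sync=False)
    ledger.add_entry("commit", sync=True)
```

For a transaction with the two changes used in the example (set foo=bar, set foo=bar2), this model records four RPCs per bookie for option A and three for option B, which is the one saved 'sync' call mentioned above.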
> > > > >>> > > >
> > > > >>> > > > Enrico
> > > > >>> > > >
> > > > >>> > > > 2017-08-26 19:55 GMT+02:00 Enrico Olivelli <eolive...@gmail.com>:
> > > > >>> > > >
> > > > >>> > > > > On Sat, Aug 26, 2017, 19:19 Venkateswara Rao Jujjuri <jujj...@gmail.com> wrote:
> > > > >>> > > > >
> > > > >>> > > > >> Hi all,
> > > > >>> > > > >>
> > > > >>> > > > >> As promised during Thursday's call, here is my proposal.
> > > > >>> > > > >>
> > > > >>> > > > >> *NOTE*: The major difference in this proposal compared to Enrico’s <https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit#heading=h.q2rewiqndr5v> is making durability a property of the ledger (type) as opposed to addEntry(). The rest of the technical details have a lot of similarities.
> > > > >>> > > > >>
> > > > >>> > > > >
> > > > >>> > > > > Thank you JV. I have just read the doc quickly and your view is certainly broader.
> > > > >>> > > > > I will dig into the doc as soon as possible on Monday.
> > > > >>> > > > > For me it is OK to have a ledger-wide configuration. I think that the most important decision is about the API we will provide, as it will be difficult to change it in the future.
> > > > >>> > > > > > > > > >>> > > > > > > > > >>> > > > > Cheers > > > > >>> > > > > Enrico > > > > >>> > > > > > > > > >>> > > > > > > > > >>> > > > > > > > > >>> > > > >> https://docs.google.com/document/d/ > > > 1g1eBcVVCZrTG8YZliZP0LVqv > > > > >>> Wpq43 > > > > >>> > > > >> 2ODEghrGVQ4d4Q/edit?usp=sharing > > > > >>> > > > >> > > > > >>> > > > >> On Thu, Aug 24, 2017 at 1:14 AM, Enrico Olivelli < > > > > >>> > eolive...@gmail.com > > > > >>> > > > > > > > >>> > > > >> wrote: > > > > >>> > > > >> > > > > >>> > > > >> > Thank you all for the comments and for taking a look > to > > > the > > > > >>> > document > > > > >>> > > > so > > > > >>> > > > >> > soon. > > > > >>> > > > >> > I have updated the doc, we will discuss the document > at > > > the > > > > >>> > meeting, > > > > >>> > > > >> > > > > > >>> > > > >> > > > > > >>> > > > >> > Enrico > > > > >>> > > > >> > > > > > >>> > > > >> > 2017-08-24 2:27 GMT+02:00 Sijie Guo < > guosi...@gmail.com > > >: > > > > >>> > > > >> > > > > > >>> > > > >> > > Enrico, > > > > >>> > > > >> > > > > > > >>> > > > >> > > Thank you so much! It is a great effort for putting > > this > > > > up. > > > > >>> > > Overall > > > > >>> > > > >> > looks > > > > >>> > > > >> > > good. I made some comments, we can discuss at > > tomorrow's > > > > >>> > community > > > > >>> > > > >> > meeting. 
> > > > >>> > > > >> > > > > > > >>> > > > >> > > - Sijie > > > > >>> > > > >> > > > > > > >>> > > > >> > > On Wed, Aug 23, 2017 at 8:25 AM, Enrico Olivelli < > > > > >>> > > > eolive...@gmail.com > > > > >>> > > > >> > > > > > >>> > > > >> > > wrote: > > > > >>> > > > >> > > > > > > >>> > > > >> > > > Hi all, > > > > >>> > > > >> > > > I have drafted a first proposal for BP-14 - Relax > > > > >>> Durability > > > > >>> > > > >> > > > > > > > >>> > > > >> > > > We are talking about limiting the number of fsync > to > > > the > > > > >>> > journal > > > > >>> > > > >> while > > > > >>> > > > >> > > > preserving the correctness of the LAC protocol. > > > > >>> > > > >> > > > > > > > >>> > > > >> > > > This is the link to the wiki page, but as the > issue > > is > > > > >>> huge we > > > > >>> > > > >> prefer > > > > >>> > > > >> > to > > > > >>> > > > >> > > > use Google Documents for sharing comments > > > > >>> > > > >> > > > https://cwiki.apache.org/ > > > confluence/display/BOOKKEEPER/ > > > > >>> > > > >> > > > BP+-+14+Relax+durability > > > > >>> > > > >> > > > > > > > >>> > > > >> > > > This is the document > > > > >>> > > > >> > > > https://docs.google.com/document/d/ > > > > 1JLYO3K3tZ5PJGmyS0YK_- > > > > >>> > > > >> > > > NW8VOUUgUWVBmswCUOG158/edit?usp=sharing > > > > >>> > > > >> > > > > > > > >>> > > > >> > > > All comments are welcome > > > > >>> > > > >> > > > > > > > >>> > > > >> > > > I have added DL dev list in cc as the discussion > is > > > > >>> > interesting > > > > >>> > > > for > > > > >>> > > > >> > both > > > > >>> > > > >> > > > groups > > > > >>> > > > >> > > > > > > > >>> > > > >> > > > Enrico Olivelli > > > > >>> > > > >> > > > > > > > >>> > > > >> > > > > > > >>> > > > >> > > > > > >>> > > > >> > > > > >>> > > > >> > > > > >>> > > > >> > > > > >>> > > > >> -- > > > > >>> > > > >> Jvrao > > > > >>> > > > >> --- > > > > >>> > > > >> First they ignore you, then they laugh at you, then they > > > fight > > > > >>> you, > > > > >>> > > then 
> > > > >>> > > > >> you win. - Mahatma Gandhi
> > > > >>> > > > >>
> > > > >>> > > > > --
> > > > >>> > > > >
> > > > >>> > > > > -- Enrico Olivelli
> > > > >>> > >
> > > > >>> > > --
> > > > >>> > > Jvrao
> > > > >>> > > ---
> > > > >>> > > First they ignore you, then they laugh at you, then they fight you, then you win. - Mahatma Gandhi