Hi all,

It has been a great and lively discussion, one of the most active topics in the BK community recently. Kudos to Enrico for starting this.

Enrico, Sijie and I met to discuss this further and reached the following consensus on how to move forward:

* Introduce LedgerType/LedgerProperties, stored in the ZK metadata.
* No changes to the AddEntry API (application view), but the AddEntry RPC will carry a flag that tells bookies the type/durability.
* Introduce a sync() RPC, which must be called explicitly on RD ledgers.
* No changes to the LAC or to how we update it.
* No changes to the behavior of the readEntries() API, which reads only up to the LAC.
* Applications can use the readUnconfirmed API to read up to the last add pushed.
* Segregate stats by ledger type.

Enrico is going to merge the two docs and publish a detailed design.

Thanks a lot, Enrico.

On Mon, Aug 21, 2017 at 10:01 PM, Sijie Guo <guosi...@gmail.com> wrote:

> On Aug 21, 2017 5:44 AM, "Enrico Olivelli" <eolive...@gmail.com> wrote:
>
> As the issue is really huge, I need to narrow the design and implementation efforts to a specific case at the moment: I am interested in having a per-ledger flag to not require fsync of entries on the journal.
>
>
> It is good to narrow down the implementation. However, because different people have different requirements, it would be good to discuss and cover all thoughts.
>
>
> If the "no-sync" flag is applied per ledger, then we have to decide what to do about the LAC protocol. I see two opposite ways:
> 1) the LAC will never advance (no fsync is guaranteed on the journal)
> 2) the LAC is advanced as usual, but it will be possible to have missing entries
>
>
> Personally I am -1 on approach 2), for the reasons I stated in previous emails.
>
>
> There is a "gray" situation:
> 3) as entries will be interleaved on the journal with entries of other "sync" ledgers, it will be possible to detect some kind of "synced" entries and return the info to the writing client, which in turn will be able to advance the LAC:
> this option is not useful, as the behavior is unpredictable
>
> For my "urgent" use case I would prefer 2), but 1) is possible too, because I am using LedgerHandleAdv (I have manual allocation of entry ids) + readUnconfirmedEntries (which allows reading entries even if the LAC did not advance)
>
>
> As JV suggested, please start the design doc and let's iterate over it before the implementation.
>
>
> -- Enrico
>
>
> 2017-08-19 14:09 GMT+02:00 Enrico Olivelli <eolive...@gmail.com>:
>
> >
> >
> > On Fri, Aug 18, 2017, 20:12 Sijie Guo <guosi...@gmail.com> wrote:
> >
> >> /cc (distributedlog-dev@)
> >>
> >> I know JV has similar use cases. This might require a broad discussion. The trickiest part would be the LAC protocol: when can the client advance the LAC? I think a BP, initially with a Google doc shared with the community, would be a good way to start the discussion, because I would expect a lot of points to discuss for this topic. Once we finalize the details, we can copy the Google doc content back to the wiki page.
> >>
> >
> > Thank you Sijie and JV for pointing me in the right direction.
> > I had underestimated the problems related to ensemble changes, and in my projects it can indeed happen that a single 'transaction' spans more than one ledger, so the ordering issues are more complex than I expected. Even if it were somehow possible to keep ordering inside the scope of a single ledger, it is very hard to get it using multiple ledgers.
> >
> > Next week I will write the doc, but I think I am going to split the problem into multiple parts.
> > I see that the LAC must be advanced only when an fsync is done.
> > This will preserve correctness, as Sijie said.
> >
> > I think that the problems related to the ordering of events must be addressed at the application level, and it would be best to have such support in DL.
> >
> > For instance, at first glance I imagine that we should add some support in BK so that the writer application can more easily receive notifications of changes to the LAC.
> >
> > The first step would be to add a new flag to addEntry to receive the acknowledgement on fwrite and flush (with the needed changes to the journal), plus a flag in the add response which tells whether the entry has been synced or only flushed, and to handle the LAC according to this information.
> >
> > Other comments inline
> > Enrico
> >
> >
> >> Other comments inline:
> >>
> >>
> >> On Thu, Aug 17, 2017 at 4:42 AM, Enrico Olivelli <eolive...@gmail.com> wrote:
> >>
> >> > Hi,
> >> > I am working with my colleagues on an implementation to relax the constraint that every acknowledged entry must have been successfully written and fsynced to disk at the journal level.
> >> >
> >> > The idea is to have a flag in addEntry to ask for the acknowledgement not after the fsync in the journal, but as soon as the data has been successfully written and flushed to the OS.
> >> >
> >> > I have the requirement that if an entry requires sync, all the entries successfully sent 'before' that entry (causality) are synced too, even if they have been added with the new relaxed durability flag.
> >>
> >>
> >> > Imagine a database transaction log: during a transaction I will write every change to data to the WAL with the new flag, and only the commit transaction command will be added with the sync requirement.
> >> > The idea is that all the changes inside the scope of the transaction have a meaning only if the transaction is committed, so it is important that the commit entry won't be lost, and that if that entry isn't lost, none of the other entries of the same transaction are lost either.
> >> >
> >>
> >> Can you do:
> >>
> >> - lh.asyncAddEntry('entry-1')
> >> - lh.asyncAddEntry('entry-2')
> >> - lh.addEntry('commit')
> >>
> >> ?
> >>
> >
> > Yes, currently it is the best we can do and I am doing so
> >
> >
> >> Does this work for you? If it doesn't, what is the problem? Do you have any performance numbers to support why this doesn't work?
> >>
> >
> > I do not have numbers for this case; in general, limiting the number of fsyncs could bring better performance.
> > It is hard to play with the grouping settings in the journal
> >
> >
> >>
> >> >
> >> > I have another use case. In another project I am storing binary objects in BK and I have to obtain great performance even on single-disk bookie layouts (journal + data + index on the same partition).
> >> > In this project it is acceptable to compensate for the risk of not doing fsync by requesting enough replication.
> >> > IMHO it would be somewhat like the Kafka idea of durability: as far as I know, Kafka by default does not impose fsync but leaves it all to the OS and to the fact that there is a minimum configurable number of replicas which are in sync.
> >>
> >>
> >>
> >> When you are talking about Kafka durability, what durability level are you looking for? Are you looking for replication durability without fsync?
> >>
> >
> > Yes, the client waits for acks from a number of brokers, which have not necessarily performed fsync.
> > The risk of data loss is mitigated by replication
> >
> >
> >>
> >> >
> >> > There are many open points, already suggested by Matteo, JV and Sijie:
> >> > - LAC protocol?
> >> > - replication in case of lost entries?
> >> > - under production load, mixing non-synced entries with synced entries will not give much benefit
> >> >
> >>
> >> A couple of thoughts on this feature:
> >>
> >> 1) We should always stick to one rule: the LAC should only be advanced on receiving the acknowledgement of entries (persisted on disk after fsync; it can bypass the journal if necessary). That way all the assumptions for LAC and replication remain the same and no change is needed.
> >>
> >> 2) Separating the acknowledgement of replication from the acknowledgement of fsync (LAC) can achieve 'replicated durability without fsync' while still maintaining the correctness of the LAC. That means:
> >>
> >> an add request (no-sync) can be completed after receiving enough responses from bookies; however, the response to a (no-sync) add can't advance the LAC. The LAC can only be advanced on the acknowledgement of sync adds.
> >>
> >>
> >> 3) Request ordering and ensemble changes will make it complicated to ensure correctness. The elegance of the current replication durability with fsync is that you don't rely on request ordering or physical layout to ensure ordering and correctness. However, if you relax durability and mix no-sync adds and sync adds, you have to pay attention to request ordering and flush ordering to ensure correctness, which is going to make things tricky and complicated.
> >>
> >>
> >>
> >> >
> >> >
> >> > For the LAC protocol I think that there is no impact: the point is that the LastAddConfirmed is the max entry id which is known to have been acknowledged to the writer, so durability is not a concern.
> >> > You can lose entries even with fsync, just by losing all the disks which contain the data. Without fsync it is just more probable.
> >> >
> >>
> >> I am against relaxing durability for the LAC protocol, because that is the foundation of correctness.
> >>
> >> I would prefer advancing the LAC only when entries are replicated and durably synced to disk.
> >>
> >
> > Yes. Now I am convinced
> >
> >>
> >> >
> >> > Replication: maybe we should write in the ledger metadata that the ledger allows this feature and deal with it. But I am not sure; I have to understand better how LedgerHandleAdv deals with sparse entry ids inside the re-replication process
> >> >
> >>
> >> Replication should not be changed if we stick to the same LAC behavior.
> >>
> >>
> >> >
> >> > Mixed workload: honestly I would like to add this feature to limit the number of fsyncs, and I expect to have lots of bursts of unsynced entries interleaved with a few synced entries. I know that this feature is not to be encouraged in general but only for specific cases, like the stories of LedgerHandleAdv or readUnconfirmedEntries
> >> >
> >> > If this makes sense to you I will create a BP and attach a first patch
> >> >
> >>
> >> sure
> >>
> >>
> >> >
> >> > Enrico
> >> >
> >> > --
> >> >
> >> > -- Enrico Olivelli
> >> >
> >
> > --
> >
> > -- Enrico Olivelli
>

--
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then you win. - Mahatma Gandhi
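
P.S. For illustration only, the consensus points listed at the top of this message (adds are acknowledged without advancing the LAC, an explicit sync() advances it, readEntries() stops at the LAC, and the unconfirmed read goes up to the last add pushed) can be modelled with a toy in-memory sketch. The class and method names below are hypothetical and are not BookKeeper APIs; there are no bookies, RPCs or journals here, only the bookkeeping of the two counters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a relaxed-durability (RD) ledger.
// None of these names are BookKeeper APIs; this only mirrors the rules
// agreed in the thread.
class RelaxedDurabilityLedgerSketch {
    private final List<String> entries = new ArrayList<>();
    private long lastAddPushed = -1;     // highest entry id acknowledged without fsync
    private long lastAddConfirmed = -1;  // LAC: highest entry id covered by a sync

    // No-sync add: acknowledged to the writer, but the LAC is left untouched.
    long addEntry(String payload) {
        entries.add(payload);
        lastAddPushed = entries.size() - 1;
        return lastAddPushed;
    }

    // Explicit sync: everything pushed so far is treated as durable,
    // so the LAC may advance up to the last add pushed.
    long sync() {
        lastAddConfirmed = lastAddPushed;
        return lastAddConfirmed;
    }

    // Regular read: returns entries only up to the LAC.
    List<String> readEntries() {
        return new ArrayList<>(entries.subList(0, (int) (lastAddConfirmed + 1)));
    }

    // Unconfirmed read: returns entries up to the last add pushed.
    List<String> readUnconfirmedEntries() {
        return new ArrayList<>(entries);
    }

    long lastAddConfirmed() {
        return lastAddConfirmed;
    }

    public static void main(String[] args) {
        RelaxedDurabilityLedgerSketch ledger = new RelaxedDurabilityLedgerSketch();
        ledger.addEntry("entry-1");
        ledger.addEntry("entry-2");
        System.out.println("before sync, readEntries: " + ledger.readEntries());
        System.out.println("before sync, unconfirmed: " + ledger.readUnconfirmedEntries());
        ledger.addEntry("commit");
        ledger.sync();
        System.out.println("after sync, readEntries: " + ledger.readEntries());
    }
}
```

In this model a reader using readEntries() sees nothing until the writer calls sync(), which is the price of keeping the LAC semantics unchanged; a reader that accepts the risk can use the unconfirmed read instead.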