Re: Relax durability

Venkateswara Rao Jujjuri Thu, 31 Aug 2017 09:53:55 -0700

Hi all,

It has been a great and lively discussion. I can say this is one of the
most active topics in recent BK community discussions.
Kudos to Enrico for starting this.


Enrico, Sijie and I met and discussed this further and came up with the
following consensus on how to move forward.

* Introduce LedgerType/LedgerProperties which goes into ZK metadata.
* No changes to AddEntry API (application view); but AddEntry RPC will add
a flag to bookies to inform about the type/durability.
* Introduce a sync() RPC which needs to be called explicitly on RD (relaxed
durability) ledgers.
* No changes to LAC and how we update it.
* No changes to the behavior of readEntries() API, which reads only until
LAC.
* Applications can use the readUnconfirmedEntries API to read until the last
add pushed.
* Segregate stats based on the ledgertype.
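
As a rough illustration of the points above, here is a toy in-memory model in
Python. It is not BookKeeper code: ToyLedger, LedgerType.RELAXED and the
method names are hypothetical, and real bookies, ensembles and the ZK metadata
are left out. It only shows the intended relationship between adds, sync(),
the LAC and the two read paths.

```python
from dataclasses import dataclass, field
from enum import Enum


class LedgerType(Enum):
    # Hypothetical values; the thread only says that a ledger type/property
    # would be recorded in the ledger metadata.
    DURABLE = "durable"   # adds are acknowledged after the journal fsync
    RELAXED = "relaxed"   # "RD" ledger: adds are acknowledged before fsync


@dataclass
class ToyLedger:
    ledger_type: LedgerType
    entries: list = field(default_factory=list)
    last_add_pushed: int = -1      # highest entry id acknowledged to the writer
    last_add_confirmed: int = -1   # LAC: highest entry id known to be synced

    def add_entry(self, data: bytes) -> int:
        """Append an entry; the application-facing call is the same for both types."""
        self.entries.append(data)
        self.last_add_pushed = len(self.entries) - 1
        if self.ledger_type is LedgerType.DURABLE:
            # In this simplified model a durable add is synced when it is
            # acknowledged, so the LAC follows immediately.
            self.last_add_confirmed = self.last_add_pushed
        return self.last_add_pushed

    def sync(self) -> int:
        """Explicit sync: everything pushed so far is durable, so the LAC advances."""
        self.last_add_confirmed = self.last_add_pushed
        return self.last_add_confirmed

    def read_entries(self) -> list:
        """Read only up to the LAC."""
        return self.entries[: self.last_add_confirmed + 1]

    def read_unconfirmed_entries(self) -> list:
        """Read up to the last add pushed, including entries not yet synced."""
        return self.entries[: self.last_add_pushed + 1]


if __name__ == "__main__":
    rd = ToyLedger(LedgerType.RELAXED)
    rd.add_entry(b"change-1")
    rd.add_entry(b"change-2")
    print(rd.read_entries())              # nothing is confirmed before sync()
    print(rd.read_unconfirmed_entries())  # both entries are visible here
    rd.sync()
    print(rd.read_entries())              # both entries are confirmed now
```

On an RD ledger in this model, a reader using read_entries() sees nothing until
the writer calls sync(), while read_unconfirmed_entries() sees every pushed
entry.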


Enrico is going to merge the two docs and publish a detailed design. Thanks a
lot, Enrico.


On Mon, Aug 21, 2017 at 10:01 PM, Sijie Guo <guosi...@gmail.com> wrote:

> On Aug 21, 2017 5:44 AM, "Enrico Olivelli" <eolive...@gmail.com> wrote:
>
> As the issue is really huge, I need to narrow the design and implementation
> efforts to a specific case at the moment: I am interested in having a
> per-ledger flag to not require fsync on journal entries.
>
>
> It is good to narrow down the implementation. However, because there are
> different requirements from different people, it would be good to discuss
> and cover all thoughts.
>
>
> If the "no-sync" flag is applied per ledger then we have to decide what to
> do about the LAC protocol. I see two opposite ways:
> 1) the LAC will never advance (no fsync is guaranteed on the journal)
> 2) the LAC is advanced as usual but it will be possible to have missing
> entries
>
>
> Personally I am -1 on approach 2), for the reasons I stated in previous
> emails.
>
>
> There is a "gray" situation:
> 3) as entries will be interleaved on the journal with entries of other
> "sync" ledgers, it will be possible to detect some kind of "synced"
> entries and return the info to the writing client, which in turn will be
> able to advance the LAC.
> This option is not useful as the behavior is unpredictable.
>
> For my "urgent" use case I would prefer 2), but 1) is possible too, because
> I am using LedgerHandleAdv (I have manual allocation of entry ids) +
> readUnconfirmedEntries (which allows reading entries even if the LAC did not
> advance)
>
>
> As JV suggested, please start the design doc and let's iterate over it
> before the implementation.
>
>
> -- Enrico
>
>
> 2017-08-19 14:09 GMT+02:00 Enrico Olivelli <eolive...@gmail.com>:
>
> >
> >
> > On Fri 18 Aug 2017, 20:12 Sijie Guo <guosi...@gmail.com> wrote:
> >
> >> /cc (distributedlog-dev@)
> >>
> >> I know JV has similar use cases. This might require a broad discussion.
> >> The most tricky part would be the LAC protocol: when can the client
> >> advance the LAC? I think a BP, initially with a Google doc shared with
> >> the community, would be good to start the discussion, because I would
> >> expect a lot of points to discuss for this topic. Once we finalize the
> >> details, we can copy the Google doc content back to the wiki page.
> >>
> >
> > Thank you Sijie and JV for pointing me in the right direction.
> > I had underestimated the problems related to ensemble changes, and
> > in my projects it can actually happen that a single 'transaction'
> > spans more than one ledger, so the ordering issues are more complex than
> > I expected. Even if it were somehow possible to keep ordering inside the
> > scope of a single ledger, it is very hard to get it across multiple
> > ledgers.
> >
> > Next week I will write the doc, but I think I am going to split the
> > problem into multiple parts.
> > I see that the LAC must be advanced only when an fsync is done. This will
> > preserve correctness, as Sijie said.
> >
> > I think that the problems related to the ordering of events must be
> > addressed at the application level, and it would be best to have such
> > support in DL.
> >
> > For instance, at first glance I imagine that we should add some support
> > in BK so that the writer application can more easily receive
> > notifications of changes to the LAC.
> >
> > The first step would be to add a new flag to addEntry to receive the
> > acknowledgement on fwrite and flush (with the needed changes to the journal),
> > and a flag in the add response which tells whether the entry has been synced
> > or only flushed, and to handle the LAC according to this information.
> >
> > Other comments inline
> > Enrico
> >
> >
> >
> >
> >
> >> Other comments inline:
> >>
> >>
> >> On Thu, Aug 17, 2017 at 4:42 AM, Enrico Olivelli <eolive...@gmail.com>
> >> wrote:
> >>
> >> > Hi,
> >> > I am working with my colleagues on an implementation to relax the
> >> > constraint that every acknowledged entry must have been successfully
> >> > written and fsynced to disk at journal level.
> >> >
> >> > The idea is to have a flag in addEntry to ask for the acknowledgement not
> >> > after the fsync in the journal but as soon as data has been successfully
> >> > written and flushed to the OS.
> >> >
> >> > I have the requirement that if an entry requires sync, all the entries
> >> > successfully sent 'before' that entry (causality) are synced too, even
> >> > if they have been added with the new relaxed durability flag.
> >>
> >>
> >> > Imagine a database transaction log: during a transaction I will write
> >> > every change to data to the WAL with the new flag, and only the commit
> >> > transaction command will be added with the sync requirement. The idea is
> >> > that all the changes inside the scope of the transaction have a meaning
> >> > only if the transaction is committed, so it is important that the commit
> >> > entry won't be lost, and if that entry isn't lost none of the other
> >> > entries of the same transaction are lost either.
> >> >
> >>
> >> can you do:
> >>
> >> - lh.asyncAddEntry('entry-1')
> >> - lh.asyncAddEntry('entry-2')
> >> - lh.addEntry('commit')
> >>
> >> ?
> >>
> >
> > Yes, currently it is the best we can do and I am doing so
> >
> >
> >> Does this work for you? If it doesn't, what is the problem? Do you have
> >> any performance numbers to support why this doesn't work?
> >>
> >
> > I do not have numbers for this case; in general, limiting the number of
> > fsyncs could bring better performance.
> > It is hard to play with the grouping settings in the journal.
> >
> >
> >>
> >> >
> >> > I have another use case. In another project I am storing binary objects
> >> > in BK and I have to obtain great performance even on single-disk bookie
> >> > layouts (journal + data + index on the same partition). In this project
> >> > it is acceptable to compensate for the risk of not doing fsync by
> >> > requesting enough replication.
> >> > IMHO it will be somewhat like the Kafka idea of durability: as far as I
> >> > know, Kafka by default does not impose fsync but leaves it all to the OS
> >> > and to the fact that there is a configurable minimum number of replicas
> >> > which are in sync.
> >>
> >>
> >>
> >> When you are talking about Kafka durability, what durability level are you
> >> looking for? Are you looking for replication durability without fsync?
> >>
> >
> > Yes, the client waits for acks from a number of brokers, which have not
> > necessarily performed fsync. The data loss risk is mitigated by replication.
> >
> >
> >>
> >> >
> >> > There are many open points, already suggested by Matteo, JV and Sijie:
> >> > - LAC protocol?
> >> > - replication in case of lost entries?
> >> > - under production load, mixing non-synced entries with synced entries
> >> >   will not give much benefit
> >> >
> >>
> >> A couple of thoughts on this feature:
> >>
> >> 1) We should always stick to a rule: LAC should only be advanced on
> >> receiving acknowledgement of entries (persisted on disk after fsync; it can
> >> bypass the journal if necessary), so all the assumptions for LAC and
> >> replication can remain the same and no change is needed.
> >>
> >> 2) Separating the acknowledgement of replication from the acknowledgement
> >> of fsync (LAC) can achieve 'replicated durability without fsync' while
> >> still maintaining the correctness of LAC. That means:
> >>
> >> an add request (no-sync) can be completed after receiving enough responses
> >> from bookies; however, the response to a (no-sync) add can't advance the
> >> LAC. The LAC can only be advanced on acknowledgement of sync adds.
> >>
> >>
> >> 3) Request ordering and ensemble changes will make it complicated to
> >> ensure correctness. The elegance of the current replication durability with
> >> fsync is that you don't rely on request ordering or physical layout to
> >> ensure ordering and correctness. However, if you relax durability and mix
> >> no-sync adds and sync adds, you have to pay attention to request ordering
> >> and flush ordering to ensure correctness, which is going to make things
> >> tricky and complicated.
> >>
> >>
> >>
> >> >
> >> >
> >> > For the LAC protocol I think that there is no impact: the point is that
> >> > the LastAddConfirmed is the max entry id which is known to have been
> >> > acknowledged to the writer, so durability is not a concern. You can lose
> >> > entries even with fsync, just by losing all the disks which contain the
> >> > data. Without fsync it is just more probable.
> >> >
> >>
> >> I am against relaxing durability for the LAC protocol, because that is
> >> the foundation of correctness.
> >>
> >> I would prefer advancing the LAC only when entries are replicated and
> >> durably synced to disks.
> >>
> >
> > Yes. Now I am convinced
> >
> >>
> >>
> >>
> >> >
> >> > Replication: maybe we should write in the ledger metadata that the
> >> > ledger allows this feature and deal with it. But I am not sure; I have
> >> > to understand better how LedgerHandleAdv deals with sparse entry ids
> >> > inside the re-replication process.
> >> >
> >>
> >> Replication should not be changed if we stick to the same LAC behavior.
> >>
> >>
> >> >
> >> > Mixed workload: honestly I would like to add this feature to limit the
> >> > number of fsyncs, and I expect to have lots of bursts of unsynced
> >> > entries interleaved with a few synced entries. I know that this feature
> >> > is not to be encouraged in general but only for specific cases, like
> >> > LedgerHandleAdv or readUnconfirmedEntries.
> >> >
> >> > If this makes sense to you I will create a BP and attach a first patch
> >> >
> >>
> >> sure
> >>
> >>
> >> >
> >> > Enrico
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> >
> >> >
> >> > -- Enrico Olivelli
> >> >
> >>
> > --
> >
> >
> > -- Enrico Olivelli
> >
>



-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi
