Re: Relax durability

Sijie Guo Mon, 21 Aug 2017 22:01:45 -0700

On Aug 21, 2017 5:44 AM, "Enrico Olivelli" <eolive...@gmail.com> wrote:


As the issue is really huge, I need to narrow the design and implementation
efforts to a specific case for the moment: I am interested in having a
per-ledger flag to not require fsync of entries in the journal.


It is good to narrow down the implementation. However, because different
people have different requirements, it would be good to discuss and cover
all the ideas.


If the "no-sync" flag is applied per ledger, then we have to decide what to
do about the LAC protocol. I see two opposite ways:
1) the LAC is never advanced (no fsync is guaranteed on the journal)
2) the LAC is advanced as usual, but it will be possible to have missing
entries


Personally I am -1 on approach 2), for the reasons I stated in previous
emails.


There is also a "gray" option:
3) as entries will be interleaved in the journal with entries of other
"sync" ledgers, it will be possible to detect some "synced" entries and
return this info to the writing client, which in turn will be able to
advance the LAC.
This option is not useful, as the behavior is unpredictable.

For my "urgent" use case I would prefer 2), but 1) is possible too, because
I am using LedgerHandleAdv (I allocate entry ids manually) +
readUnconfirmedEntries (which allows reading entries even if the LAC did not
advance).


As JV suggested, please start the design doc and let's iterate over it
before the implementation.


-- Enrico


2017-08-19 14:09 GMT+02:00 Enrico Olivelli <eolive...@gmail.com>:

>
>
> On Fri, 18 Aug 2017, 20:12 Sijie Guo <guosi...@gmail.com> wrote:
>
>> /cc (distributedlog-dev@)
>>
>> I know JV has similar use cases. This might require a broad discussion.
>> The trickiest part would be the LAC protocol: when can the client
>> advance the LAC? I think a BP, initially as a Google doc shared with the
>> community, would be a good way to start the discussion, because I expect
>> a lot of points to discuss on this topic. Once we finalize the details,
>> we can copy the Google doc content back to the wiki page.
>>
>
> Thank you Sijie and JV for pointing me in the right direction.
> I had underestimated the problems related to ensemble changes, and
> in my projects it can actually happen that a single 'transaction'
> spans more than one ledger, so the ordering issues are more complex than
> I expected. Even if it were possible to keep ordering within the scope
> of a single ledger, it is very hard to achieve it across multiple ledgers.
>
> Next week I will write the doc, but I think I am going to split the
> problem into multiple parts.
> I now see that the LAC must be advanced only when an fsync is done. This
> preserves correctness, as Sijie said.
>
> I think that the problems related to the ordering of events must be
> addressed at the application level, and it would be best to have such
> support in DL.
>
> For instance, at first glance I imagine that we should add some support in
> BK to let the writer application receive notifications of LAC changes
> more easily.
>
> The first step would be to add a new flag to addEntry to receive an
> acknowledgement after write and flush (with the needed changes to the
> journal), to add to the add response a flag which tells whether the entry
> has been synced or only flushed, and to handle the LAC according to this
> information.
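A minimal client-side sketch of that first step, assuming a single bookie that writes entries to its journal in id order and whose fsync covers every entry written before it (the causality requirement discussed elsewhere in this thread). `AddResponse` and its `synced` flag are hypothetical, not BookKeeper's wire protocol: every response completes the add, but the LAC only moves up to the highest synced entry, and never past a gap in the acknowledged entries.

```java
import java.util.TreeSet;

// Hypothetical client-side handling of the proposed add-response flag.
// Assumptions (not BookKeeper's real protocol): a single bookie, which appends
// entries to its journal in entry-id order, and an fsync covers every entry
// written before it, so a "synced" response for entry e implies 0..e are synced.
public class SyncFlagSketch {

    // Hypothetical response: synced == false means "written and flushed to the OS only".
    record AddResponse(long entryId, boolean synced) {}

    private final TreeSet<Long> pending = new TreeSet<>(); // acked ids not yet contiguous
    private long contiguousAcked = -1L;  // every entry 0..contiguousAcked has been acked
    private long highestSynced = -1L;    // highest entry id reported as synced
    private long lastAddConfirmed = -1L;

    // Any successful response completes the add for the application; only the
    // synced flag, together with contiguity, may move the LAC forward.
    void onAddResponse(AddResponse resp) {
        if (resp.entryId() > contiguousAcked) {
            pending.add(resp.entryId());
        }
        while (!pending.isEmpty() && pending.first() == contiguousAcked + 1) {
            contiguousAcked = pending.pollFirst();
        }
        if (resp.synced()) {
            highestSynced = Math.max(highestSynced, resp.entryId());
        }
        // Both bounds only grow, so the LAC is monotonic.
        lastAddConfirmed = Math.min(contiguousAcked, highestSynced);
    }

    long lastAddConfirmed() {
        return lastAddConfirmed;
    }

    public static void main(String[] args) {
        SyncFlagSketch client = new SyncFlagSketch();
        client.onAddResponse(new AddResponse(0, false));
        client.onAddResponse(new AddResponse(1, false));
        System.out.println(client.lastAddConfirmed()); // -1: only flushed so far
        client.onAddResponse(new AddResponse(2, true));
        System.out.println(client.lastAddConfirmed()); // 2: the sync add covers 0..2
    }
}
```

Taking the minimum of the two bounds keeps a synced response that arrives early (before an earlier entry has been acknowledged at all) from pushing the LAC over an entry the client has not heard about yet.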
>
> Other comments inline
> Enrico
>
>> Other comments inline:
>>
>>
>> On Thu, Aug 17, 2017 at 4:42 AM, Enrico Olivelli <eolive...@gmail.com>
>> wrote:
>>
>> > Hi,
>> > I am working with my colleagues on an implementation to relax the
>> > constraint that every acknowledged entry must have been successfully
>> > written and fsynced to disk at the journal level.
>> >
>> > The idea is to have a flag in addEntry to request the acknowledgement not
>> > after the fsync in the journal, but as soon as the data has been
>> > successfully written and flushed to the OS.
>> >
>> > I have the requirement that if an entry requires sync, all the entries
>> > successfully sent 'before' that entry (causality) are synced too, even if
>> > they have been added with the new relaxed durability flag.
>>
>>
>> > Imagine a database transaction log: during a transaction I write every
>> > data change to the WAL with the new flag, and only the commit command
>> > is added with the sync requirement. All the changes within the scope of
>> > a transaction are meaningful only if the transaction is committed, so it
>> > is important that the commit entry is not lost and that, if it is not
>> > lost, none of the other entries of the same transaction are lost either.
>> >
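The causality requirement and the transaction-log example can be modeled with a toy journal. The sketch assumes a single append-only journal file shared by all ledgers, and assumes the worst case on a crash (every entry not yet fsynced is lost); `JournalSketch` and its methods are hypothetical. It shows why one fsync for the commit entry is enough to make the earlier no-sync entries of the same transaction durable.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a bookie journal under the proposed relaxed-durability flag.
// Assumptions: one append-only journal file shared by all ledgers; a no-sync add
// is acknowledged after write + flush to the OS; a sync add forces an fsync, and
// that single fsync makes durable every entry written before it (causality).
// Class and method names are hypothetical.
public class JournalSketch {
    private final List<String> written = new ArrayList<>(); // reached the OS page cache
    private int durablePrefix = 0; // entries [0, durablePrefix) survived an fsync
    private int fsyncCount = 0;

    void add(String entry, boolean sync) {
        written.add(entry); // write + flush to the OS
        if (sync) {
            fsyncCount++;
            durablePrefix = written.size(); // fsync covers everything written so far
        }
    }

    // Simulates a power loss on this bookie, worst case: only fsynced entries remain.
    List<String> afterCrash() {
        return new ArrayList<>(written.subList(0, durablePrefix));
    }

    int fsyncCount() {
        return fsyncCount;
    }

    public static void main(String[] args) {
        JournalSketch journal = new JournalSketch();
        // Transaction T1: data changes without fsync, commit with fsync.
        journal.add("T1 update 1", false);
        journal.add("T1 update 2", false);
        journal.add("T1 commit", true);
        // Transaction T2 never reaches its commit before the crash.
        journal.add("T2 update 1", false);
        System.out.println(journal.fsyncCount()); // 1
        System.out.println(journal.afterCrash()); // [T1 update 1, T1 update 2, T1 commit]
    }
}
```

In this model T1 costs one fsync instead of three and survives the crash as a whole, while the uncommitted T2 change is lost, which is harmless for a WAL. Whether this yields a real gain depends on the journal's existing group commit, for which the thread gives no numbers.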
>>
>> can you do:
>>
>> - lh.asyncAddEntry('entry-1')
>> - lh.asyncAddEntry('entry-2')
>> - lh.addEntry('commit')
>>
>> ?
>>
>
> Yes, currently it is the best we can do, and it is what I am doing.
>
>
>> Does this work for you? If it doesn't, what is the problem? Do you have
>> any performance numbers that show why this doesn't work?
>>
>
> I do not have numbers for this case; in general, limiting the number of
> fsyncs could bring better performance.
> It is hard to tune the grouping settings in the journal.
>
>
>>
>> >
>> > I have another use case. In another project I am storing binary objects
>> > in BK and I have to obtain great performance even on single-disk bookie
>> > layouts (journal + data + index on the same partition). In this project it
>> > is acceptable to compensate for the risk of not doing fsync by requesting
>> > enough replication.
>> > IMHO it would be somewhat like the Kafka idea of durability: as far as I
>> > know, Kafka by default does not force fsync but leaves it all to the OS
>> > and relies on a configurable minimum number of replicas that are
>> > in-sync.
>>
>>
>>
>> When you are talking about Kafka durability, what durability level are you
>> looking for? Are you looking for replication durability without fsync?
>>
>
> Yes, the client waits for acks from a number of brokers, which have not
> necessarily performed an fsync. The risk of data loss is mitigated by
> replication.
>
>
>>
>> >
>> > There are many open points, already raised by Matteo, JV and Sijie:
>> > - LAC protocol?
>> > - replication in case of lost entries?
>> > - under production load, mixing unsynced entries with synced entries
>> > will not give much benefit
>> >
>>
>> A couple of thoughts on this feature:
>>
>> 1) We should always stick to one rule: the LAC should only be advanced on
>> receiving acknowledgement of entries that are persisted on disk after
>> fsync (the write can bypass the journal if necessary). That way all the
>> assumptions about the LAC and replication can remain the same and no
>> change is needed.
>>
>> 2) Separating the acknowledgement of replication from the acknowledgement
>> of fsync (LAC) can achieve 'replicated durability without fsync' while
>> still maintaining the correctness of the LAC. That means:
>>
>> an add request (no-sync) can be completed after receiving enough responses
>> from bookies; however, the response to a (no-sync) add can't advance the
>> LAC. The LAC can only be advanced on acknowledgement of sync adds.
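Rule 2) can be sketched as a client-side tracker that counts the two kinds of acknowledgement separately. This is a simplified illustration, not the BookKeeper client: it assumes no striping (every entry goes to all `writeQuorum` bookies), in-order processing on each bookie, and that a synced ack for entry e means that bookie has fsynced everything up to e. Ensemble changes are ignored. `QuorumLacSketch` and `onAck` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of "replicated durability without fsync" with a separate LAC rule.
// Assumptions (hypothetical, simplified): no striping, so every entry is sent to
// all writeQuorum bookies; each bookie handles entries in id order, and a synced
// ack for entry e means that bookie has fsynced every entry up to e. Ensemble
// changes are ignored. Names are not BookKeeper APIs.
public class QuorumLacSketch {
    private final int ackQuorum;
    private final long[] syncedThrough; // per bookie: highest entry id covered by an fsync
    private final Map<Long, Integer> acks = new HashMap<>();
    private long lastAddConfirmed = -1L;

    QuorumLacSketch(int writeQuorum, int ackQuorum) {
        this.ackQuorum = ackQuorum;
        this.syncedThrough = new long[writeQuorum];
        Arrays.fill(syncedThrough, -1L);
    }

    // Records an ack from one bookie; returns true once the add has ackQuorum
    // acks of either kind and can be completed to the application.
    boolean onAck(int bookie, long entryId, boolean synced) {
        int count = acks.merge(entryId, 1, Integer::sum);
        if (synced) {
            syncedThrough[bookie] = Math.max(syncedThrough[bookie], entryId);
            long[] sorted = syncedThrough.clone();
            Arrays.sort(sorted); // ascending
            // The ackQuorum-th highest value is fsynced on at least ackQuorum bookies.
            lastAddConfirmed = Math.max(lastAddConfirmed, sorted[sorted.length - ackQuorum]);
        }
        return count >= ackQuorum;
    }

    long lastAddConfirmed() {
        return lastAddConfirmed;
    }

    public static void main(String[] args) {
        QuorumLacSketch lh = new QuorumLacSketch(3, 2);
        // Entry 0 (no-sync): completes on two flushed acks, LAC unchanged.
        lh.onAck(0, 0, false);
        System.out.println(lh.onAck(1, 0, false)); // true
        System.out.println(lh.lastAddConfirmed()); // -1
        // Entry 1 (sync): two synced acks advance the LAC past entries 0 and 1.
        lh.onAck(0, 1, true);
        lh.onAck(1, 1, true);
        System.out.println(lh.lastAddConfirmed()); // 1
    }
}
```

Taking the ackQuorum-th highest `syncedThrough` value means every entry up to the LAC is fsynced on at least `ackQuorum` bookies, which is the property rule 1) asks for, while no-sync adds still complete as soon as enough bookies have merely written them.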
>>
>>
>> 3) Request ordering and ensemble changes will make it complicated to
>> ensure correctness. The elegance of the current replication durability
>> with fsync is that you don't rely on request ordering or physical layout
>> to ensure ordering and correctness. However, if you relax durability and
>> mix no-sync adds and sync adds, you have to pay attention to request
>> ordering and flush ordering to ensure correctness, which is going to make
>> things tricky and complicated.
>>
>>
>>
>> >
>> >
>> > For the LAC protocol I think that there is no impact: the point is that
>> > the LastAddConfirmed is the max entry id which is known to have been
>> > acknowledged to the writer, so durability is not a concern. You can lose
>> > entries even with fsync, just by losing all the disks which contain the
>> > data. Without fsync it is just more probable.
>> >
>>
>> I am against relaxing durability for the LAC protocol, because that is
>> the foundation of correctness.
>>
>> I would prefer advancing the LAC only when entries are replicated and
>> durably synced to disk.
>>
>
> Yes. Now I am convinced
>
>>
>>
>>
>> >
>> > Replication: maybe we should write in the ledger metadata that the ledger
>> > allows this feature and deal with it. But I am not sure; I have to better
>> > understand how LedgerHandleAdv deals with sparse entry ids in the
>> > re-replication process.
>> >
>>
>> Replication should not need to change if we stick to the same LAC behavior.
>>
>>
>> >
>> > Mixed workload: honestly, I would like to add this feature to limit the
>> > number of fsyncs, and I expect lots of bursts of unsynced entries
>> > interleaved with a few synced entries. I know that this feature is not
>> > to be encouraged in general, but only for specific cases, like the
>> > stories of LedgerHandleAdv or readUnconfirmedEntries.
>> >
>> > If this makes sense to you I will create a BP and attach a first patch
>> >
>>
>> sure
>>
>>
>> >
>> > Enrico
>> >
>> >
>> >
>> >
>> >
>> > --
>> >
>> >
>> > -- Enrico Olivelli
>> >
>>
> --
>
>
> -- Enrico Olivelli
>
