Hi all,

You can find the revised proposal here
https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-14+Relax+durability

The link to the document open for comments is this:
https://docs.google.com/document/d/1yNi9t2_deOOMXDaGzrnmaHTQeB3B3Fnym82DUERH7LM/edit?usp=sharing

Please check it out.
We are going to review this proposal at the meeting.

-- Enrico


2017-08-30 8:56 GMT+02:00 Enrico Olivelli <eolive...@gmail.com>:

> Thank you, Sijie, for summarizing, and thanks to the community for helping
> with this important enhancement to BookKeeper.
>
> I am convinced that, as JV pointed out, we need to declare at ledger
> creation time that the ledger is going to perform no-sync writes.
>
> I think we currently need an explicit declaration to make things "clear"
> to the developer who is using the LedgerHandle API, even at ledger
> creation time.
>
> The case is that we are going to forbid "striping" ledgers (ensemble size >
> quorum size) for no-sync writes in the first implementation:
> - one option is to fail at the first no-sync addEntry, but this will be
> really uncomfortable because usually the ack/write/ensemble sizes are
> configured by the admin, and there will be configurations in which errors
> come out only after starting the system.
> - the second option is to make the developer explicitly enable no-sync
> writes at creation time and fail the creation of the ledger if the
> requested combination of options is not possible
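
The second option can be illustrated with a minimal sketch. It assumes that "quorum size" above means the write quorum size, and the class and method names (NoSyncCreationCheck, check) are hypothetical, not part of the BookKeeper client API. It only shows the idea of rejecting the combination when the ledger is created instead of at the first no-sync addEntry.

```java
/** Hypothetical helper; not part of the BookKeeper client API. */
final class NoSyncCreationCheck {

    private NoSyncCreationCheck() {
    }

    /**
     * Validates the requested options before a ledger would be created.
     * Assumption: "quorum size" in the message means the write quorum size.
     */
    static void check(int ensembleSize, int writeQuorumSize, boolean allowNoSyncWrites) {
        boolean striped = ensembleSize > writeQuorumSize;
        if (allowNoSyncWrites && striped) {
            throw new IllegalArgumentException(
                    "no-sync writes are not allowed on striped ledgers: ensembleSize="
                            + ensembleSize + " > writeQuorumSize=" + writeQuorumSize);
        }
    }

    public static void main(String[] args) {
        check(3, 3, true);   // accepted: every bookie stores every entry
        check(5, 3, false);  // accepted: striping without no-sync writes
        try {
            check(5, 3, true);
        } catch (IllegalArgumentException e) {
            System.out.println("creation rejected: " + e.getMessage());
        }
    }
}
```

With a check of this kind, a misconfigured ensemble/quorum combination would surface when the ledger is created rather than after the system has started writing.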
>
> I am not sure that the changes to the bookie internals are a client-API
> matter. Maybe we can leverage custom metadata (as JV said) to make the
> bookie handle ledgers in a different manner; this route will always be
> open, as custom metadata is already available.
>
> JV preferred the ledger-type approach; the dual solution is to introduce a
> list of "capabilities" or "ledger options".
> I think that this ability to perform no-sync writes is so important that
> "custom metadata" is not the right place to declare it, and the same goes
> for "ledger type".
>
> So I am proposing to add a boolean "allowNoSyncWrites" at ledger creation
> time, without writing it into the ledger metadata on ZK.
> If further improvements need ledger metadata changes, we will make them
> then.
>
> I have updated the BP-14 document and added an "Open issues" footer
> with the open points.
> Please add comments and I will correct the document as soon as possible.
>
>
> Enrico
>
>
>
>
> 2017-08-30 1:24 GMT+02:00 Sijie Guo <guosi...@gmail.com>:
>
>> Thank you, Enrico, JV.
>>
>> These are great discussions.
>>
>> After reading these two proposals, I have a few very high-level comments,
>> divided into three categories.
>>
>>
>> *API*
>>
>> - I think there are no fundamental differences between these two
>> proposals.
>> They are trying to achieve similar goals by exposing durability levels in
>> different ways.
>> So this will be a discussion on what the API/interface should look like
>> from the user / admin perspective.
>> I would suggest focusing on what the API itself would be, putting the
>> implementation design aside when talking about this.
>>
>> *Core*
>>
>> - Both proposals need to deal with a core function: what happens to the
>> LAC and what semantic BookKeeper provides.
>> JV did a good summary in his proposal. However, I am not a fan of
>> maintaining two different semantics, so I am looking for
>> a solution in which BookKeeper maintains only one semantic. The semantic is
>> basically:
>>
>> 1) The LAC is only advanced when the entries before the LAC are committed
>> to persistent storage
>> 2) All the entries up to the LAC are successfully committed to persistent
>> storage
>> 3) Entries up to the LAC: all of them must be readable all the time.
>>
>> If we maintain such a semantic, there is no need to change the auto
>> recovery protocol in BookKeeper. All we guarantee is that the entries are
>> durably persisted.
>>
>> In order to maintain such a semantic, I think both JV and I proposed a
>> similar solution in our respective proposals. I am trying to finalize one
>> here:
>>
>> * bookie maintains a LAS (Last Add Synced) point for each entry.
>> * LAS can be piggybacked on AddResponses
>> * Client uses the LAS to advance LAC.
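
The three points above can be sketched on the client side as follows. This is only an illustration under stated assumptions: the ledger is not striped (every bookie of the ensemble stores every entry), each AddResponse carries a single LAS value for the ledger, and the LAC may move to the highest entry id that at least an ack quorum of bookies reports as synced. The class LacTracker and its method names are hypothetical, not existing BookKeeper code.

```java
import java.util.Arrays;

/** Hypothetical client-side tracker; not existing BookKeeper code. */
final class LacTracker {
    private final long[] lastAddSynced; // latest LAS reported by each bookie, -1 = none yet
    private final int ackQuorumSize;
    private long lastAddConfirmed = -1L;

    LacTracker(int ensembleSize, int ackQuorumSize) {
        if (ackQuorumSize < 1 || ackQuorumSize > ensembleSize) {
            throw new IllegalArgumentException("invalid ack quorum size");
        }
        this.lastAddSynced = new long[ensembleSize];
        Arrays.fill(this.lastAddSynced, -1L);
        this.ackQuorumSize = ackQuorumSize;
    }

    /** Records the LAS piggybacked on an AddResponse and returns the resulting LAC. */
    long onAddResponse(int bookieIndex, long las) {
        // A late or reordered response must never move a bookie's LAS backwards.
        lastAddSynced[bookieIndex] = Math.max(lastAddSynced[bookieIndex], las);

        long[] sorted = lastAddSynced.clone();
        Arrays.sort(sorted); // ascending order
        // At least ackQuorumSize bookies have synced up to this entry id.
        long candidate = sorted[sorted.length - ackQuorumSize];

        lastAddConfirmed = Math.max(lastAddConfirmed, candidate);
        return lastAddConfirmed;
    }

    long getLastAddConfirmed() {
        return lastAddConfirmed;
    }
}
```

In this sketch the LAC never moves past an entry that fewer than an ack quorum of bookies have fsynced, which is how properties 1) and 2) would be preserved for no-sync writes.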
>>
>> If we can agree on the core semantic we are going to provide, the other
>> things are just logistics.
>>
>> *Others*
>>
>> - Regarding separating the journal or bypassing the journal, there is no
>> difference from the point of view of the core semantic. They are all
>> non-durable writes (acknowledging before fsyncing).
>> We can start with the same-journal approach (but just acknowledge before
>> fsyncing), implement the core and add other options later on.
>>
>>
>> From my point of view, I'd be more interested in providing a single
>> consistent durability semantic that applications can rely on for both
>> durable writes and non-durable writes. The other things seem to be more
>> about logistics.
>>
>>
>> - Sijie
>>
>>
>> On Mon, Aug 28, 2017 at 11:27 PM, Enrico Olivelli <eolive...@gmail.com>
>> wrote:
>>
>> > 2017-08-29 8:01 GMT+02:00 Venkateswara Rao Jujjuri <jujj...@gmail.com>:
>> >
>> > > I don't believe I fully followed your second case. But even in this
>> > > case, your major concern is about the additional 'sync' RPC?
>> > >
>> >
>> > Yes, apart from that I am fine with your proposal too, that is, to have a
>> > LedgerType which drives durability,
>> > and I think we need to add per-entry durability options.
>> >
>> > I think that at least for the 'simple' no-sync addEntry we do not need to
>> > change many things. I am drafting a prototype and I will share it as soon
>> > as we all agree on the roadmap.
>> >
>> > The first implementation can cover the first cases (no-sync addEntry) and
>> > change the way the writer advances the LAC in order to support 'relaxed
>> > durability writes'.
>> > This change will be compatible with future improvements and it will open
>> > the door for big changes on the bookie side like bypassing the journal or
>> > leveraging multiple journals...
>> >
>> > -- Enrico
>> >
>> > > or something else for which the LedgerType proposal won't work?
>> > >
>> >
>> > >
>> > >
>> > > On Mon, Aug 28, 2017 at 7:35 AM, Enrico Olivelli <eolive...@gmail.com>
>> > > wrote:
>> > >
>> > > > I think that having a set of options on the ledger metadata will be a
>> > > > good enhancement and I am sure we will do it as soon as it is needed;
>> > > > maybe we do not need it now.
>> > > >
>> > > > Actually I think we will need to declare this durability level at
>> > > > entry level to support some use cases in the BP-14 document. Let me
>> > > > explain two of my use cases for which I need it:
>> > > >
>> > > > At a higher level we have two choices:
>> > > >
>> > > > A) per-ledger durability options (JV proposal)
>> > > > all addEntry operations are durable or non-durable and there is an
>> > > > explicit 'sync' API (+ forced sync at close)
>> > > >
>> > > > B) per-entry durability options (original BP-14 proposal)
>> > > > every addEntry has its own durable/non-durable option (sync/no-sync),
>> > > > with the ability to call 'sync' without addEntry (+ forced sync at close)
>> > > >
>> > > > I am speaking about the database WAL case: I am using the ledger as a
>> > > > segment for the WAL of a database and I am writing all data changes in
>> > > > the scope of a 'transaction' with the relaxed-durability flag, then I
>> > > > am writing the 'transaction committed' entry with a "strict
>> > > > durability" requirement. This will in fact require that all previous
>> > > > entries are persisted durably, so that the transaction will never be
>> > > > lost.
>> > > >
>> > > > In this scenario we would in fact need an addEntry + sync API:
>> > > >
>> > > > using option A) the WAL will look like:
>> > > > - open ledger no-sync = true
>> > > > - addEntry (set foo=bar)  (this will be no-sync)
>> > > > - addEntry (set foo=bar2) (this will be no-sync)
>> > > > - addEntry (commit)
>> > > > - sync
>> > > >
>> > > > using option B) the WAL will look like
>> > > > - open ledger
>> > > > - addEntry (set foo=bar), no-sync
>> > > > - addEntry (set foo=bar2), no-sync
>> > > > - addEntry (commit), sync
>> > > >
>> > > > In case B) we are "saving" one RPC call to every bookie (the 'sync'
>> > > > one). The same holds for single data-change entries, like updating a
>> > > > single record on the database: with BK 4.5 this "costs" only a single
>> > > > RPC to every bookie.
>> > > > Second case:
>> > > > I am using BookKeeper to store binary objects, so I am packing several
>> > > > 'objects' (named sequences of bytes) into a single ledger, like you do
>> > > > when you write many records to a file in a streaming fashion and keep
>> > > > track of the offsets of the beginning of every record (LedgerHandleAdv
>> > > > is perfect for this case).
>> > > > I am not using a single ledger per 'file' because creating many ledgers
>> > > > very fast kills ZooKeeper. In my systems I have big bursts of writes,
>> > > > which need to be really "fast", so I am writing multiple 'files' to
>> > > > every single ledger. So the close-to-open consistency at ledger level
>> > > > is not suitable for this case.
>> > > > I have to write as fast as possible to this 'ledger-backed' stream,
>> > > > and, as with a 'traditional' filesystem, I am writing parts of each
>> > > > file and then requiring 'sync' at the end of each file.
>> > > > Using BookKeeper you need to split big 'files' into "little" parts; you
>> > > > cannot transmit the contents as a "real" stream over the network.
>> > > >
>> > > > I am not talking about bookie-level implementation details. I would
>> > > > like to define the high-level API in order to support all the relevant
>> > > > known use cases and leave room for the future.
>> > > > At this moment adding a per-entry 'durability option' seems to be very
>> > > > flexible and simple to implement, and it does not prevent us from
>> > > > making further improvements, such as skipping the journal.
>> > > >
>> > > > Enrico
>> > > >
>> > > >
>> > > >
>> > > > 2017-08-26 19:55 GMT+02:00 Enrico Olivelli <eolive...@gmail.com>:
>> > > >
>> > > > >
>> > > > >
>> > > > > On Sat, 26 Aug 2017, 19:19 Venkateswara Rao Jujjuri <jujj...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > >> Hi all,
>> > > > >>
>> > > > >> As promised during Thursday's call, here is my proposal.
>> > > > >>
>> > > > >> *NOTE*: The major difference in this proposal compared to Enrico’s
>> > > > >> <https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit#heading=h.q2rewiqndr5v>
>> > > > >> is making the durability a property of the ledger (type) as opposed
>> > > > >> to addEntry(). The rest of the technical details have a lot of
>> > > > >> similarities.
>> > > > >>
>> > > > >
>> > > > > Thank you JV. I have just read the doc quickly and your view is
>> > > > > certainly broader.
>> > > > > I will dig into the doc as soon as possible on Monday.
>> > > > > For me it is OK to have a ledger-wide configuration. I think that the
>> > > > > most important decision is about the API we will provide, as it will
>> > > > > be difficult to change it in the future.
>> > > > >
>> > > > >
>> > > > > Cheers
>> > > > > Enrico
>> > > > >
>> > > > >
>> > > > >
>> > > > >> https://docs.google.com/document/d/1g1eBcVVCZrTG8YZliZP0LVqvWpq432ODEghrGVQ4d4Q/edit?usp=sharing
>> > > > >>
>> > > > >> On Thu, Aug 24, 2017 at 1:14 AM, Enrico Olivelli <eolive...@gmail.com>
>> > > > >> wrote:
>> > > > >>
>> > > > >> > Thank you all for the comments and for taking a look at the
>> > > > >> > document so soon.
>> > > > >> > I have updated the doc; we will discuss the document at the meeting.
>> > > > >> >
>> > > > >> >
>> > > > >> > Enrico
>> > > > >> >
>> > > > >> > 2017-08-24 2:27 GMT+02:00 Sijie Guo <guosi...@gmail.com>:
>> > > > >> >
>> > > > >> > > Enrico,
>> > > > >> > >
>> > > > >> > > Thank you so much! It is a great effort to put this together.
>> > > > >> > > Overall it looks good. I made some comments; we can discuss them
>> > > > >> > > at tomorrow's community meeting.
>> > > > >> > >
>> > > > >> > > - Sijie
>> > > > >> > >
>> > > > >> > > On Wed, Aug 23, 2017 at 8:25 AM, Enrico Olivelli <eolive...@gmail.com>
>> > > > >> > > wrote:
>> > > > >> > >
>> > > > >> > > > Hi all,
>> > > > >> > > > I have drafted a first proposal for BP-14 - Relax Durability
>> > > > >> > > >
>> > > > >> > > > We are talking about limiting the number of fsyncs to the
>> > > > >> > > > journal while preserving the correctness of the LAC protocol.
>> > > > >> > > >
>> > > > >> > > > This is the link to the wiki page, but as the issue is huge we
>> > > > >> > > > prefer to use Google Documents for sharing comments
>> > > > >> > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP+-+14+Relax+durability
>> > > > >> > > >
>> > > > >> > > > This is the document
>> > > > >> > > > https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit?usp=sharing
>> > > > >> > > >
>> > > > >> > > > All comments are welcome
>> > > > >> > > >
>> > > > >> > > > I have added the DL dev list in cc as the discussion is
>> > > > >> > > > interesting for both groups
>> > > > >> > > >
>> > > > >> > > > Enrico Olivelli
>> > > > >> > > >
>> > > > >> > >
>> > > > >> >
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> --
>> > > > >> Jvrao
>> > > > >> ---
>> > > > >> First they ignore you, then they laugh at you, then they fight you,
>> > > > >> then you win. - Mahatma Gandhi
>> > > > >>
>> > > > > --
>> > > > >
>> > > > >
>> > > > > -- Enrico Olivelli
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Jvrao
>> > > ---
>> > > First they ignore you, then they laugh at you, then they fight you,
>> > > then you win. - Mahatma Gandhi
>> > >
>> >
>>
>
>
