Thank you, Sijie, for summarizing, and thanks to the community for helping with
this important enhancement to BookKeeper.

I am convinced that as JV pointed out we need to declare at ledger creation
time that the ledger is going to perform no-sync writes.

I think we currently need an explicit declaration at ledger creation time, to
make things "clear" to the developer who is using the LedgerHandle API.

The reason is that we are going to forbid no-sync writes on "striped" ledgers
(ensemble size > quorum size) in the first implementation, and there are two
ways to enforce this:
- one option is to fail at the first no-sync addEntry, but this would be
really uncomfortable because the ack/write/ensemble sizes are usually
configured by the admin, so with some configurations the errors would
come out only after starting the system.
- the second option is to make the developer explicitly enable no-sync
writes at creation time and fail the creation of the ledger if the
requested combination of options is not possible.
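
A minimal sketch of the second option, in Java. The class and method names
are hypothetical (they are not part of the current BookKeeper client API),
and "quorum size" is assumed here to mean the write quorum size:

```java
/**
 * Hypothetical creation-time check; an illustration only,
 * not the actual BookKeeper client code.
 */
final class NoSyncLedgerOptions {

    private NoSyncLedgerOptions() {
    }

    /**
     * Fails fast, at ledger creation, if the requested combination of
     * options is not supported.
     */
    static void validate(int ensembleSize, int writeQuorumSize,
                         int ackQuorumSize, boolean allowNoSyncWrites) {
        // Basic sanity check: ack quorum <= write quorum <= ensemble.
        if (ackQuorumSize > writeQuorumSize || writeQuorumSize > ensembleSize) {
            throw new IllegalArgumentException(
                "expected ackQuorumSize <= writeQuorumSize <= ensembleSize");
        }
        // Assumption: a ledger is "striped" when the ensemble is larger
        // than the write quorum; no-sync writes are rejected in that case.
        if (allowNoSyncWrites && ensembleSize > writeQuorumSize) {
            throw new IllegalArgumentException(
                "no-sync writes are not supported on striped ledgers "
                + "(ensembleSize > writeQuorumSize)");
        }
    }
}
```

With a check like this, a call such as validate(5, 3, 2, true) fails when the
ledger is created, instead of at the first no-sync addEntry.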

I am not sure that the changes to the bookie internals are a client API
matter. Maybe we can leverage custom metadata (as JV said) in order to make
the bookie handle ledgers in a different manner; this option will always
remain open, as custom metadata is already available.

JV preferred the ledger-type approach; the alternative is to introduce a
list of "capabilities" or "ledger options".
I think that the ability to perform no-sync writes is so important that
"custom metadata" is not the right place to declare it, and neither is
"ledger type".

So I am proposing to add a boolean 'allowNoSyncWrites' at ledger creation
time, without writing it to the ledger metadata on ZK.
If further improvements need ledger metadata changes, we will make them
at that point.

I have updated the BP-14 document and added an "Open issues" footer
listing the open points.
Please add comments and I will correct the document as soon as possible.


Enrico




2017-08-30 1:24 GMT+02:00 Sijie Guo <guosi...@gmail.com>:

> Thank you, Enrico, JV.
>
> These are great discussions.
>
> After reading these two proposals, I have a few very high-level comments,
> dividing into three categories.
>
>
> *API*
>
> - I think there are no fundamental differences between these two
> proposals.
> They are trying to achieve similar goals by exposing durability levels in
> different ways.
> So this will be a discussion on what the API/interface should look like from
> the user / admin perspective.
> I would suggest focusing on what the API itself should be, putting the
> implementation design aside when talking about this.
>
> *Core*
>
> - Both proposals need to deal with a core question: what happens to LAC and
> what semantic BookKeeper provides.
> JV did a good summary in his proposal. However, I am not a fan of
> maintaining two different semantics, so I am looking for
> a solution in which BookKeeper maintains only one semantic. The semantic is
> basically:
>
> 1) LAC is only advanced when the entries before LAC are committed to
> persistent storage
> 2) All the entries up to LAC are successfully committed to persistent
> storage
> 3) Entries up to LAC: all the entries must be readable all the time.
>
> If we maintain this semantic, there is no need to change the auto recovery
> protocol in BookKeeper. All we guarantee is that the entries are durably
> persisted.
>
> In order to maintain this semantic, I think both JV and I proposed a similar
> solution in our proposals. I am trying to finalize one here:
>
> * bookie maintains a LAS (Last Add Synced) point for each entry.
> * LAS can be piggybacked on AddResponses
> * Client uses the LAS to advance LAC.
>
> If we can agree on the core semantic we are going to provide, the other
> things are just logistics.
>
> *Others*
>
> - Regarding separating the journal or bypassing the journal, there is no
> difference in terms of the core semantic. They are all non-durable writes
> (acknowledging before fsyncing).
> We can start with the same-journal approach (just acknowledging before
> fsyncing), implement the core and add other options later on.
>
>
> From my point of view, I'd be more interested in providing a single
> consistent durability semantic that applications can rely on for both durable
> writes and non-durable writes. The rest seems to be more a matter of
> logistics.
>
>
> - Sijie
>
>
> On Mon, Aug 28, 2017 at 11:27 PM, Enrico Olivelli <eolive...@gmail.com>
> wrote:
>
> > 2017-08-29 8:01 GMT+02:00 Venkateswara Rao Jujjuri <jujj...@gmail.com>:
> >
> > > I don't believe I fully followed your second case. But even in this case,
> > > your major concern is about the additional 'sync' RPC?
> > >
> >
> > yes apart from that I am fine with your proposal too, that is to have a
> > LedgerType which drives durability
> > and I think we need to add per-entry durability options
> >
> > I think that at least for the 'simple' no-sync addEntry we do not need to
> > change many things, I am drafting a prototype, I will share it as soon as
> > we all agree on the roadmap
> >
> > The first implementation can cover the first cases (no-sync addEntry) and
> > change the way the writer advances the LAC in order to support 'relaxed
> > durability writes'.
> > This change will be compatible with future improvements and it will open
> > the door for big changes on the bookie side like bypassing the journal or
> > leveraging multiple journals.....
> >
> > -- Enrico
> >
> > or something else that the LedgerType proposal won't work?
> > >
> >
> > >
> > >
> > > On Mon, Aug 28, 2017 at 7:35 AM, Enrico Olivelli <eolive...@gmail.com>
> > > wrote:
> > >
> > > > I think that having a set of options on the ledger metadata will be a good
> > > > enhancement and I am sure we will do it as soon as it is needed; maybe
> > > > we do not need it now.
> > > >
> > > > Actually I think we will need to declare this durability level at entry
> > > > level to support some use cases in the BP-14 document. Let me explain two of
> > > > my use cases for which I need it:
> > > >
> > > > At a high level we have two choices:
> > > >
> > > > A) per-ledger durability options (JV proposal)
> > > > all addEntry operations are durable or non-durable and there is an explicit
> > > > 'sync' API (+ forced sync at close)
> > > >
> > > > B) per-entry durability options (original BP-14 proposal)
> > > > every addEntry has its own durable/non-durable option (sync/no-sync), with
> > > > the ability to call 'sync' without addEntry (+ forced sync at close)
> > > >
> > > > I am speaking about the database WAL case. I am using the ledger as a
> > > > segment for the WAL of a database and I am writing all data changes in the
> > > > scope of a 'transaction' with the relaxed-durability flag, then I am
> > > > writing the 'transaction committed' entry with the "strict durability"
> > > > requirement. This will in fact require that all previous entries are
> > > > persisted durably, so that the transaction will never be lost.
> > > >
> > > > In this scenario we would need an addEntry + sync API in fact:
> > > >
> > > > using option  A) the WAL will look like:
> > > > - open ledger no-sync = true
> > > > - addEntry (set foo=bar)  (this will be no-sync)
> > > > - addEntry (set foo=bar2) (this will be no-sync)
> > > > - addEntry (commit)
> > > > - sync
> > > >
> > > > using option B) the WAL will look like
> > > > - open ledger
> > > > - addEntry (set foo=bar), no-sync
> > > > - addEntry (set foo=bar2), no-sync
> > > > - addEntry (commit), sync
> > > >
> > > > in case B) we are "saving" one RPC call to every bookie (the 'sync' one).
> > > > The same holds for single data change entries, like updating a single record
> > > > on the database: with BK 4.5 this "costs" only a single RPC to every bookie
> > > >
> > > > Second case:
> > > > I am using BookKeeper to store binary objects, so I am packing several
> > > > 'objects' (named sequences of bytes) into a single ledger, like you do when
> > > > you write many records to a file in a streaming fashion and keep track of
> > > > the offsets of the beginning of every record (LedgerHandleAdv is perfect for
> > > > this case).
> > > > I am not using a single ledger per 'file' because creating many ledgers very
> > > > fast kills ZooKeeper. In my systems I have big bursts of writes,
> > > > which need to be really "fast", so I am writing multiple 'files' to every
> > > > single ledger. So the close-to-open consistency at ledger level is not
> > > > suitable for this case.
> > > > I have to write as fast as possible to this 'ledger-backed' stream, and as
> > > > with a 'traditional' filesystem I am writing parts of each file and then
> > > > requiring 'sync' at the end of each file.
> > > > Using BookKeeper you need to split big 'files' into "little" parts; you
> > > > cannot transmit the contents as a "real" stream over the network.
> > > >
> > > > I am not talking about bookie-level implementation details. I would like to
> > > > define the high-level API in order to support all the relevant known use
> > > > cases and leave room for the future.
> > > > At this moment adding a per-entry 'durability option' seems to be very
> > > > flexible and simple to implement, and it does not prevent us from making
> > > > further improvements, such as skipping the journal.
> > > >
> > > > Enrico
> > > >
> > > >
> > > >
> > > > 2017-08-26 19:55 GMT+02:00 Enrico Olivelli <eolive...@gmail.com>:
> > > >
> > > > >
> > > > >
> > > > > On Sat, Aug 26, 2017, 19:19 Venkateswara Rao Jujjuri <jujj...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Hi all,
> > > > >>
> > > > >> As promised during Thursday call, here is my proposal.
> > > > >>
> > > > >> *NOTE*: Major difference in this proposal compared to Enrico’s
> > > > >> <https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit#heading=h.q2rewiqndr5v>
> > > > >> is making the durability a property of the ledger(type) as opposed to
> > > > >> addEntry(). Rest of the technical details have a lot of similarities.
> > > > >>
> > > > >
> > > > > Thank you JV. I have just quickly read the doc and your view is certainly
> > > > > broader.
> > > > > I will dig into the doc as soon as possible on Monday.
> > > > > For me it is OK to have a ledger-wide configuration. I think that the most
> > > > > important decision is about the API we will provide, as in the future it
> > > > > will be difficult to change it.
> > > > >
> > > > >
> > > > > Cheers
> > > > > Enrico
> > > > >
> > > > >
> > > > >
> > > > >> https://docs.google.com/document/d/1g1eBcVVCZrTG8YZliZP0LVqvWpq432ODEghrGVQ4d4Q/edit?usp=sharing
> > > > >>
> > > > >> On Thu, Aug 24, 2017 at 1:14 AM, Enrico Olivelli <eolive...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Thank you all for the comments and for taking a look at the document so
> > > > >> > soon.
> > > > >> > I have updated the doc; we will discuss the document at the meeting.
> > > > >> >
> > > > >> >
> > > > >> > Enrico
> > > > >> >
> > > > >> > 2017-08-24 2:27 GMT+02:00 Sijie Guo <guosi...@gmail.com>:
> > > > >> >
> > > > >> > > Enrico,
> > > > >> > >
> > > > >> > > Thank you so much! It is a great effort for putting this up. Overall looks
> > > > >> > > good. I made some comments, we can discuss at tomorrow's community meeting.
> > > > >> > >
> > > > >> > > - Sijie
> > > > >> > >
> > > > >> > > On Wed, Aug 23, 2017 at 8:25 AM, Enrico Olivelli <eolive...@gmail.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Hi all,
> > > > >> > > > I have drafted a first proposal for BP-14 - Relax Durability
> > > > >> > > >
> > > > >> > > > We are talking about limiting the number of fsyncs to the journal while
> > > > >> > > > preserving the correctness of the LAC protocol.
> > > > >> > > >
> > > > >> > > > This is the link to the wiki page, but as the issue is huge we prefer to
> > > > >> > > > use Google Documents for sharing comments
> > > > >> > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP+-+14+Relax+durability
> > > > >> > > >
> > > > >> > > > This is the document
> > > > >> > > > https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit?usp=sharing
> > > > >> > > >
> > > > >> > > > All comments are welcome
> > > > >> > > >
> > > > >> > > > I have added the DL dev list in cc as the discussion is interesting for both
> > > > >> > > > groups
> > > > >> > > >
> > > > >> > > > Enrico Olivelli
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Jvrao
> > > > >> ---
> > > > >> First they ignore you, then they laugh at you, then they fight you, then
> > > > >> you win. - Mahatma Gandhi
> > > > >>
> > > > > --
> > > > >
> > > > >
> > > > > -- Enrico Olivelli
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Jvrao
> > > ---
> > > First they ignore you, then they laugh at you, then they fight you, then
> > > you win. - Mahatma Gandhi
> > >
> >
>
