Hi all,
You can find the revised proposal here:
https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-14+Relax+durability

The link to the document open for comments is this:
https://docs.google.com/document/d/1yNi9t2_deOOMXDaGzrnmaHTQeB3B3Fnym82DUERH7LM/edit?usp=sharing

Please check it out. We are going to review this proposal at the meeting.

-- Enrico

2017-08-30 8:56 GMT+02:00 Enrico Olivelli <eolive...@gmail.com>:

> Thank you Sijie for summarizing and thanks to the community for helping in this important enhancement to BookKeeper
>
> I am convinced that, as JV pointed out, we need to declare at ledger creation time that the ledger is going to perform no-sync writes.
>
> I think we currently need an explicit declaration to make things "clear" to the developer who is using the LedgerHandle API, even at ledger creation time.
>
> The case is that we are going to forbid "striping" ledgers (ensemble size > quorum size) for no-sync writes in the first implementation:
> - one option is to fail at the first no-sync addEntry, but this will be really uncomfortable because usually the ack/write/ensemble sizes are configured by the admin, and there will be configurations in which errors will come out only after starting the system.
> - the second option is to make the developer explicitly enable no-sync writes at creation time and fail the creation of the ledger if the requested combination of options is not possible
>
> I am not sure that the changes to the bookie internals are a Client-API matter, maybe we can leverage custom metadata (as JV said) in order to make the bookie handle ledgers in a different manner, this way will always be open as custom metadata are already here.
>
> JV preferred the ledger-type approach, the dual solution is to introduce a list of "capabilities" or "ledger options".
> I think that this ability to perform no-sync writes is so important that "custom metadata" is not the right place to declare it, same for "ledger type"
>
> So I am proposing to add a boolean 'allowNoSyncWrites' at ledger creation time, without writing it to the ledger metadata on ZK,
> I think that if further improvements need ledger metadata changes we will make them.
>
> I have updated the BP-14 document, I have added an "Open issues" footer with the open points,
> please add comments and I will correct the document as soon as possible.
>
> Enrico
>
> 2017-08-30 1:24 GMT+02:00 Sijie Guo <guosi...@gmail.com>:
>
>> Thank you, Enrico, JV.
>>
>> These are great discussions.
>>
>> After reading these two proposals, I have a few very high-level comments, divided into three categories.
>>
>> *API*
>>
>> - I think there are no fundamental differences between these two proposals. They are trying to achieve similar goals by exposing durability levels in different ways. So this will be a discussion on what the API/interface should look like from the user / admin perspective. I would suggest focusing on what the API itself would be, putting the implementation design aside when talking about this.
>>
>> *Core*
>>
>> - Both proposals need to deal with a core function - what happens to the LAC and what semantic bookkeeper provides. JV did a good summary in his proposal. However I am not a fan of maintaining two different semantics. So I am looking for a solution in which bookkeeper maintains only one semantic. The semantic is basically:
>>
>> 1) LAC only advances when entries before LAC are committed to the persistent storage
>> 2) All the entries until LAC are successfully committed to the persistent storage
>> 3) Entries until LAC: all the entries must be readable all the time.
>>
>> If we maintain such semantic, there is no need to change the auto recovery protocol in bookkeeper.
>> All that we guarantee is that the entries are durably persisted.
>>
>> In order to maintain such semantic, I think both JV and I proposed a similar solution in our proposals. I am trying to finalize one here:
>>
>> * bookie maintains a LAS (Last Add Synced) point for each entry.
>> * LAS can be piggybacked on AddResponses
>> * Client uses the LAS to advance LAC.
>>
>> If we can agree on the core semantic we are going to provide, the other things are just logistics.
>>
>> *Others*
>>
>> - Regarding separating the journal or bypassing the journal, there is no difference when we are talking about the core semantic. They are all non-durable writes (acknowledging before fsyncing). We can start with the same journal approach (but just acknowledge before fsyncing), implement the core and add other options later on.
>>
>> From my point of view, I'd be more interested in providing a single consistent durable semantic that applications can rely on for both durable writes and non-durable writes. The other things seem to be more about logistics.
>>
>> - Sijie
>>
>> On Mon, Aug 28, 2017 at 11:27 PM, Enrico Olivelli <eolive...@gmail.com> wrote:
>>
>> > 2017-08-29 8:01 GMT+02:00 Venkateswara Rao Jujjuri <jujj...@gmail.com>:
>> >
>> > > I don't believe I fully followed your second case. But even in this case, your major concern is about the additional 'sync' RPC?
>> > >
>> >
>> > yes, apart from that I am fine with your proposal too, that is to have a LedgerType which drives durability
>> > and I think we need to add per-entry durability options
>> >
>> > I think that at least for the 'simple' no-sync addEntry we do not need to change many things, I am drafting a prototype, I will share it as soon as we all agree on the roadmap
>> >
>> > The first implementation can cover the first cases (no-sync addEntry) and change the way the writer advances the LAC in order to support 'relaxed durability writes'.
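
The LAS idea summarized in the quoted messages above (bookie reports a Last Add Synced point, the client uses it to advance the LAC) could be modelled roughly as follows. This is only a sketch under assumptions that are not in the thread: one LAS value per bookie per ledger, no striping (every entry goes to every bookie of the ensemble), and a bookie's LAS meaning "all entries up to this id are fsynced". `LacTracker`, `on_add_response` and the bookie names are hypothetical and are not BookKeeper APIs.

```python
from dataclasses import dataclass, field


@dataclass
class LacTracker:
    """Hypothetical client-side view of one ledger written without striping."""
    ack_quorum: int
    # Highest LAS reported so far by each bookie (bookie id -> entry id).
    las_by_bookie: dict = field(default_factory=dict)
    lac: int = -1  # -1 means no entry is confirmed yet

    def on_add_response(self, bookie: str, las: int) -> int:
        # A bookie's LAS never moves backwards, so stale responses are ignored.
        prev = self.las_by_bookie.get(bookie, -1)
        self.las_by_bookie[bookie] = max(prev, las)
        # Sorted descending, the ack_quorum-th value is an entry id that at
        # least ack_quorum bookies have already synced.
        synced = sorted(self.las_by_bookie.values(), reverse=True)
        if len(synced) >= self.ack_quorum:
            self.lac = max(self.lac, synced[self.ack_quorum - 1])
        return self.lac


tracker = LacTracker(ack_quorum=2)
tracker.on_add_response("bookie-1", 5)  # only one bookie has synced, LAC stays -1
tracker.on_add_response("bookie-2", 3)  # entries up to 3 are synced on two bookies
```

With this rule the LAC never passes an entry that fewer than `ack_quorum` bookies have synced, which is what keeps the single "everything up to LAC is durable" semantic.
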
>> > This change will be compatible with future improvements and it will open the door for big changes on the bookie side like bypassing the journal or leveraging multiple journals.....
>> >
>> > -- Enrico
>> >
>> > or something else that the LedgerType proposal won't work?
>> > >
>> > >
>> > > On Mon, Aug 28, 2017 at 7:35 AM, Enrico Olivelli <eolive...@gmail.com> wrote:
>> > >
>> > > > I think that having a set of options on the ledger metadata will be a good enhancement and I am sure we will do it as soon as it is needed, maybe we do not need it now.
>> > > >
>> > > > Actually I think we will need to declare this durability level at entry level to support some use cases in the BP-14 document, let me explain two of my use cases for which I need it:
>> > > >
>> > > > At a higher level we have two choices:
>> > > >
>> > > > A) per-ledger durability options (JV proposal)
>> > > > all addEntry operations are durable or non-durable and there is an explicit 'sync' API (+ forced sync at close)
>> > > >
>> > > > B) per-entry durability options (original BP-14 proposal)
>> > > > every addEntry has its own durable/non-durable option (sync/no-sync), with the ability to call 'sync' without addEntry (+ forced sync at close)
>> > > >
>> > > > I am speaking about the database WAL case, I am using the ledger as a segment for the WAL of a database and I am writing all data changes in the scope of a 'transaction' with the relaxed-durability flag, then I am writing the 'transaction committed' entry with the "strict durability" requirement, this will in fact require that all previous entries are persisted durably, so that the transaction will never be lost.
>> > > >
>> > > > In this scenario we would need an addEntry + sync API in fact:
>> > > >
>> > > > using option A) the WAL will look like:
>> > > > - open ledger no-sync = true
>> > > > - addEntry (set foo=bar) (this will be no-sync)
>> > > > - addEntry (set foo=bar2) (this will be no-sync)
>> > > > - addEntry (commit)
>> > > > - sync
>> > > >
>> > > > using option B) the WAL will look like:
>> > > > - open ledger
>> > > > - addEntry (set foo=bar), no-sync
>> > > > - addEntry (set foo=bar2), no-sync
>> > > > - addEntry (commit), sync
>> > > >
>> > > > in case B) we are "saving" one RPC call to every bookie (the 'sync' one)
>> > > > same for single data change entries, like updating a single record on the database, this with BK 4.5 "costs" only a single RPC to every bookie
>> > > >
>> > > > Second case:
>> > > > I am using BookKeeper to store binary objects, so I am packing more 'objects' (named sequences of bytes) into a single ledger, like you do when you write many records to a file in a streaming fashion and keep track of offsets of the beginning of every record (LedgerHandleAdv is perfect for this case).
>> > > > I am not using a single ledger per 'file' because it kills zookeeper to create many ledgers very fast, in my systems I have big bursts of writes, which need to be really "fast", so I am writing multiple 'files' to every single ledger. So the close-to-open consistency at ledger level is not suitable for this case.
>> > > > I have to write as fast as possible to this 'ledger-backed' stream, and as with a 'traditional' filesystem I am writing parts of each file and then requiring 'sync' at the end of each file.
>> > > > Using BookKeeper you need to split big 'files' into "little" parts, you cannot transmit the contents as a "real" stream over the network.
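
The two WAL sequences listed above for options A and B can be compared with a small model that counts requests per bookie. It is a sketch, not BookKeeper code: `Op`, `rpcs_per_bookie` and `ends_durable` are hypothetical names, opening and closing the ledger are left out, and it assumes that every addEntry and every explicit sync costs exactly one request to each bookie.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str           # "add" or "sync"
    payload: str = ""
    sync: bool = False  # per-entry durability flag, used only by option B


def rpcs_per_bookie(ops):
    # Assumption: each addEntry and each explicit sync is one request per bookie.
    return len(ops)


def ends_durable(ops):
    # The sequence is durable if it ends with an explicit sync
    # or with an addEntry that asks for sync.
    last = ops[-1]
    return last.kind == "sync" or (last.kind == "add" and last.sync)


# Option A: per-ledger no-sync writes plus an explicit sync call.
option_a = [
    Op("add", "set foo=bar"),
    Op("add", "set foo=bar2"),
    Op("add", "commit"),
    Op("sync"),
]

# Option B: per-entry flag, the commit entry itself carries the sync.
option_b = [
    Op("add", "set foo=bar"),
    Op("add", "set foo=bar2"),
    Op("add", "commit", sync=True),
]

print(rpcs_per_bookie(option_a), rpcs_per_bookie(option_b))  # prints "4 3"
```

Under these assumptions both sequences end durable, and option B needs one request fewer per bookie for every transaction; for a single-record update the gap is two requests versus one.
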
>> > > >
>> > > > I am not talking about bookie level implementation details, I would like to define the high level API in order to support all the relevant known use cases and keep space for the future,
>> > > > at this moment adding a per-entry 'durability option' seems to be very flexible and simple to implement, it does not prevent us from doing further improvements, namely skipping the journal.
>> > > >
>> > > > Enrico
>> > > >
>> > > > 2017-08-26 19:55 GMT+02:00 Enrico Olivelli <eolive...@gmail.com>:
>> > > >
>> > > > > On Sat 26 Aug 2017, 19:19 Venkateswara Rao Jujjuri <jujj...@gmail.com> wrote:
>> > > > >
>> > > > >> Hi all,
>> > > > >>
>> > > > >> As promised during Thursday call, here is my proposal.
>> > > > >>
>> > > > >> *NOTE*: Major difference in this proposal compared to Enrico’s
>> > > > >> <https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit#heading=h.q2rewiqndr5v>
>> > > > >> is making the durability a property of the ledger(type) as opposed to addEntry(). Rest of the technical details have a lot of similarities.
>> > > > >>
>> > > > >
>> > > > > Thank you JV. I have just quickly read the doc and your view is certainly broader.
>> > > > > I will dig into the doc as soon as possible on Monday.
>> > > > > For me it is ok to have a ledger-wide configuration, I think that the most important decision is about the API we will provide, as in the future it will be difficult to change it.
>> > > > >
>> > > > > Cheers
>> > > > > Enrico
>> > > > >
>> > > > >> https://docs.google.com/document/d/1g1eBcVVCZrTG8YZliZP0LVqvWpq432ODEghrGVQ4d4Q/edit?usp=sharing
>> > > > >>
>> > > > >> On Thu, Aug 24, 2017 at 1:14 AM, Enrico Olivelli <eolive...@gmail.com> wrote:
>> > > > >>
>> > > > >> > Thank you all for the comments and for taking a look to the document so soon.
>> > > > >> > I have updated the doc, we will discuss the document at the meeting,
>> > > > >> >
>> > > > >> > Enrico
>> > > > >> >
>> > > > >> > 2017-08-24 2:27 GMT+02:00 Sijie Guo <guosi...@gmail.com>:
>> > > > >> >
>> > > > >> > > Enrico,
>> > > > >> > >
>> > > > >> > > Thank you so much! It is a great effort for putting this up. Overall looks good. I made some comments, we can discuss at tomorrow's community meeting.
>> > > > >> > >
>> > > > >> > > - Sijie
>> > > > >> > >
>> > > > >> > > On Wed, Aug 23, 2017 at 8:25 AM, Enrico Olivelli <eolive...@gmail.com> wrote:
>> > > > >> > >
>> > > > >> > > > Hi all,
>> > > > >> > > > I have drafted a first proposal for BP-14 - Relax Durability
>> > > > >> > > >
>> > > > >> > > > We are talking about limiting the number of fsync to the journal while preserving the correctness of the LAC protocol.
>> > > > >> > > >
>> > > > >> > > > This is the link to the wiki page, but as the issue is huge we prefer to use Google Documents for sharing comments
>> > > > >> > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP+-+14+Relax+durability
>> > > > >> > > >
>> > > > >> > > > This is the document
>> > > > >> > > > https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit?usp=sharing
>> > > > >> > > >
>> > > > >> > > > All comments are welcome
>> > > > >> > > >
>> > > > >> > > > I have added DL dev list in cc as the discussion is interesting for both groups
>> > > > >> > > >
>> > > > >> > > > Enrico Olivelli
>> > > > >> > > >
>> > > > >>
>> > > > >> --
>> > > > >> Jvrao
>> > > > >> ---
>> > > > >> First they ignore you, then they laugh at you, then they fight you, then you win. - Mahatma Gandhi
>> > > > >>
>> > > > > --
>> > > > >
>> > > > > -- Enrico Olivelli
>> > > > >
>> > >
>> > > --
>> > > Jvrao
>> > > ---
>> > > First they ignore you, then they laugh at you, then they fight you, then you win. - Mahatma Gandhi