/cc (distributedlog-dev@) I know JV has similar use cases. This might require a broad discussion. The trickiest part would be the LAC protocol: when can the client advance the LAC? I think a BP, initially as a Google Doc shared with the community, would be a good way to start the discussion, because I expect a lot of points to discuss on this topic. Once we finalize the details, we can copy the Google Doc content back to the wiki page.
Other comments inline:

On Thu, Aug 17, 2017 at 4:42 AM, Enrico Olivelli <eolive...@gmail.com> wrote:

> Hi,
> I am working with my colleagues on an implementation to relax the
> constraint that every acknowledged entry must have been successfully
> written and fsynced to disk at journal level.
>
> The idea is to have a flag in addEntry to ask for acknowledgement not after the
> fsync in the journal but as soon as data has been successfully written and
> flushed to the OS.
>
> I have the requirement that if an entry requires sync, all the entries
> successfully sent 'before' that entry (causality) are synced too, even if
> they have been added with the new relaxed durability flag.
> Imagine a database transaction log: during a transaction I will write every
> change to data to the WAL with the new flag, and only the commit
> transaction command will be added with the sync requirement. The idea is that
> all the changes inside the scope of the transaction have a meaning only if
> the transaction is committed, so it is important that the commit entry
> won't be lost, and if that entry isn't lost, none of the other entries of the
> same transaction are lost either.
>

Can you do:

- lh.asyncAddEntry('entry-1')
- lh.asyncAddEntry('entry-2')
- lh.addEntry('commit')

? Does this work for you? If it doesn't, what is the problem? Do you have any performance numbers to show why this doesn't work?

> I have another use case. In another project I am storing binary objects
> in BK and I have to obtain great performance even on single-disk bookie
> layouts (journal + data + index on the same partition). In this project it
> is acceptable to compensate for the risk of not doing fsync by requesting
> enough replication.
> IMHO it will be somewhat like the Kafka idea of durability: as far as I know,
> Kafka by default does not impose fsync but leaves it all to the OS and to
> the fact that there is a minimal configurable number of replicas which are
> in sync.
>
When you talk about Kafka durability, what durability level are you looking for? Are you looking for replication durability without fsync?

> There are many open points, already suggested by Matteo, JV and Sijie:
> - LAC protocol?
> - replication in case of lost entries?
> - under production load, mixing non-synced entries with synced entries
> will not give much benefit
>

A couple of thoughts on this feature:

1) We should always stick to one rule: the LAC should only be advanced on receiving acknowledgement of entries (persisted on disk after fsync; it can bypass the journal if necessary). That way all the assumptions for LAC and replication remain the same and no change is needed.

2) Separating the acknowledgement of replication from the acknowledgement of fsync (LAC) can achieve 'replicated durability without fsync' while still maintaining the correctness of the LAC. That means: an add request (no-sync) can be completed after receiving enough responses from bookies, but the response to a no-sync add can't advance the LAC. The LAC can only be advanced on acknowledgement of sync adds.

3) Request ordering and ensemble changes will make it complicated to ensure correctness. The elegance of the current replication durability with fsync is that you don't rely on request ordering or physical layout to ensure ordering and correctness. However, if you relax durability and mix sync adds and no-sync adds, you have to pay attention to request ordering and flush ordering to ensure correctness, which is going to make things tricky and complicated.

> For the LAC protocol I think that there is no impact; the point is that the
> LastAddConfirmed is the max entryid which is known to have been
> acknowledged to the writer, so durability is not a concern. You can lose
> entries even with fsync, just by losing all the disks which contain the
> data. Without fsync it is just more probable.
>

I am against relaxing durability for the LAC protocol, because that is the foundation of correctness.
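To make the rule in points 1) and 2) concrete, here is a minimal sketch in Python. It is only a toy model, not the BookKeeper client API: `LedgerModel`, `add`, the `sync` flag and `pending_no_sync` are hypothetical names. It assumes a single writer, no ensemble changes, and that a sync add is acknowledged only once every earlier entry has been fsynced as well (the causality requirement from Enrico's mail).

```python
from dataclasses import dataclass, field


@dataclass
class LedgerModel:
    """Toy single-writer ledger; hypothetical, not BookKeeper code."""

    next_entry_id: int = 0
    lac: int = -1  # LastAddConfirmed; -1 means nothing is confirmed yet
    entries: dict = field(default_factory=dict)
    pending_no_sync: list = field(default_factory=list)

    def add(self, payload: bytes, sync: bool) -> int:
        """Record an add and return its entry id once it is 'completed'."""
        entry_id = self.next_entry_id
        self.next_entry_id += 1
        self.entries[entry_id] = payload
        if sync:
            # Assumption: the sync ack arrives only after this entry and all
            # earlier ones are fsynced, so the LAC may move up to this entry.
            self.pending_no_sync.clear()
            self.lac = entry_id
        else:
            # A no-sync add completes after enough bookie responses,
            # but it must never advance the LAC.
            self.pending_no_sync.append(entry_id)
        return entry_id


ledger = LedgerModel()
ledger.add(b"entry-1", sync=False)
ledger.add(b"entry-2", sync=False)
print(ledger.lac)  # -1: no-sync adds completed, LAC unchanged
ledger.add(b"commit", sync=True)
print(ledger.lac)  # 2: the sync add confirms itself and everything before it
```

The sketch deliberately leaves out point 3): with real bookies, request ordering, flush ordering and ensemble changes decide whether the "everything before it" assumption actually holds.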
I would prefer advancing the LAC only when entries are replicated and durably synced to disks.

> Replication: maybe we should write in the ledger metadata that the ledger
> allows this feature and deal with it. But I am not sure, I have to
> understand better how LedgerHandleAdv deals with sparse entryids inside the
> re-replication process
>

Replication should not have to change if we stick to the same LAC behavior.

> Mixed workload: honestly I would like to add this feature to limit the
> number of fsyncs, and I expect to have lots of bursts of unsynced entries
> interleaved with a few synced entries. I know that this feature is
> not to be encouraged in general but only for specific cases, like the
> stories of LedgerHandleAdv or readUnconfirmedEntries
>
> If this makes sense to you I will create a BP and attach a first patch
>

Sure.

> Enrico
>
> --
> -- Enrico Olivelli
>