> The limit of the given patches is that it is simply skipping all of the
> writes to the journal, and this in turn is a big problem:
> - if you restart the bookie it is likely that you lose your data, and
> especially the 'fenced' flag
> - clients cannot rely on most of the guarantees that BK provides

There are two problems (a restatement of the above).
- A bookie may accept writes for a ledger for which it has previously
  promised not to (loss of the fenced bit)
- A bookie may reply negatively to a read of a ledger's entry which it has
  previously acknowledged receipt of (breaks consistency guarantees)

In both cases, the problem is unclosed ledgers. Suppose the bookie, when it
starts, can detect a non-clean shutdown. If it does, it can find all
unclosed ledgers which were writing to it, and
a) accept no more writes for them
b) not reply negatively to requests for entries of those ledgers which do
   not exist on the bookie.

a) is similar in effect to fencing. If a client was actively writing to the
ledger, it should have updated the ensemble by that time in any case.

b) is a new concept (let's call it limbo). If the entries do not exist
locally, they may still have existed previously. So to respond negatively
would be untrue, and it would mess up the recovery process.

As you mentioned, Splunk already has this change internally. I'm going to
start another thread about that.

In summary, skipping the journal is fine if you have some other things in
place. However, I would make it a cluster-wide property. If we say skipping
the journal is safe (due to multi-AZ and the extra checks) then it should be
safe for all.

-Ivan

> Also another problem is that those implementations work on a per-bookie
> basis, I understand that the user in those cases is Pulsar and usually you
> do not share your BK cluster with other applications (is it really true ?
> think about PulsarFunctions and BK StreamStorage service....).
>
> Btw this is not true for our case at EmailSuccess.com and also at
> MagNews.com, in which we are sharing the bookies with other components
> (like HerdDB, DistributedLog, BlobIt).
>
> Skipping the journal is a good trade off in several cases, because it makes
> writes blazing fast and also reduces the write amplification.
>
> I would like to wrap up all of this stuff and provide a feature to BK, to
> be used consistently by all of the users.
>
> I think that it will be far better to have a WriteFlag to enable this
> feature, this way different clients will be able to express their
> durability constraints and service level regarding this feature.
>
> Also when the Bookie is not writing to the Journal, after a restart, we
> should tell to the clients that the Bookie is not able to return data for a
> given ledger or to tell if the ledger has been fenced. IIUC Ivan and Matteo
> already have this change in their private fork.
>
>
> I will be happy to start a BP or to help any other volunteer in writing it.
> We should work as a community on this topic.
>
> Thoughts ?
> Enrico
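
To make the startup check and the limbo state from the reply above a bit
more concrete, here is a rough sketch in Java. It is not the change from any
fork and not BookKeeper code: the class LimboBookieSketch, its methods, and
the UNKNOWN result are all made up for illustration, and how the bookie
learns about the unclean shutdown or the set of unclosed ledgers is simply
assumed as constructor input.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; none of these names exist in BookKeeper.
class LimboBookieSketch {
    enum WriteResult { OK, REJECTED }
    enum ReadResult { OK, NO_SUCH_ENTRY, UNKNOWN }

    // ledgerId -> entry ids that are actually present on this bookie
    private final Map<Long, Set<Long>> entries = new ConcurrentHashMap<>();
    // ledgers that were unclosed when a non-clean shutdown was detected
    private final Set<Long> limboLedgers = ConcurrentHashMap.newKeySet();

    // Assumption: the caller already knows whether the last shutdown was
    // clean and which unclosed ledgers had this bookie in their ensemble.
    LimboBookieSketch(boolean cleanShutdown, Set<Long> unclosedLedgers) {
        if (!cleanShutdown) {
            limboLedgers.addAll(unclosedLedgers);
        }
    }

    // Stands in for entries that survived the restart without a journal.
    void restoreEntry(long ledgerId, long entryId) {
        entries.computeIfAbsent(ledgerId, k -> ConcurrentHashMap.newKeySet())
               .add(entryId);
    }

    // a) accept no more writes for ledgers in limbo (similar to fencing)
    WriteResult addEntry(long ledgerId, long entryId) {
        if (limboLedgers.contains(ledgerId)) {
            return WriteResult.REJECTED;
        }
        restoreEntry(ledgerId, entryId);
        return WriteResult.OK;
    }

    // b) never answer "no such entry" for a ledger in limbo, because the
    // entry may have been acknowledged and then lost
    ReadResult readEntry(long ledgerId, long entryId) {
        Set<Long> stored = entries.get(ledgerId);
        if (stored != null && stored.contains(entryId)) {
            return ReadResult.OK;
        }
        return limboLedgers.contains(ledgerId)
                ? ReadResult.UNKNOWN
                : ReadResult.NO_SUCH_ENTRY;
    }

    public static void main(String[] args) {
        // Ledger 7 was unclosed at the time of the unclean shutdown.
        LimboBookieSketch bookie = new LimboBookieSketch(false, Set.of(7L));
        bookie.restoreEntry(7L, 0L);
        System.out.println(bookie.addEntry(7L, 1L));
        System.out.println(bookie.readEntry(7L, 0L));
        System.out.println(bookie.readEntry(7L, 1L));
        System.out.println(bookie.readEntry(8L, 0L));
    }
}
```

The point of the third result value is that a reader doing recovery can
tell "this bookie never had the entry" apart from "this bookie cannot say",
and only treat the former as a negative answer.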