Okay, but why do you want to track the list of files? I don't get your idea here.

- Sijie

On Sun, Oct 8, 2017 at 11:45 PM, Enrico Olivelli <eolive...@gmail.com> wrote:

> 2017-10-09 7:52 GMT+02:00 Sijie Guo <guosi...@gmail.com>:
>
> > On Sat, Oct 7, 2017 at 9:53 AM, Enrico Olivelli <eolive...@gmail.com> wrote:
> >
> > > On Sat, Oct 7, 2017, 00:27, Sijie Guo <guosi...@gmail.com> wrote:
> > >
> > > > Enrico,
> > > >
> > > > Let's try to come to a conclusion or an agreement on what we should fix and improve, before talking about who is going to drive this.
> > >
> > > Sure.
> > >
> > > This is my point of view.
> > > We have separate issues:
> > > - missing checksums to protect fence bits
> > > - a bug in bookie boot: we should not allow empty directories
> > > - having a clear lifecycle for the bookie (add/remove)
> > > - dealing with reincarnation of bookies
> > > - ensuring the correctness of the contents of the bookie's directories
> > >
> > > I would like to add a new point. We have the cookie inside every configured directory managed by the bookie.
> > > No cookie -> no boot
> > > This will not be enough: we have to write in that file not only the identity of the bookie but also the list of files expected to be in the directory.
> > > This way you will not boot with a corrupted directory.
> > > Config -> list of dirs -> list of files
> >
> > I am not sure why this is a new point. This is exactly what the cookie is doing, no?
>
> Sorry, I can't find such behavior in the code on the master branch:
> https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Cookie.java
>
> If we have a copy of the cookie inside each directory (index + data + journal), I mean that each cookie file should carry the exact list of files expected to be present in that directory at boot.
> So, for instance, when you add a new file to the set of files in a journal directory you must update the cookie file in that directory; the same goes for index, data, and so on.
>
> Maybe I am missing something.
> It seems to me that the cookie contains only a list of directories, not of "files".
>
> Enrico
>
> > > I agree on the fact that the bookie should be added (bookie format) only if there is no reference to it in zk.
> > > The bookie format operation should write the cookie in every configured directory so that a bookie with empty directories won't ever start.
> > >
> > > I have to think more about this, but I wanted to share my first thoughts.
> > >
> > > Enrico
> > >
> > > > - Sijie
> > > >
> > > > On Fri, Oct 6, 2017 at 1:14 PM, Enrico Olivelli <eolive...@gmail.com> wrote:
> > > >
> > > > > +1 for fixing the problem of the missing cookie in 4.6
> > > > >
> > > > > Who drives the issue?
> > > > >
> > > > > Thank you all for the interesting points
> > > > > Enrico
> > > > >
> > > > > On Fri, Oct 6, 2017, 21:27, Venkateswara Rao Jujjuri <jujj...@gmail.com> wrote:
> > > > >
> > > > > > Thanks for the writeup Sijie, comments below.
> > > > > >
> > > > > > On Fri, Oct 6, 2017 at 12:14 PM, Sijie Guo <guosi...@gmail.com> wrote:
> > > > > >
> > > > > > > I think the question is mainly around "how do we recognize the bookie" or "incarnations". And the cookie is designed to address "incarnations".
> > > > > > >
> > > > > > > I will try to cover the following aspects, and will try to answer the questions that Ivan and JV raised.
> > > > > > >
> > > > > > > - what is a cookie?
> > > > > > > - how did the behavior become bad?
> > > > > > > - how do we fix the current bad behavior?
> > > > > > > - is the cookie enough?
> > > > > > >
> > > > > > > *What is Cookie?*
> > > > > > >
> > > > > > > Cookie was originally introduced in this commit:
> > > > > > > https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
> > > > > > >
> > > > > > > A cookie is an identifier of a bookie. A cookie is created on zookeeper when a brand new bookie joins the cluster; the cookie represents the bookie instance during its lifecycle. The cookie is stored on all the disks for verification purposes, so if any of the disks is missing the cookie (e.g. disks were reformatted or wiped out, or disks are not mounted correctly), a bookie will refuse to start.
> > > > > > >
> > > > > > > *How did the behavior become bad?*
> > > > > > >
> > > > > > > The original behavior worked as expected: it used the cookie in zookeeper as the source of truth. See
> > > > > > > https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
> > > > > > >
> > > > > > > The behavior was changed at
> > > > > > > https://github.com/apache/bookkeeper/commit/19b821c63b91293960041bca7b031614a109a7b8
> > > > > > > when trying to support both ip and hostname. It used the journal directory as the source of truth for verifying cookies.
> > > > > > >
> > > > > > > At the community meeting, I was saying a bookie should refuse to start when a cookie file is missing locally, and that was my operational experience. It turns out Twitter's branch didn't include the change at 19b821c63b91293960041bca7b031614a109a7b8, so it still had the original behavior from c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5.
> > > > > > >
> > > > > > > *How do we fix the current bad behavior?*
> > > > > > >
> > > > > > > We basically need to revert the current behaviour to the originally designed behavior.
> > > > > > > The cookie in zookeeper should be the source of truth for validation.
> > > > > > >
> > > > > > > If the cookie works as expected (change the behavior to the original behavior), then it is the operational or lifecycle management issue I explained above.
> > > > > > >
> > > > > > > If a bookie fails with a missing cookie, we should:
> > > > > > >
> > > > > > > 1. take it out of the cluster
> > > > > > > 2. run re-replication (autorecovery or manual recovery)
> > > > > > > 3. ensure no ledgers are using this bookie any more
> > > > > > > 4. reformat the bookie
> > > > > > > 5. add it back
> > > > > > >
> > > > > > > This can be automated by hooking into a scheduler (like k8s or mesos). But it requires some sort of lifecycle management in order to automate such operations. There is a BP-4:
> > > > > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-4+-+BookKeeper+Lifecycle+Management
> > > > > > > proposed for this purpose.
> > > > > > >
> > > > > > > *Is the cookie enough?*
> > > > > > >
> > > > > > > The cookie (if we revert the current behavior to the original behavior) should be able to address most of the issues related to "incarnations".
> > > > > > >
> > > > > > > There are still some corner cases that will violate correctness. They are related to the "dangling writers" described in Ivan's first comment.
> > > > > > >
> > > > > > > How can a writer tell whether the bookies changed or the ledger changed when it gets network partitioned?
> > > > > > >
> > > > > > > 1) Bookie Changed.
> > > > > > >
> > > > > > > A bookie can be reformatted and re-added to the cluster. Ivan and JV already touched on this with adding a UUID.
> > > > > > >
> > > > > > > I think the UUID doesn't have to be part of the ledger metadata, because the auditor and replication worker would use the lifecycle management for managing the lifecycle of bookies.
> > > > > >
> > > > > > You are suggesting that the 'manual/scripted' lifecycle tool comes to the rescue: a sidecar solution.
> > > > > >
> > > > > > But what are we saving by not keeping this info in the metadata? Metadata size? Sure, it is a huge win in a ZK environment.
> > > > > >
> > > > > > > But the connection should have the UUID information.
> > > > > >
> > > > > > By this you are suggesting that the service discovery portion needs to have the UUID info but not the metadata portion. Won't it be confusing to handle the case where a write fails on a bookie because of a UUID mismatch? You may need to handle that case, and if you go back to the same bookie then there is no ensemble change.
> > > > > >
> > > > > > On the other hand, if we introduce the UUID into the metadata, then we don't need to explicitly depend on the sidecar solution.
> > > > > >
> > > > > > > Basically, when any bookie client connects to a bookie, it needs to carry the namespace uuid and the bookie uuid to ensure it is connecting to the right bookie. This would prevent "dangling writers" from connecting to bookies that were reformatted and added back.
> > > > > >
> > > > > > While this is an issue, the problem can only get exposed in a pathological scenario where AQ bookies have gone through this scenario, which is ~ 3
> > > > > >
> > > > > > > 2) Ledger Changed.
> > > > > > >
> > > > > > > It is similar to the case that Ivan described. If a writer becomes 'network partitioned' and the ledger is deleted during this period, then after the writer comes back it can still successfully write entries to the bookies, because the ledgers are already deleted and all the fencing bits are gone.
> > > > > > >
> > > > > > > This violates the expectation of "fencing", but I am not sure we need to spend time on fixing this, because the ledger is already explicitly deleted by the application. So I think the behavior should be categorized as "undefined", just like "deleting a ledger when a writer is still writing entries" is undefined behavior.
> > > > > > >
> > > > > > > To summarize my thoughts on this:
> > > > > > >
> > > > > > > 1. we need to revert the cookie behaviour to the original behavior and make sure the cookie works as expected.
> > > > > > > 2. introduce a UUID or epoch in the cookie. The client connection should carry the namespace uuid and the bookie uuid when establishing the connection.
> > > > > > > 3. work on BP-4 to have complete lifecycle management for taking a bookie out and adding it back.
> > > > > > >
> > > > > > > 1 is the immediate fix, so correct operations can still guarantee correctness.
> > > > > >
> > > > > > I agree we need to take care of #1 ASAP and have issues opened and designs for #2 and #3.
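
Item 2 of the summary above (the connection carries the namespace uuid and the bookie uuid) can be illustrated with a small sketch of the connection-time check. This is not BookKeeper's wire protocol or API: `BookieIdentity`, `HandshakeCheck.accept`, and the rule of regenerating the bookie id on every reformat are assumptions made purely for illustration.

```java
import java.util.Objects;
import java.util.UUID;

/** Hypothetical identity a bookie would persist in its cookie at format time. */
final class BookieIdentity {
    final UUID namespaceId;
    final UUID bookieId; // assumed to be regenerated on every reformat

    BookieIdentity(UUID namespaceId, UUID bookieId) {
        this.namespaceId = Objects.requireNonNull(namespaceId);
        this.bookieId = Objects.requireNonNull(bookieId);
    }

    /** A reformatted bookie stays in its namespace but becomes a new incarnation. */
    BookieIdentity reformat() {
        return new BookieIdentity(namespaceId, UUID.randomUUID());
    }
}

final class HandshakeCheck {
    private HandshakeCheck() {}

    /** Bookie side: accept only a client that expects exactly this incarnation. */
    static boolean accept(BookieIdentity actual, UUID expectedNamespace, UUID expectedBookie) {
        return actual.namespaceId.equals(expectedNamespace)
            && actual.bookieId.equals(expectedBookie);
    }

    public static void main(String[] args) {
        BookieIdentity b1t1 = new BookieIdentity(UUID.randomUUID(), UUID.randomUUID());
        BookieIdentity b1t2 = b1t1.reformat();
        // A dangling writer still holds the ids it learned from the old incarnation.
        System.out.println("old incarnation accepts: "
            + accept(b1t1, b1t1.namespaceId, b1t1.bookieId));
        System.out.println("new incarnation accepts: "
            + accept(b1t2, b1t1.namespaceId, b1t1.bookieId));
    }
}
```

Whether the expected bookie id comes from service discovery or from ledger metadata is exactly the open question in the thread; the sketch takes no position on it and only shows the comparison itself.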
> > > > > >
> > > > > > Thanks,
> > > > > > JV
> > > > > >
> > > > > > > - Sijie
> > > > > > >
> > > > > > > On Fri, Oct 6, 2017 at 9:35 AM, Venkateswara Rao Jujjuri <jujj...@gmail.com> wrote:
> > > > > > >
> > > > > > > > > However, imagine that the fenced message is only in the journal on b2, b2 crashes, something wipes the journal directory and then b2 comes back up.
> > > > > > > >
> > > > > > > > In this case what happened?
> > > > > > > > 1. We have WQ = 1
> > > > > > > > 2. We had data loss (crash and come up clean)
> > > > > > > >
> > > > > > > > But yeah, in addition to data loss we have a fencing violation too.
> > > > > > > > The problem is not just the wiped journal dir, but how we recognize the bookie.
> > > > > > > > A bookie is recognized only by its ip address, not by its incarnation.
> > > > > > > > Bookie1 at T1 (b1t1) and the same bookie1 at T2 after bookie format (b1t2) should be two different bookies, shouldn't they?
> > > > > > > > This is needed for the replication worker and the auditor too.
> > > > > > > >
> > > > > > > > Also, the bookie needs to know whether the writer/reader intends to read from b1t2 and not from b1t1.
> > > > > > > > Looks like we have a hole here? Or I may not be fully understanding the cookie verification mechanism.
> > > > > > > >
> > > > > > > > Also, as Ivan pointed out, we appear to treat the lack of a journal as implicitly meaning a new bookie, but the overall cluster doesn't differentiate between incarnations.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > JV
> > > > > > > >
> > > > > > > > On Fri, Oct 6, 2017 at 8:46 AM, Ivan Kelly <iv...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > > The case you described here is "almost correct". But there is a key point here: B2 can't start up if the journal disk is wiped out, because the cookie is missing.
> > > > > > > > > This is what I expected to see, but isn't the case.
> > > > > > > > > <snip>
> > > > > > > > > List<Cookie> journalCookies = Lists.newArrayList();
> > > > > > > > > // try to read cookie from journal directory.
> > > > > > > > > for (File journalDirectory : journalDirectories) {
> > > > > > > > >     try {
> > > > > > > > >         Cookie journalCookie = Cookie.readFromDirectory(journalDirectory);
> > > > > > > > >         journalCookies.add(journalCookie);
> > > > > > > > >         if (journalCookie.isBookieHostCreatedFromIp()) {
> > > > > > > > >             conf.setUseHostNameAsBookieID(false);
> > > > > > > > >         } else {
> > > > > > > > >             conf.setUseHostNameAsBookieID(true);
> > > > > > > > >         }
> > > > > > > > >     } catch (FileNotFoundException fnf) {
> > > > > > > > >         newEnv = true;
> > > > > > > > >         missedCookieDirs.add(journalDirectory);
> > > > > > > > >     }
> > > > > > > > > }
> > > > > > > > > </snip>
> > > > > > > > >
> > > > > > > > > So if a journal is missing the cookie, newEnv is set to true. This disables the later checks.
> > > > > > > > >
> > > > > > > > > > However, it can still happen in a different case: a bit flip. In your case, if the fence bit in b2 is already persisted on disk, but it got corrupted.
> > > > > > > > > > Then it will cause the issue you described. One problem is that we don't have a checksum on the index file header when it stores those fence bits.
> > > > > > > > > Yes, this is also an issue.
> > > > > > > > >
> > > > > > > > > -Ivan
> > > > > > > >
> > > > > > > > --
> > > > > > > > Jvrao
> > > > > > > > ---
> > > > > > > > First they ignore you, then they laugh at you, then they fight you, then you win. - Mahatma Gandhi
> > > > >
> > > > > --
> > > > > -- Enrico Olivelli
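
Ivan's excerpt shows a missing journal cookie switching the bookie into the new-environment path. One possible stricter rule, in line with Sijie's point that the cookie in zookeeper should be the source of truth, is sketched below. It is not BookKeeper code: `StrictCookieCheck`, `canStart`, and the use of plain strings to stand in for cookies are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of a boot-time cookie check; not the BookKeeper Cookie API. */
final class StrictCookieCheck {
    private StrictCookieCheck() {}

    /**
     * @param zkCookie   cookie registered in zookeeper for this bookie, if any
     * @param dirCookies one entry per configured directory (journal, ledger, index);
     *                   an empty Optional means the directory has no cookie file
     * @return true if the bookie may start
     */
    static boolean canStart(Optional<String> zkCookie, List<Optional<String>> dirCookies) {
        if (!zkCookie.isPresent()) {
            // Brand new bookie: allowed only if no directory carries a leftover cookie.
            return dirCookies.stream().noneMatch(Optional::isPresent);
        }
        String expected = zkCookie.get();
        // Known bookie: every directory must hold a cookie equal to the zookeeper one.
        // A missing cookie (wiped or unmounted disk) means refuse, not "new environment".
        return !dirCookies.isEmpty()
            && dirCookies.stream().allMatch(c -> c.isPresent() && c.get().equals(expected));
    }

    public static void main(String[] args) {
        Optional<String> zk = Optional.of("bookie-1:3181");
        // Journal directory wiped, ledger directory intact.
        List<Optional<String>> dirs =
            Arrays.asList(Optional.<String>empty(), Optional.of("bookie-1:3181"));
        System.out.println("start allowed: " + canStart(zk, dirs));
    }
}
```

With this rule the wiped-journal case from the thread makes the bookie refuse to start, which pushes the operator toward the take-out, re-replicate, reformat, add-back procedure rather than a silent restart.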