On Fri, Oct 6, 2017 at 12:27 PM, Venkateswara Rao Jujjuri <jujj...@gmail.com> wrote:
> Thanks for the writeup Sijie, comments below.
>
> On Fri, Oct 6, 2017 at 12:14 PM, Sijie Guo <guosi...@gmail.com> wrote:
>
> > I think the question is mainly around "how do we recognize the bookie" or
> > "incarnations". The purpose of a cookie is to address "incarnations".
> >
> > I will try to cover the following aspects, and try to answer the questions
> > that Ivan and JV raised:
> >
> > - what is a cookie?
> > - how did the behavior become bad?
> > - how do we fix the current bad behavior?
> > - is the cookie enough?
> >
> >
> > *What is a Cookie?*
> >
> > The cookie was originally introduced in this commit -
> > https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
> >
> > A cookie is an identifier of a bookie. A cookie is created in zookeeper
> > when a brand new bookie joins the cluster, and it represents that bookie
> > instance during its lifecycle. The cookie is also stored on all the disks
> > for verification purposes, so if any of the disks is missing the cookie
> > (e.g. disks were reformatted or wiped out, or are not mounted correctly),
> > the bookie will refuse to start.
> >
> >
> > *How did the behavior become bad?*
> >
> > The original behavior worked as expected, using the cookie in zookeeper as
> > the source of truth. See
> > https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5
> >
> > The behavior was changed at
> > https://github.com/apache/bookkeeper/commit/19b821c63b91293960041bca7b031614a109a7b8
> > when trying to support both ip and hostname. That change used the journal
> > directory as the source of truth for verifying cookies.
> >
> > At the community meeting, I said a bookie should refuse to start when a
> > cookie file is missing locally, and that was my operational experience.
> > It turns out twitter's branch didn't include the change at
> > 19b821c63b91293960041bca7b031614a109a7b8, so it still had the original
> > behavior from c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5.
> >
> > *How do we fix the current bad behavior?*
> >
> > We basically need to revert the current behavior to the originally
> > designed behavior: the cookie in zookeeper should be the source of truth
> > for validation.
> >
> > If the cookie works as expected (i.e. we change the behavior back to the
> > original), then what remains is the operational or lifecycle management
> > issue I explained above.
> >
> > If a bookie fails with a missing cookie, it should be:
> >
> > 1. taken out of the cluster
> > 2. re-replicated (autorecovery or manual recovery)
> > 3. verified that no ledgers are using this bookie any more
> > 4. reformatted
> > 5. added back
> >
> > This can be automated by hooking into a scheduler (like k8s or mesos), but
> > it requires some sort of lifecycle management in order to automate such
> > operations. There is a BP-4:
> > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-4+-+BookKeeper+Lifecycle+Management
> > proposed for this purpose.
> >
> >
> > *Is the cookie enough?*
> >
> > The cookie (if we revert the current behavior to the original behavior)
> > should address most of the issues related to "incarnations".
> >
> > There are still some corner cases that violate correctness. They are
> > related to the "dangling writers" described in Ivan's first comment.
> >
> > How can a writer tell whether the bookies changed or the ledger changed
> > when it gets network partitioned?
> >
> > 1) Bookie Changed.
> >
> > A bookie can be reformatted and re-added to the cluster. Ivan and JV
> > already touched on this when discussing adding a UUID.
> >
> > I think the UUID doesn't have to be part of the ledger metadata, because
> > the auditor and replication worker would use lifecycle management for
> > managing the lifecycle of bookies.
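To make step 3 above ("verify that no ledgers are using this bookie any more") concrete, here is a rough sketch in Java. The types and names are made up for illustration; the real auditor/replication worker reads ledger metadata from zookeeper, while here the ensembles are simply passed in as a map:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: decide whether a bookie may be reformatted and
// re-added, by checking that no ledger's ensembles still reference it.
public class BookieReferenceCheck {

    // ledgerEnsembles: ledgerId -> bookie addresses across its ensembles
    static boolean safeToReAdd(Map<Long, List<String>> ledgerEnsembles,
                               String bookieAddress) {
        for (Map.Entry<Long, List<String>> e : ledgerEnsembles.entrySet()) {
            if (e.getValue().contains(bookieAddress)) {
                // some ledger still references this bookie:
                // it must be re-replicated before the bookie is re-added
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Long, List<String>> ledgers = Map.of(
            1L, Arrays.asList("bookie1:3181", "bookie2:3181"),
            2L, Arrays.asList("bookie2:3181", "bookie3:3181"));
        System.out.println(safeToReAdd(ledgers, "bookie1:3181")); // false
        System.out.println(safeToReAdd(ledgers, "bookie4:3181")); // true
    }
}
```

The point of the sketch is only the rule being enforced: re-adding is allowed exactly when the scan over all ledger metadata finds no remaining reference.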
>
> You are suggesting that the 'manual/scripted' lifecycle tool comes to the
> rescue -- a side-cart solution. But what are we saving by not keeping this
> info in the metadata? Metadata size? Sure, that is a huge win in a ZK
> environment.

If you never add a malformed bookie back (removing the cookie from zookeeper
and reformatting), then the cookie is already there to guarantee correctness.
So the real correctness issues arise only when we add a bookie back (removing
the cookie from zookeeper).

If you allow a bookie to be added back, you need clean lifecycle management in
bookkeeper to ensure correctness. It addresses the question "when is it okay
to add a bookie back".

The approach I am suggesting (regardless of the implementation) enforces the
rule: "*a bookie can only be added to a cluster after no ledgers are
referencing it*". It can be done by the auditor and replication worker, or by
whatever mechanism can achieve this.

The approach you are suggesting -- adding a UUID to the ledger metadata --
basically doesn't enforce any such rule. It says you are okay to add a bookie
back at any time, because the UUID in the ledger metadata will handle it. In
theory, it is correct. However, I have concerns around simplicity, operations
and debuggability with this approach. That is:

- it requires quite a few changes to the metadata format, on clients and also
  in replication. Backward compatibility is also a big concern.
- when a problem occurs at midnight, it is going to be hard to investigate,
  because a ledger might potentially reference a bookie at different points of
  its lifecycle.

From an operational perspective, I would prefer a simple solution. That is, I
like the simplicity of enforcing "a bookie can only be added to a cluster
after no ledgers are referencing it", but I am fine with any approach that
takes us there.

> > > > > But the connection should have the UUID information.
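To illustrate what carrying the uuids on a connection could look like, here is a rough sketch in Java. All names are made up for illustration; this is not the actual BookKeeper wire protocol:

```java
import java.util.Objects;
import java.util.UUID;

// Hypothetical sketch of a connection-time check: the client presents the
// (namespace uuid, bookie uuid) pair it believes it is talking to, and the
// bookie rejects the connection on any mismatch -- much like an
// authentication step.
public class UuidHandshake {

    static final class BookieIdentity {
        final UUID namespaceUuid; // identifies the cluster/namespace
        final UUID bookieUuid;    // identifies this incarnation of the bookie

        BookieIdentity(UUID namespaceUuid, UUID bookieUuid) {
            this.namespaceUuid = namespaceUuid;
            this.bookieUuid = bookieUuid;
        }
    }

    // Server side: accept the connection only if both uuids match.
    static boolean acceptConnection(BookieIdentity server,
                                    BookieIdentity presented) {
        return Objects.equals(server.namespaceUuid, presented.namespaceUuid)
            && Objects.equals(server.bookieUuid, presented.bookieUuid);
    }

    public static void main(String[] args) {
        UUID ns = UUID.randomUUID();
        UUID incarnation1 = UUID.randomUUID();
        UUID incarnation2 = UUID.randomUUID(); // after reformat + re-add

        BookieIdentity server = new BookieIdentity(ns, incarnation2);
        // a dangling writer still holding the old incarnation's uuid is
        // rejected; a client with the current uuids is accepted
        System.out.println(
            acceptConnection(server, new BookieIdentity(ns, incarnation1))); // false
        System.out.println(
            acceptConnection(server, new BookieIdentity(ns, incarnation2))); // true
    }
}
```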
>
> By this you are suggesting that the service discovery portion needs to have
> the UUID info but not the metadata portion. Won't it be confusing to handle
> the case where a write fails on a bookie because of a UUID mismatch? You may
> need to handle that case, and if you go back to the same bookie then there
> are no ensemble changes.
>
> On the other hand, if we introduce the UUID into the metadata, then we don't
> need to depend explicitly on the side-cart solution.

If you keep "a bookie can only be added to a cluster after no ledgers are
referencing it" in mind, then the approach of presenting the uuids when
establishing connections makes sense. At any given time, there is only one
bookie running at a given host:port, with its namespace uuid and instance
uuid. The combination of namespace uuid and bookie uuid represents a bookie in
a given cluster. The client uses this information to establish the connection.
If you are carrying the wrong uuids, that means you are connecting to the
wrong cluster or to a bad bookie, and no operation should succeed.

It is just like 'authentication'. It is very simple and straightforward, it
applies to any type of request to the bookie, and it is easy to scale if we
add new types of requests in the future.

> > Basically, when any bookie client connects to a bookie, it needs to carry
> > the namespace uuid and the bookie uuid to ensure it is connecting to the
> > right bookie. This would prevent "dangling writers" from connecting to
> > bookies that were reformatted and added back.

> While this is an issue, the problem can only get exposed in a pathological
> scenario where the AQ bookies have gone through this scenario, which is ~ 3

I am not sure I understand the comment here.

> > 2) Ledger Changed.
> >
> > It is similar to the case that Ivan described.
> > If a writer becomes 'network partitioned', and the ledger is deleted
> > during this period, then after the writer comes back it can still
> > successfully write entries to the bookies, because the ledger is already
> > deleted and all the fencing bits are gone.
> >
> > This violates the expectation of "fencing", but I am not sure we need to
> > spend time on fixing it, because the ledger was already explicitly deleted
> > by the application. So I think the behavior should be categorized as
> > "undefined", just like "deleting a ledger while a writer is still writing
> > entries" is undefined behavior.
> >
> >
> > To summarize my thoughts on this:
> >
> > 1. we need to revert the cookie behavior to the original behavior, and
> > make sure the cookie works as expected.
> > 2. introduce a UUID or epoch in the cookie. Client connections should
> > carry the namespace uuid and bookie uuid when establishing the connection.
> > 3. work on BP-4 to have complete lifecycle management for taking a bookie
> > out and adding a bookie back.
> >
> > 1 is the immediate fix, so correct operations can still guarantee
> > correctness.

> I agree we need to take care of #1 ASAP and have issues opened and designs
> for #2 and #3.
>
> Thanks,
> JV

> > - Sijie
> >
> > On Fri, Oct 6, 2017 at 9:35 AM, Venkateswara Rao Jujjuri <
> > jujj...@gmail.com> wrote:
> >
> > > > However, imagine that the fenced message is only in the journal on b2,
> > > > b2 crashes, something wipes the journal directory and then b2 comes
> > > > back up.
> > >
> > > In this case what happened?
> > > 1. We have WQ = 1
> > > 2. We had data loss (crashed and came back up clean)
> > >
> > > But yeah, in addition to data loss we have a fencing violation too.
> > > The problem is not just the wiped journal dir, but how we recognize the
> > > bookie. A bookie is just recognized by its ip address, not by its
> > > incarnation.
> > > Bookie1 at T1 (b1t1) and the same bookie1 at T2 after a bookie format
> > > (b1t2) should be two different bookies, shouldn't they? This is needed
> > > for the replication worker and the auditor too.
> > >
> > > Also, a bookie needs to know whether the writer/reader intends to read
> > > from b1t2, not from b1t1.
> > > Looks like we have a hole here? Or I may not be fully understanding the
> > > cookie verification mechanism.
> > >
> > > Also, as Ivan pointed out, we appear to treat the lack of a journal as
> > > implicitly a new bookie, but overall the cluster doesn't differentiate
> > > between incarnations.
> > >
> > > Thanks,
> > > JV
> > >
> > > On Fri, Oct 6, 2017 at 8:46 AM, Ivan Kelly <iv...@apache.org> wrote:
> > >
> > > > > The case you described here is "almost correct". But there is a key
> > > > > point here: B2 can't start up if the journal disk is wiped out,
> > > > > because the cookie is missing.
> > > >
> > > > This is what I expected to see, but it isn't the case.
> > > >
> > > > <snip>
> > > > List<Cookie> journalCookies = Lists.newArrayList();
> > > > // try to read cookie from journal directory.
> > > > for (File journalDirectory : journalDirectories) {
> > > >     try {
> > > >         Cookie journalCookie =
> > > >             Cookie.readFromDirectory(journalDirectory);
> > > >         journalCookies.add(journalCookie);
> > > >         if (journalCookie.isBookieHostCreatedFromIp()) {
> > > >             conf.setUseHostNameAsBookieID(false);
> > > >         } else {
> > > >             conf.setUseHostNameAsBookieID(true);
> > > >         }
> > > >     } catch (FileNotFoundException fnf) {
> > > >         newEnv = true;
> > > >         missedCookieDirs.add(journalDirectory);
> > > >     }
> > > > }
> > > > </snip>
> > > >
> > > > So if a journal is missing the cookie, newEnv is set to true. This
> > > > disables the later checks.
> > > >
> > > > > However, it can still happen in a different case: a bit flip. In
> > > > > your case, if the fence bit on b2 is already persisted on disk but
> > > > > it got corrupted,
> > > Then > > > > it > > > > > will cause the issue you described. One problem is we don't have > > > checksum > > > > > on the index file header when it stores those fence bits. > > > > Yes, this is also an issue. > > > > > > > > -Ivan > > > > > > > > > > > > > > > > -- > > > Jvrao > > > --- > > > First they ignore you, then they laugh at you, then they fight you, > then > > > you win. - Mahatma Gandhi > > > > > > > > > -- > Jvrao > --- > First they ignore you, then they laugh at you, then they fight you, then > you win. - Mahatma Gandhi >