+1 for fixing the problem of missing cookie in 4.6

Who drives the issue?

Thank you all for the interesting points
Enrico

Il ven 6 ott 2017, 21:27 Venkateswara Rao Jujjuri <jujj...@gmail.com> ha
scritto:

> Thanks for the writeup Sijie, comments below.
>
> On Fri, Oct 6, 2017 at 12:14 PM, Sijie Guo <guosi...@gmail.com> wrote:
>
> > I think the question is mainly around "how do we recognize the bookie" or
> > "incarnations". And the purpose of a cookie is designed for addressing
> > "incarnations".
> >
> > I will try to cover following aspects, and will try to answer questions
> > that Ivan and JV raised.
> >
> > - what is cookie?
> > - how the behavior became bad?
> > - how do we fix current bad behavior?
> > - is the cookie enough?
> >
> >
> > *What is Cookie?*
> >
> > Cookie is originally introduced in this commit -
> >
> https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf
> > 3d8d7a7eb5
> > .
> >
> > A cookie is a identifier of a bookie. A cookie is created on zookeeper
> when
> > a brand new bookie joint the cluster, the cookie is representing the
> bookie
> > instance
> > during its lifecycle. The cookie is stored on all the disks for
> > verification purpose. so if any of the disks misses the cookie (e.g.
> disks
> > were reformat or wiped out,
> > disks are not mounted correctly), a bookie will reject to start.
> >
> >
> > *How the behavior became bad?*
> >
> > The original behavior worked as expected to use the cookie in zookeeper
> as
> > the source of truth. See
> >
> https://github.com/apache/bookkeeper/commit/c6cc7cca3a85603c8e935ba6d06fbf
> > 3d8d7a7eb5
> >
> >
> > The behavior was changed at
> >
> https://github.com/apache/bookkeeper/commit/19b821c63b91293960041bca7b0316
> > 14a109a7b8
> > when trying to support both ip and hostname . It used journal directory
> as
> > the source-of-truth for verifying cookies.
> >
> > At the community meeting, I was saying a bookie should reject start when
> a
> > cookie file is missing locally and that was my operational experience. It
> > turns out twitter's branch didn't include the change at
> > 19b821c63b91293960041bca7b031614a109a7b8,
> > so it was still the original behavior at
> > c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5 .
> >
> > *How do we fix current bad behavior?*
> >
> > We basically need to revert the current behaviour to the original
> designed
> > behavior. The cookie in zookeeper should be the source-of-truth for
> > validation.
> >
> > If the cookie works as expected (change the behavior to the original
> > behavior), then it is the operational or lifecycle management issue I
> > explained above.
> >
> > If a bookie failed with missing cookie, it should be:
> >
> > 1. taken out of the cluster
> > 2. run re-replication (autorecovery or manual recovery)
> > 3. ensure no ledgers using this bookie any more
> > 4. reformat the bookie
> > 5. add it back
> >
> > This can be automated by hooking into a scheduler (like k8s or mesos).
> But
> > it requires some sort of lifecycle management in order to automate such
> > operations. There is a BP-4:
> > https://cwiki.apache.org/confluence/display/BOOKKEEPER/
> > BP-4+-+BookKeeper+Lifecycle+Management
> > proposed for this purpose.
> >
> >
> > *Is the cookie enough?*
> >
> > Cookie (if we revert the current behavior to the original behavior),
> should
> > be able to address most of the issues related to "incarnations".
> >
> > There are still some corner cases will violate correctness issues. They
> are
> > related to "dangling writers" described in Ivan's first comment.
> >
> > How can a writer tell whether bookies changed or ledger changed when it
> > gets network partitioned?
> >
> > 1) Bookie Changed.
> >
> > Bookie can be reformatted and re-added to the cluster. Ivan and JV
> already
> > touch this on adding UUID.
> >
> > I think the UUID doesn't have to be part of ledger metadata. because
> > auditor and replication worker would use the lifecycle management for
> > managing the lifecycle of bookies.
> >
>
> You are suggesting that the 'manual/scripted' lifecycle tool is to the
> rescue.
> a side cart solution.
>
> But what are we saving by not keeping this info in the metadata?
> metadata size? sure it is a huge win in ZK environment.
>
> >
> > But the connection should have the UUID informations.
> >
>
> By this you are suggesting  service discovery portion need to have UUID
> info
> but not metadata portion. Won't it be confusing to handle a case where
> write fails
> on bookie because of UUID mismatch, and you may need to handle that case
> and if you go back to the same bookie then no ensmeble changes.
>
> On the other hand if we introduce UUID into metadata, then we don't need to
> be
> explicitly depend on the side-cart solution.
>
>
>
> > Basically, any bookie client connects to a bookie, it needs to carry the
> > namespace uuid and the bookie uuid to ensure bookie is connecting to a
> > right bookie. This would prevent "dangling writers" connect to bookies
> that
> > are reformatted and added back.
> >
> >  While this is an issue, the problem can only get exposed in pathological
> scenario
> where AQ bookies have went through this scenario, which is ~ 3
>
>
> 2) Ledger Changed.
> >
> > It is similar as what the case that Ivan' described. If a writer becomes
> > 'network partitioned', and the ledger is deleted during this period,
> after
> > the writer comes back, the writer can still successfully write entries to
> > the bookies, because the ledgers are already deleted and all the fencing
> > bits are gone.
> >
> > This violates the expectation of "fencing". but I am not sure we need to
> > spend time on fixing this, because the ledger is already explicitly
> deleted
> > by the application. so I think the behavior should be categorized as
> > "undefined", just like "deleting a ledger when a writer is still writing
> > entries" is a undefined behavior.
> >
> >
> > To summarize my thought on this:
> >
> > 1. we need to revert the cookie behaviour to the original behavior. make
> > sure the cookie works as expected.
> > 2. introduce UUID or epoch in the cookie. client connection should carry
> > namespace uuid and bookie uuid when establishing the connection.
> > 3. work on BP-4 to have a complete lifecycle management to take bookie
> out
> > and add bookie out.
> >
> > 1 is the immediate fix, so correct operations can still guarantee the
> > correctness.
> >
>
> I agree we need to take care of #1 ASAP and have a Issues opened and
> designs for #2 and #3.
>
> Thanks,
> JV
>
> >
> > - Sijie
> >
> >
> >
> > On Fri, Oct 6, 2017 at 9:35 AM, Venkateswara Rao Jujjuri <
> > jujj...@gmail.com>
> > wrote:
> >
> > > > However, imagine that the fenced message is only in the journal on
> b2,
> > > > b2 crashes, something wipes the journal directory and then b2 comes
> > > > back up.
> > >
> > > In this case what happened?
> > > 1. We have WQ = 1
> > > 2. We had data loss (crash and comeup clean)
> > >
> > > But yeah, in addition to dataloss we have fencing violation too.
> > > The problem is not just wiped journal dir, but how we recognize the
> > bookie.
> > > Bookie is just recognized by its ip address, not by its incarnation.
> > > Bookie1 at T1  (b1t1) ; and same bookie1 at T2 after bookie format
> (b1t2)
> > > should be two different bookies, isn;t it?
> > > this is needed for the replication worker and the auditor too.
> > >
> > > Also, bookie needs to know if the writer/reader is intended to read
> from
> > > b1t2 not from b1t1.
> > > Looks like we have a hole here? Or I may not be fully understanding
> > cookie
> > > verification mechanism.
> > >
> > > Also as Ivan pointed out, we appear to think the lack of journal is
> > > implicitly a new bookie, but overall cluster doesn't differentiate
> > between
> > > incarnations.
> > >
> > > Thanks,
> > > JV
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Oct 6, 2017 at 8:46 AM, Ivan Kelly <iv...@apache.org> wrote:
> > >
> > > > > The case you described here is "almost correct". But there is an
> key
> > > > here:
> > > > > B2 can't startup itself if journal disk is wiped out, because the
> > > cookie
> > > > is
> > > > > missed.
> > > > This is what I expected to see, but isn't the case.
> > > > <snip>
> > > >       List<Cookie> journalCookies = Lists.newArrayList();
> > > >             // try to read cookie from journal directory.
> > > >             for (File journalDirectory : journalDirectories) {
> > > >                 try {
> > > >                     Cookie journalCookie =
> > > > Cookie.readFromDirectory(journalDirectory);
> > > >                     journalCookies.add(journalCookie);
> > > >                     if (journalCookie.isBookieHostCreatedFromIp()) {
> > > >                         conf.setUseHostNameAsBookieID(false);
> > > >                     } else {
> > > >                         conf.setUseHostNameAsBookieID(true);
> > > >                     }
> > > >                 } catch (FileNotFoundException fnf) {
> > > >                     newEnv = true;
> > > >                     missedCookieDirs.add(journalDirectory);
> > > >                 }
> > > >             }
> > > > </snip>
> > > >
> > > > So if a journal is missing the cookie, newEnv is set to true. This
> > > > disabled the later checks.
> > > >
> > > > > Hower it can still happen in a different case: bit flap. In your
> > case,
> > > if
> > > > > fence bit in b2 is already persisted on disk, but it got corrupted.
> > > Then
> > > > it
> > > > > will cause the issue you described. One problem is we don't have
> > > checksum
> > > > > on the index file header when it stores those fence bits.
> > > > Yes, this is also an issue.
> > > >
> > > > -Ivan
> > > >
> > >
> > >
> > >
> > > --
> > > Jvrao
> > > ---
> > > First they ignore you, then they laugh at you, then they fight you,
> then
> > > you win. - Mahatma Gandhi
> > >
> >
>
>
>
> --
> Jvrao
> ---
> First they ignore you, then they laugh at you, then they fight you, then
> you win. - Mahatma Gandhi
>
-- 


-- Enrico Olivelli

Reply via email to