2017-10-09 9:21 GMT+02:00 Sijie Guo <guosi...@gmail.com>:

> okay, but why do you want to track the list of files? I don't get your idea
> here.
>


If you allow a bookie to start with a journal directory which contains the
cookie file but without the other files the bookie thinks that have been
persisted durably you will fall into the correctness issue we are talking
about, you will lose fence bits for instance.
So having a directory which contains the cookie flie is not enough to say
that the bookie is in good status.

-- Enrico





>
> - Sijie
>
> On Sun, Oct 8, 2017 at 11:45 PM, Enrico Olivelli <eolive...@gmail.com>
> wrote:
>
> > 2017-10-09 7:52 GMT+02:00 Sijie Guo <guosi...@gmail.com>:
> >
> > > On Sat, Oct 7, 2017 at 9:53 AM, Enrico Olivelli <eolive...@gmail.com>
> > > wrote:
> > >
> > > > Il sab 7 ott 2017, 00:27 Sijie Guo <guosi...@gmail.com> ha scritto:
> > > >
> > > > > Enrico,
> > > > >
> > > > > Let's try to come to a conclusion or an agreement what we should
> fix
> > > and
> > > > > improve, before talking who is going to drive this.
> > > > >
> > > >
> > > > Sure.
> > > >
> > > > This is my point of view:
> > > > View have separate issues:
> > > > - missing checksums, to protect fence bits
> > > > - have a bug in bookie boot, we should not allow empty directories
> > > > - have a clear lifecycle for the bookie, add/remove
> > > > - deal with reincarnation of bookies
> > > > - ensuring the correctness of the contents of the directories of the
> > > bookie
> > > >
> > > > I would like to add a new point, we have rhe cookie inside every
> > > configured
> > > > directory managed by the bookie.
> > > > No cookie -> no boot
> > > > This will not be enough, we have to write in that file not only the
> > > > identity of the bookie but the list of files expected to be in the
> > > > directory.
> > > > This way you will not boot with a corrupted directory.
> > > > Config ->  list of dirs -> list of files
> > > >
> > >
> > > I am not sure why this is a new point. This is exactly what cookie is
> > > doing, no?
> > >
> >
> > Sorry, I can't find such behavior in code on master brach
> > https://github.com/apache/bookkeeper/blob/master/
> > bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/Cookie.java
> >
> > I we have a copy of the cookie inside each directory (index + data +
> > journal) I mean that each file should carry the exact list of files
> > expected to be present in the directory at boot.
> > So for instance when you add a new file to the set of files on a journal
> > directory you must update the file in that directory, same for index,
> > data.....
> >
> > Maybe I am missing something.
> > It seems to me that cookie contains only a list a of directories not of
> > "files"
> >
> > Enrico
> >
> >
> >
> >
> > >
> > >
> > > >
> > > > I agree on the fact that the bookie should be added (bookie format)
> > only
> > > if
> > > > there is no reference to it in zk.
> > > > The bookie format operation should write the cookie in any configured
> > > > directory so that a bookie with empty directories won't ever start.
> > > >
> > > > I have to think more about this, but I wanted to share my first
> > thoughts
> > > >
> > > > Enrico
> > > >
> > > >
> > > > > - Sijie
> > > > >
> > > > > On Fri, Oct 6, 2017 at 1:14 PM, Enrico Olivelli <
> eolive...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > +1 for fixing the problem of missing cookie in 4.6
> > > > > >
> > > > > > Who drives the issue?
> > > > > >
> > > > > > Thank you all for the interesting points
> > > > > > Enrico
> > > > > >
> > > > > > Il ven 6 ott 2017, 21:27 Venkateswara Rao Jujjuri <
> > jujj...@gmail.com
> > > >
> > > > ha
> > > > > > scritto:
> > > > > >
> > > > > > > Thanks for the writeup Sijie, comments below.
> > > > > > >
> > > > > > > On Fri, Oct 6, 2017 at 12:14 PM, Sijie Guo <guosi...@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > I think the question is mainly around "how do we recognize
> the
> > > > > bookie"
> > > > > > or
> > > > > > > > "incarnations". And the purpose of a cookie is designed for
> > > > > addressing
> > > > > > > > "incarnations".
> > > > > > > >
> > > > > > > > I will try to cover following aspects, and will try to answer
> > > > > questions
> > > > > > > > that Ivan and JV raised.
> > > > > > > >
> > > > > > > > - what is cookie?
> > > > > > > > - how the behavior became bad?
> > > > > > > > - how do we fix current bad behavior?
> > > > > > > > - is the cookie enough?
> > > > > > > >
> > > > > > > >
> > > > > > > > *What is Cookie?*
> > > > > > > >
> > > > > > > > Cookie is originally introduced in this commit -
> > > > > > > >
> > > > > > > https://github.com/apache/bookkeeper/commit/
> > > > > > c6cc7cca3a85603c8e935ba6d06fbf
> > > > > > > > 3d8d7a7eb5
> > > > > > > > .
> > > > > > > >
> > > > > > > > A cookie is a identifier of a bookie. A cookie is created on
> > > > > zookeeper
> > > > > > > when
> > > > > > > > a brand new bookie joint the cluster, the cookie is
> > representing
> > > > the
> > > > > > > bookie
> > > > > > > > instance
> > > > > > > > during its lifecycle. The cookie is stored on all the disks
> for
> > > > > > > > verification purpose. so if any of the disks misses the
> cookie
> > > > (e.g.
> > > > > > > disks
> > > > > > > > were reformat or wiped out,
> > > > > > > > disks are not mounted correctly), a bookie will reject to
> > start.
> > > > > > > >
> > > > > > > >
> > > > > > > > *How the behavior became bad?*
> > > > > > > >
> > > > > > > > The original behavior worked as expected to use the cookie in
> > > > > zookeeper
> > > > > > > as
> > > > > > > > the source of truth. See
> > > > > > > >
> > > > > > > https://github.com/apache/bookkeeper/commit/
> > > > > > c6cc7cca3a85603c8e935ba6d06fbf
> > > > > > > > 3d8d7a7eb5
> > > > > > > >
> > > > > > > >
> > > > > > > > The behavior was changed at
> > > > > > > >
> > > > > > > https://github.com/apache/bookkeeper/commit/
> > > > > > 19b821c63b91293960041bca7b0316
> > > > > > > > 14a109a7b8
> > > > > > > > when trying to support both ip and hostname . It used journal
> > > > > directory
> > > > > > > as
> > > > > > > > the source-of-truth for verifying cookies.
> > > > > > > >
> > > > > > > > At the community meeting, I was saying a bookie should reject
> > > start
> > > > > > when
> > > > > > > a
> > > > > > > > cookie file is missing locally and that was my operational
> > > > > experience.
> > > > > > It
> > > > > > > > turns out twitter's branch didn't include the change at
> > > > > > > > 19b821c63b91293960041bca7b031614a109a7b8,
> > > > > > > > so it was still the original behavior at
> > > > > > > > c6cc7cca3a85603c8e935ba6d06fbf3d8d7a7eb5 .
> > > > > > > >
> > > > > > > > *How do we fix current bad behavior?*
> > > > > > > >
> > > > > > > > We basically need to revert the current behaviour to the
> > original
> > > > > > > designed
> > > > > > > > behavior. The cookie in zookeeper should be the
> source-of-truth
> > > for
> > > > > > > > validation.
> > > > > > > >
> > > > > > > > If the cookie works as expected (change the behavior to the
> > > > original
> > > > > > > > behavior), then it is the operational or lifecycle management
> > > > issue I
> > > > > > > > explained above.
> > > > > > > >
> > > > > > > > If a bookie failed with missing cookie, it should be:
> > > > > > > >
> > > > > > > > 1. taken out of the cluster
> > > > > > > > 2. run re-replication (autorecovery or manual recovery)
> > > > > > > > 3. ensure no ledgers using this bookie any more
> > > > > > > > 4. reformat the bookie
> > > > > > > > 5. add it back
> > > > > > > >
> > > > > > > > This can be automated by hooking into a scheduler (like k8s
> or
> > > > > mesos).
> > > > > > > But
> > > > > > > > it requires some sort of lifecycle management in order to
> > > automate
> > > > > such
> > > > > > > > operations. There is a BP-4:
> > > > > > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/
> > > > > > > > BP-4+-+BookKeeper+Lifecycle+Management
> > > > > > > > proposed for this purpose.
> > > > > > > >
> > > > > > > >
> > > > > > > > *Is the cookie enough?*
> > > > > > > >
> > > > > > > > Cookie (if we revert the current behavior to the original
> > > > behavior),
> > > > > > > should
> > > > > > > > be able to address most of the issues related to
> > "incarnations".
> > > > > > > >
> > > > > > > > There are still some corner cases will violate correctness
> > > issues.
> > > > > They
> > > > > > > are
> > > > > > > > related to "dangling writers" described in Ivan's first
> > comment.
> > > > > > > >
> > > > > > > > How can a writer tell whether bookies changed or ledger
> changed
> > > > when
> > > > > it
> > > > > > > > gets network partitioned?
> > > > > > > >
> > > > > > > > 1) Bookie Changed.
> > > > > > > >
> > > > > > > > Bookie can be reformatted and re-added to the cluster. Ivan
> and
> > > JV
> > > > > > > already
> > > > > > > > touch this on adding UUID.
> > > > > > > >
> > > > > > > > I think the UUID doesn't have to be part of ledger metadata.
> > > > because
> > > > > > > > auditor and replication worker would use the lifecycle
> > management
> > > > for
> > > > > > > > managing the lifecycle of bookies.
> > > > > > > >
> > > > > > >
> > > > > > > You are suggesting that the 'manual/scripted' lifecycle tool is
> > to
> > > > the
> > > > > > > rescue.
> > > > > > > a side cart solution.
> > > > > > >
> > > > > > > But what are we saving by not keeping this info in the
> metadata?
> > > > > > > metadata size? sure it is a huge win in ZK environment.
> > > > > > >
> > > > > > > >
> > > > > > > > But the connection should have the UUID informations.
> > > > > > > >
> > > > > > >
> > > > > > > By this you are suggesting  service discovery portion need to
> > have
> > > > UUID
> > > > > > > info
> > > > > > > but not metadata portion. Won't it be confusing to handle a
> case
> > > > where
> > > > > > > write fails
> > > > > > > on bookie because of UUID mismatch, and you may need to handle
> > that
> > > > > case
> > > > > > > and if you go back to the same bookie then no ensmeble changes.
> > > > > > >
> > > > > > > On the other hand if we introduce UUID into metadata, then we
> > don't
> > > > > need
> > > > > > to
> > > > > > > be
> > > > > > > explicitly depend on the side-cart solution.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Basically, any bookie client connects to a bookie, it needs
> to
> > > > carry
> > > > > > the
> > > > > > > > namespace uuid and the bookie uuid to ensure bookie is
> > connecting
> > > > to
> > > > > a
> > > > > > > > right bookie. This would prevent "dangling writers" connect
> to
> > > > > bookies
> > > > > > > that
> > > > > > > > are reformatted and added back.
> > > > > > > >
> > > > > > > >  While this is an issue, the problem can only get exposed in
> > > > > > pathological
> > > > > > > scenario
> > > > > > > where AQ bookies have went through this scenario, which is ~ 3
> > > > > > >
> > > > > > >
> > > > > > > 2) Ledger Changed.
> > > > > > > >
> > > > > > > > It is similar as what the case that Ivan' described. If a
> > writer
> > > > > > becomes
> > > > > > > > 'network partitioned', and the ledger is deleted during this
> > > > period,
> > > > > > > after
> > > > > > > > the writer comes back, the writer can still successfully
> write
> > > > > entries
> > > > > > to
> > > > > > > > the bookies, because the ledgers are already deleted and all
> > the
> > > > > > fencing
> > > > > > > > bits are gone.
> > > > > > > >
> > > > > > > > This violates the expectation of "fencing". but I am not sure
> > we
> > > > need
> > > > > > to
> > > > > > > > spend time on fixing this, because the ledger is already
> > > explicitly
> > > > > > > deleted
> > > > > > > > by the application. so I think the behavior should be
> > categorized
> > > > as
> > > > > > > > "undefined", just like "deleting a ledger when a writer is
> > still
> > > > > > writing
> > > > > > > > entries" is a undefined behavior.
> > > > > > > >
> > > > > > > >
> > > > > > > > To summarize my thought on this:
> > > > > > > >
> > > > > > > > 1. we need to revert the cookie behaviour to the original
> > > behavior.
> > > > > > make
> > > > > > > > sure the cookie works as expected.
> > > > > > > > 2. introduce UUID or epoch in the cookie. client connection
> > > should
> > > > > > carry
> > > > > > > > namespace uuid and bookie uuid when establishing the
> > connection.
> > > > > > > > 3. work on BP-4 to have a complete lifecycle management to
> take
> > > > > bookie
> > > > > > > out
> > > > > > > > and add bookie out.
> > > > > > > >
> > > > > > > > 1 is the immediate fix, so correct operations can still
> > guarantee
> > > > the
> > > > > > > > correctness.
> > > > > > > >
> > > > > > >
> > > > > > > I agree we need to take care of #1 ASAP and have a Issues
> opened
> > > and
> > > > > > > designs for #2 and #3.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > JV
> > > > > > >
> > > > > > > >
> > > > > > > > - Sijie
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Oct 6, 2017 at 9:35 AM, Venkateswara Rao Jujjuri <
> > > > > > > > jujj...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > > However, imagine that the fenced message is only in the
> > > journal
> > > > > on
> > > > > > > b2,
> > > > > > > > > > b2 crashes, something wipes the journal directory and
> then
> > b2
> > > > > comes
> > > > > > > > > > back up.
> > > > > > > > >
> > > > > > > > > In this case what happened?
> > > > > > > > > 1. We have WQ = 1
> > > > > > > > > 2. We had data loss (crash and comeup clean)
> > > > > > > > >
> > > > > > > > > But yeah, in addition to dataloss we have fencing violation
> > > too.
> > > > > > > > > The problem is not just wiped journal dir, but how we
> > recognize
> > > > the
> > > > > > > > bookie.
> > > > > > > > > Bookie is just recognized by its ip address, not by its
> > > > > incarnation.
> > > > > > > > > Bookie1 at T1  (b1t1) ; and same bookie1 at T2 after bookie
> > > > format
> > > > > > > (b1t2)
> > > > > > > > > should be two different bookies, isn;t it?
> > > > > > > > > this is needed for the replication worker and the auditor
> > too.
> > > > > > > > >
> > > > > > > > > Also, bookie needs to know if the writer/reader is intended
> > to
> > > > read
> > > > > > > from
> > > > > > > > > b1t2 not from b1t1.
> > > > > > > > > Looks like we have a hole here? Or I may not be fully
> > > > understanding
> > > > > > > > cookie
> > > > > > > > > verification mechanism.
> > > > > > > > >
> > > > > > > > > Also as Ivan pointed out, we appear to think the lack of
> > > journal
> > > > is
> > > > > > > > > implicitly a new bookie, but overall cluster doesn't
> > > > differentiate
> > > > > > > > between
> > > > > > > > > incarnations.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > JV
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Oct 6, 2017 at 8:46 AM, Ivan Kelly <
> iv...@apache.org
> > >
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > > The case you described here is "almost correct". But
> > there
> > > is
> > > > > an
> > > > > > > key
> > > > > > > > > > here:
> > > > > > > > > > > B2 can't startup itself if journal disk is wiped out,
> > > because
> > > > > the
> > > > > > > > > cookie
> > > > > > > > > > is
> > > > > > > > > > > missed.
> > > > > > > > > > This is what I expected to see, but isn't the case.
> > > > > > > > > > <snip>
> > > > > > > > > >       List<Cookie> journalCookies = Lists.newArrayList();
> > > > > > > > > >             // try to read cookie from journal directory.
> > > > > > > > > >             for (File journalDirectory :
> > journalDirectories)
> > > {
> > > > > > > > > >                 try {
> > > > > > > > > >                     Cookie journalCookie =
> > > > > > > > > > Cookie.readFromDirectory(journalDirectory);
> > > > > > > > > >                     journalCookies.add(journalCookie);
> > > > > > > > > >                     if
> > > > > (journalCookie.isBookieHostCreatedFromIp())
> > > > > > {
> > > > > > > > > >                         conf.setUseHostNameAsBookieID(
> > > false);
> > > > > > > > > >                     } else {
> > > > > > > > > >                         conf.setUseHostNameAsBookieID(
> > true);
> > > > > > > > > >                     }
> > > > > > > > > >                 } catch (FileNotFoundException fnf) {
> > > > > > > > > >                     newEnv = true;
> > > > > > > > > >                     missedCookieDirs.add(
> > journalDirectory);
> > > > > > > > > >                 }
> > > > > > > > > >             }
> > > > > > > > > > </snip>
> > > > > > > > > >
> > > > > > > > > > So if a journal is missing the cookie, newEnv is set to
> > true.
> > > > > This
> > > > > > > > > > disabled the later checks.
> > > > > > > > > >
> > > > > > > > > > > Hower it can still happen in a different case: bit
> flap.
> > In
> > > > > your
> > > > > > > > case,
> > > > > > > > > if
> > > > > > > > > > > fence bit in b2 is already persisted on disk, but it
> got
> > > > > > corrupted.
> > > > > > > > > Then
> > > > > > > > > > it
> > > > > > > > > > > will cause the issue you described. One problem is we
> > don't
> > > > > have
> > > > > > > > > checksum
> > > > > > > > > > > on the index file header when it stores those fence
> bits.
> > > > > > > > > > Yes, this is also an issue.
> > > > > > > > > >
> > > > > > > > > > -Ivan
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Jvrao
> > > > > > > > > ---
> > > > > > > > > First they ignore you, then they laugh at you, then they
> > fight
> > > > you,
> > > > > > > then
> > > > > > > > > you win. - Mahatma Gandhi
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jvrao
> > > > > > > ---
> > > > > > > First they ignore you, then they laugh at you, then they fight
> > you,
> > > > > then
> > > > > > > you win. - Mahatma Gandhi
> > > > > > >
> > > > > > --
> > > > > >
> > > > > >
> > > > > > -- Enrico Olivelli
> > > > > >
> > > > >
> > > > --
> > > >
> > > >
> > > > -- Enrico Olivelli
> > > >
> > >
> >
>

Reply via email to