Hi Joe, I will improve the PIP, and then send it back to the email thread soon.
-Penghui
On Feb 25, 2020 at 4:41 AM +0800, Sijie Guo <guosi...@gmail.com> wrote:
> On Mon, Feb 24, 2020 at 11:33 AM Joe F <joefranc...@gmail.com> wrote:
>
> > Sijie, Penghui,
> >
> > Thank you. Can we get the PIP to be a more detailed write-up? I would
> > like this PIP to be more comprehensive.
>
> I think Penghui can work on improving this to make this PIP more
> comprehensive.
>
> It is good to start with adding enough background and context to state
> what guarantees we currently have.
>
> > > Hence we need to draw an agreement on understanding WHAT actually
> > > guarantees correctness in the current Pulsar design. We can then
> > > move forward with a conclusion about how to do it.
> >
> > That would be great. I have been thinking recently about whether we can
> > formalize the system through a TLA model. It would be ideal, but it will
> > also require BK to provide one. Whether we use TLA or not, we should
> > have an understanding of WHAT actually guarantees correctness in the
> > current Pulsar design _written down_. At least then we will have a model
> > against which we can align changes and fixes.
>
> TLA is a super great idea to start. It has to be done at both the storage
> (bookkeeper) level and the broker level.
>
> Since we used to compare bookkeeper/distributedlog and raft, I think
> bookkeeper/ML is pretty close to RAFT. If there is a TLA model for RAFT,
> it should be a good starting point to follow. If you are planning to
> start the work, happy to join forces.
>
> - Sijie
>
> > Joe
> >
> > On Sun, Feb 23, 2020 at 11:44 PM Sijie Guo <guosi...@gmail.com> wrote:
> >
> > > Sorry for the late reply.
> > >
> > > Joe - There are two things I would like to clarify first.
> > >
> > > 1) I think you have a misunderstanding about the zookeeper lock
> > > ("ephemeral znode") and bookkeeper/ML fencing. Let's step back to
> > > understand the current Pulsar behavior first.
> > >
> > > - A zookeeper lock doesn't prevent a dual-writer situation from
> > > happening. A simple case - Node A creates a lock and Node B stands
> > > by. If the lock (the ephemeral znode) expires at zookeeper, Node B
> > > acquires the lock and becomes the owner. But Node A might NOT receive
> > > the session-expire notification because of a network partition, hence
> > > A still thinks it is the owner. So there is a duration during which
> > > both A and B think they are the owners.
> > >
> > > - The correctness of using a zookeeper lock has to be gated by the
> > > exclusiveness of a resource. In ML, the exclusiveness is provided by
> > > the single-writer semantics offered by bookkeeper and the CAS
> > > operations offered by zookeeper.
> > >
> > > So a zookeeper lock (or an external locking mechanism) only ensures
> > > "stable" ownership of a resource over a long duration, but it doesn't
> > > prevent dual ownership. The resource itself has to provide a
> > > mechanism to ensure exclusiveness. BookKeeper/ML does that via
> > > fencing. HBase uses ACLs to achieve "fencing" for regions.
> > >
> > > Martin Kleppmann wrote a well-written blog post about this that is
> > > worth checking out:
> > >
> > > https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
> > >
> > > 2) Increasing the session timeout doesn't prevent session expiry.
> > > Session expiry can still happen when the zookeeper leader crashes,
> > > the client pauses, or there are network hiccups. Increasing the
> > > session timeout also introduces side effects: the broker failover
> > > time increases as well, which means topics are unavailable for longer
> > > when a broker goes down.
> > >
> > > ---
> > >
> > > Penghui, Joe,
> > >
> > > I think there are multiple things coupled in this discussion related
> > > to zookeeper.
> > >
> > > ZooKeeper is mainly used in two places:
> > >
> > > 1) "ownership" for bundles and service discovery for "brokers".
> > > It uses ephemeral znodes. They are session related and will be
> > > impacted by session expiries.
> > >
> > > 2) metadata management for both policies and ML. It uses persistent
> > > znodes. They are not session related, but unfortunately they are also
> > > impacted by session expiries, because zookeeper ties session
> > > management to connection management.
> > >
> > > For 2), it is safe to retry creating a zookeeper client to establish
> > > a new session when the session expires. We can set the session expiry
> > > time as high as we want, since it doesn't impact failover time.
> > >
> > > So the main discussion should be about 1) - whether it is safe to
> > > re-establish a zookeeper session to re-acquire bundles after the
> > > previous session has expired. Hence we need to draw an agreement on
> > > understanding WHAT actually guarantees correctness in the current
> > > Pulsar design. We can then move forward with a conclusion about how
> > > to do it.
> > >
> > > - Sijie
> > >
> > > On Sat, Feb 22, 2020 at 11:40 PM Joe F <joefranc...@gmail.com> wrote:
> > >
> > > > On Sat, Feb 22, 2020 at 6:28 PM PengHui Li
> > > > <codelipeng...@gmail.com> wrote:
> > > >
> > > > > Hi, Joe
> > > > >
> > > > > The fundamental correctness is guaranteed by the fencing
> > > > > mechanism provided by Apache BookKeeper and the CAS operations
> > > > > provided by the metadata storage. Both fencing and CAS operations
> > > > > will prevent two owners updating data or metadata at the same
> > > > > time.
> > > >
> > > > This may be, as I said: "It may be possible that underlying
> > > > lower-level locks may prevent catastrophe, but that does not
> > > > validate this violation of basic principles." I am all too familiar
> > > > with how BK works, and how CAS works for ML metadata storage, and I
> > > > know where all the bodies are buried.
> > > >
> > > > > This default shutdown behavior isn't changed.
> > > > > We just introduce an alternative way to improve stability when
> > > > > Zookeeper is doing leader election or becomes unavailable.
> > > >
> > > > This is just a claim. I would argue that it does the opposite.
> > > >
> > > > > According to the following rules, I think this will not break
> > > > > current system principles.
> > > > >
> > > > > 1. If the znode for the bundle is deleted, this is consistent
> > > > > with the current behavior. The broker who acquires the lock first
> > > > > will become the owner.
> > > > > 2. If the znode for the bundle is not deleted, other brokers are
> > > > > also unable to acquire the lock. Both the broker that re-creates
> > > > > the session and other brokers need to wait for the znode
> > > > > deletion. Then the broker who acquires the lock first will become
> > > > > the owner.
> > > > > 3. If the bundle ownership changed, the broker that re-creates
> > > > > the session is unable to acquire the lock, so it should unload
> > > > > the bundle. This is also consistent with the current
> > > > > ownership-change behavior.
> > > > > 4. Also, if unexpected exceptions are thrown during the re-own
> > > > > process, the broker needs to shut itself down.
> > > >
> > > > I don't think you address the issue I have raised. Say 30 secs is
> > > > the timeout. Let us say broker B1 lost its connection at t. Then B1
> > > > loses the session at t+30 secs. With your logic, B1 continues to
> > > > service the topic as if it still owns it. Meanwhile B2 acquires the
> > > > topic at t+31 and loses its connection at t+32 (and loses its
> > > > session at t+62). At t+62 B3 acquires it, and loses its connection
> > > > at t+63. Now B4 acquires it. B4 crashes.
> > > > Now the original broker B1 reacquires the session and merrily goes
> > > > on operating as if nothing occurred in the meantime (and so could
> > > > B2 and B3).
> > > >
> > > > All fine, as you say, because of lower-level locks in BK and ML to
> > > > prevent catastrophe...
> > > >
> > > > If you want to make the case that bundle ownership does not
> > > > guarantee underlying topic ownership, and topic ownership is
> > > > arbitrated by BK/ML(metadata), then explicitly make that case. Then
> > > > we can debate the merits of that, and see if the code and design
> > > > allow for it. Because as it is, that is not how Pulsar is designed.
> > > > Right now, topic ownership is arbitrated by the bundle lock. This
> > > > is not a change that should casually be slipped into the system.
> > > >
> > > > And my original question still stands - if this session loss is
> > > > such an issue for some use cases, why not raise the session
> > > > timeout? The broker can safely keep the session for longer. That's
> > > > far preferable to running the risk of doing this.
> > > >
> > > > > Thanks,
> > > > > Penghui
> > > > >
> > > > > On Feb 22, 2020, 12:27 +0800, PengHui Li
> > > > > <codelipeng...@gmail.com>, wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > I have drafted a proposal for improving the broker's Zookeeper
> > > > > > session timeout handling. You can find it at
> > > > > >
> > > > > > https://github.com/apache/pulsar/wiki/PIP-57%3A-Improve-Broker%27s-Zookeeper-Session-Timeout-Handling
> > > > > >
> > > > > > I have also copied it into the email thread for easier viewing.
> > > > > > Any suggestions or ideas are welcome; please join the
> > > > > > discussion.
> > > > > >
> > > > > > PIP 57: Improve Broker's Zookeeper Session Timeout Handling
> > > > > >
> > > > > > Motivation
> > > > > >
> > > > > > In Pulsar, brokers use Zookeeper as the configuration store and
> > > > > > for maintaining broker metadata.
> > > > > > We can also call them the Global Zookeeper and the Local
> > > > > > Zookeeper.
> > > > > >
> > > > > > The Global Zookeeper maintains the namespace policies, cluster
> > > > > > metadata, and partitioned topic metadata. To reduce read
> > > > > > operations on Zookeeper, each broker has a cache for the global
> > > > > > Zookeeper. The Global Zookeeper cache updates when znodes
> > > > > > change. Currently, when a session timeout happens on the global
> > > > > > Zookeeper, a new session starts. The broker does not create any
> > > > > > EPHEMERAL znodes on the global Zookeeper.
> > > > > >
> > > > > > The Local Zookeeper maintains the local cluster metadata, such
> > > > > > as broker load data, topic ownership data, managed ledger
> > > > > > metadata, and Bookie rack information. Both broker load data
> > > > > > and topic ownership data are stored as EPHEMERAL znodes on the
> > > > > > Local Zookeeper. Currently, when a session timeout happens on
> > > > > > the Local Zookeeper, the broker shuts itself down.
> > > > > >
> > > > > > Shutting down a broker results in an ownership change for the
> > > > > > topics that the broker owned. However, we have encountered many
> > > > > > problems related to the current session timeout handling, such
> > > > > > as a broker with a long JVM GC pause, or the Local Zookeeper
> > > > > > under high load. The latter in particular may cause all brokers
> > > > > > to shut down.
> > > > > >
> > > > > > So, the purpose of this proposal is to improve session timeout
> > > > > > handling on the Local Zookeeper to avoid unnecessary broker
> > > > > > shutdowns.
> > > > > >
> > > > > > Approach
> > > > > >
> > > > > > As with the Global Zookeeper session timeout handling and the
> > > > > > Zookeeper session timeout handling in BookKeeper, a new session
> > > > > > should start when the present session times out.
> > > > > >
> > > > > > If a new session fails to start, the broker retries several
> > > > > > times. The number of retries depends on the broker
> > > > > > configuration.
> > > > > > If a session still cannot be started after all retries, the
> > > > > > broker still needs to be shut down, since this likely indicates
> > > > > > a problem with the Zookeeper cluster. The user needs to restart
> > > > > > the broker after the zookeeper cluster returns to normal.
> > > > > >
> > > > > > If a new session starts successfully, the situation is slightly
> > > > > > more complicated, so I will describe each scenario separately.
> > > > > >
> > > > > > Topic ownership data handling
> > > > > >
> > > > > > The topic ownership data tracks all namespace bundles owned by
> > > > > > the broker. An EPHEMERAL znode is created in Zookeeper for each
> > > > > > namespace bundle. When a session timeout happens on the local
> > > > > > Zookeeper, all of the EPHEMERAL znodes maintained by this
> > > > > > broker will be deleted automatically. We need some mechanism to
> > > > > > avoid unnecessary ownership transfer of the bundles. Since the
> > > > > > broker caches the owned bundles in memory, it can use the cache
> > > > > > to re-own the bundles.
> > > > > >
> > > > > > First, when the broker tries to re-own a bundle, if the znode
> > > > > > of the bundle exists in Zookeeper and the owner is this broker,
> > > > > > Zookeeper may simply not have deleted the znode yet. The broker
> > > > > > should check whether the ephemeral owner is the current session
> > > > > > ID. If not, the broker should wait for the znode deletion.
> > > > > >
> > > > > > Then the broker tries to own the bundle. If it succeeds, the
> > > > > > bundle is not owned by another broker, and the broker should
> > > > > > check whether to preload the topics under the bundle. If it
> > > > > > fails, the bundle is owned by another broker, and the broker
> > > > > > should unload the bundle.
> > > > > >
> > > > > > Theoretically, this mechanism can guarantee that the ownership
> > > > > > of most bundles will not change across the session timeout.
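[Editor's note] The per-bundle re-own decision described above can be sketched as a small state check. The following is a minimal, hypothetical Python model of that logic; `FakeZk` and every name in it are illustrative stand-ins, not actual Pulsar or ZooKeeper client APIs.

```python
# Hypothetical sketch of the bundle re-own decision after a new ZooKeeper
# session is established. FakeZk is an in-memory stand-in, NOT a real
# ZooKeeper client; all names here are made up for illustration.

class NodeExistsError(Exception):
    """Raised when an ephemeral znode already exists at the path."""

class FakeZk:
    """Tiny in-memory stand-in for a ZooKeeper client session."""
    def __init__(self, session_id):
        self.session_id = session_id
        # path -> (owner_broker, session_id_that_created_the_znode)
        self.znodes = {}

    def create_ephemeral(self, path, owner):
        if path in self.znodes:
            raise NodeExistsError(path)
        self.znodes[path] = (owner, self.session_id)

def re_own_bundle(zk, broker_id, bundle):
    """Decide what to do for one cached bundle after session re-creation.

    Returns "owned"  - this broker holds (or re-acquired) the bundle,
            "wait"   - a stale znode from the expired session still exists,
            "unload" - another broker owns the bundle now (rule 3).
    """
    path = "/namespace/" + bundle
    node = zk.znodes.get(path)
    if node is not None:
        owner, creator_session = node
        if owner == broker_id and creator_session == zk.session_id:
            return "owned"   # znode already belongs to the new session
        if owner == broker_id:
            return "wait"    # ZooKeeper has not yet deleted the old znode
        return "unload"      # ownership changed while we were disconnected
    try:
        zk.create_ephemeral(path, broker_id)  # race for the lock (rule 1)
        return "owned"       # we won: preload topics under the bundle
    except NodeExistsError:
        return "unload"      # another broker won the race (rule 2/3)
```

For example, if the new session has ID 2 and the znode still carries the expired session's ID 1 with this broker as owner, the function returns "wait"; once the znode is gone, a retry acquires the lock and returns "owned".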
> > > > > > Broker load data handling
> > > > > >
> > > > > > The load data is used for namespace bundle load balancing, so
> > > > > > its handling does not need to be overly complicated. The only
> > > > > > effect is that stale load data may interfere with the choice of
> > > > > > a candidate broker for a namespace bundle. Even if the optimal
> > > > > > broker is not selected, the load manager will continue to
> > > > > > relocate namespace bundles.
> > > > > >
> > > > > > So for broker load data handling, we only need to guarantee
> > > > > > that the broker can report its load data successfully.
> > > > > >
> > > > > > Other scenario handling
> > > > > >
> > > > > > There are other users of the local Zookeeper: the BookKeeper
> > > > > > client, managed ledger metadata, bookie rack information, and
> > > > > > schema metadata. None of these create any EPHEMERAL znodes on
> > > > > > the Zookeeper. Pulsar uses the Zookeeper cache for the local
> > > > > > Zookeeper; the cache is invalidated when a session timeout
> > > > > > occurs.
> > > > > >
> > > > > > Configurations
> > > > > >
> > > > > > A new configuration parameter, zookeeperSessionExpiredPolicy,
> > > > > > is added to broker.conf to control the zookeeper session expiry
> > > > > > policy. There are two options: shutdown and reconnect.
> > > > > >
> > > > > > Thanks,
> > > > > > Penghui
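[Editor's note] For reference, the new switch might look like this in broker.conf. The parameter name and its two values come from the proposal itself; the comment text and the default shown here are assumptions, consistent with the thread's statement that the default shutdown behavior is not changed.

```properties
# Policy applied when the local ZooKeeper session expires.
#   shutdown  - current behavior: the broker shuts itself down
#   reconnect - start a new session, re-own cached bundles, keep serving
# (Default shown here is an assumption; the PIP keeps shutdown as the
# existing behavior.)
zookeeperSessionExpiredPolicy=shutdown
```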