Hi Joe, I will improve the PIP, and then send it back to the email thread soon.
-Penghui
On Feb 25, 2020 at 4:41 AM +0800, Sijie Guo <guosi...@gmail.com> wrote:
> On Mon, Feb 24, 2020 at 11:33 AM Joe F <joefranc...@gmail.com> wrote:
>
> > Sijie, Penghui,
> >
> > Thank you. Can we get the PIP to be a more detailed write-up? I would
> > like this PIP to be more comprehensive.
>
> I think Penghui can work on improving this to make this PIP more
> comprehensive.
>
> It is good to start with adding enough background and context to state
> what guarantees we currently have.
>
> > > Hence we need to draw an agreement on understanding WHAT actually
> > > guarantees correctness in the current Pulsar design. We can then
> > > move forward with a conclusion about how to do it.
> >
> > That would be great. I have been thinking recently about whether we can
> > formalize the system through a TLA model. It would be ideal, but it will
> > also require BK to provide one. Whether we use TLA or not, we should
> > have an understanding of WHAT actually guarantees correctness in the
> > current Pulsar design _written down_. At least then we will have a model
> > against which we can align changes and fixes.
>
> TLA is a super great idea to start. It has to be done at both the storage
> (bookkeeper) level and the broker level.
>
> Since we used to compare bookkeeper/distributedlog and raft, I think
> bookkeeper/ML is pretty close to RAFT. If there is a TLA model for RAFT,
> it should be a good starting point to follow. If you are planning to
> start the work, happy to join forces.
>
> - Sijie
>
> > Joe
> >
> > On Sun, Feb 23, 2020 at 11:44 PM Sijie Guo <guosi...@gmail.com> wrote:
> >
> > > Sorry for the late reply.
> > >
> > > Joe - There are two things I would like to clarify first.
> > >
> > > 1) I think you have a misunderstanding about the zookeeper lock
> > > ("ephemeral znode") and bookkeeper/ML fencing. Let's step back to
> > > understand the current Pulsar behavior first.
> > >
> > > - A zookeeper lock doesn't prevent a dual-writer situation from
> > > happening. A simple case - Node A creates a lock and Node B stands
> > > by. If the lock (the ephemeral znode) expires at zookeeper, Node B
> > > acquires the lock and becomes the owner. But Node A might NOT receive
> > > the session-expire notification because of a network partition, hence
> > > A still thinks it is the owner. So there is a duration during which
> > > both A and B think they are the owners.
> > >
> > > - The correctness of using a zookeeper lock has to be gated by the
> > > exclusiveness of a resource. In ML, the exclusiveness is provided by
> > > the single-writer semantics offered by bookkeeper and the CAS
> > > operations offered by zookeeper.
> > >
> > > So a zookeeper lock (or an external locking mechanism) only ensures
> > > "stable" ownership of a resource over a long duration, but it doesn't
> > > prevent dual ownership. The resource itself has to provide a
> > > mechanism to ensure exclusiveness. BookKeeper/ML does that via
> > > fencing. HBase uses ACLs to achieve "fencing" for regions.
> > >
> > > Martin Kleppmann wrote a well-written blog post about this that is
> > > worth checking out:
> > >
> > > https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
> > >
> > > 2) Increasing the session timeout doesn't prevent session expiry.
> > > Session expiry can still happen when the zookeeper leader crashes,
> > > the client pauses, or there are network hiccups. Increasing the
> > > session timeout also introduces side effects: the broker failover
> > > time increases as well, which means topics are unavailable for longer
> > > when a broker goes down.
> > >
> > > ---
> > >
> > > Penghui, Joe,
> > >
> > > I think there are multiple things coupled in this discussion related
> > > to zookeeper.
> > >
> > > ZooKeeper is mainly used in two places:
> > >
> > > 1) "ownership" for bundles and service discovery for "brokers".
> > > It uses ephemeral znodes. They are session related and will be
> > > impacted by session expiries.
> > >
> > > 2) metadata management for both policies and ML. It uses persistent
> > > znodes. They are not session related, but unfortunately they are also
> > > impacted by session expiries, because zookeeper ties session
> > > management to connection management.
> > >
> > > For 2), it is safe to retry creating a zookeeper client to establish
> > > a new session when the session expires. We can set the session expiry
> > > time as high as we want, since it doesn't impact failover time.
> > >
> > > So the main discussion should be about 1) - whether it is safe to
> > > re-establish a zookeeper session to re-acquire bundles after the
> > > previous session has expired. Hence we need to draw an agreement on
> > > understanding WHAT actually guarantees correctness in the current
> > > Pulsar design. We can then move forward with a conclusion about how
> > > to do it.
> > >
> > > - Sijie
> > >
> > > On Sat, Feb 22, 2020 at 11:40 PM Joe F <joefranc...@gmail.com> wrote:
> > >
> > > > On Sat, Feb 22, 2020 at 6:28 PM PengHui Li
> > > > <codelipeng...@gmail.com> wrote:
> > > >
> > > > > Hi, Joe
> > > > >
> > > > > The fundamental correctness is guaranteed by the fencing
> > > > > mechanism provided by Apache BookKeeper and the CAS operations
> > > > > provided by the metadata storage. Both fencing and CAS operations
> > > > > will prevent two owners updating data or metadata at the same
> > > > > time.
> > > >
> > > > This may be, as I said: "It may be possible that underlying
> > > > lower-level locks may prevent catastrophe, but that does not
> > > > validate this violation of basic principles." I am all too familiar
> > > > with how BK works, and how CAS works for ML metadata storage, and I
> > > > know where all the bodies are buried.
> > > >
> > > > > This default shutdown behavior isn't changed.
> > > > > We just introduce an alternative way to improve stability when
> > > > > Zookeeper is doing leader election or becomes unavailable.
> > > >
> > > > This is just a claim. I would argue that it does the opposite.
> > > >
> > > > > According to the following rules, I think this will not break
> > > > > current system principles.
> > > > >
> > > > > 1. If the znode for the bundle is deleted, this is consistent
> > > > > with the current behavior. The broker who acquires the lock first
> > > > > will become the owner.
> > > > > 2. If the znode for the bundle is not deleted, other brokers are
> > > > > also unable to acquire the lock. Both the broker that re-creates
> > > > > the session and other brokers need to wait for the znode
> > > > > deletion. Then the broker who acquires the lock first will become
> > > > > the owner.
> > > > > 3. If the bundle ownership changed, the broker that re-creates
> > > > > the session is unable to acquire the lock, so it should unload
> > > > > the bundle. This is also consistent with the current
> > > > > ownership-change behavior.
> > > > > 4. Also, if unexpected exceptions are thrown during the re-own
> > > > > process, the broker needs to shut itself down.
> > > >
> > > > I don't think you address the issue I have raised. Say 30 secs is
> > > > the timeout. Let us say broker B1 lost its connection at t. Then B1
> > > > loses the session at t+30 secs. With your logic, B1 continues to
> > > > service the topic as if it still owns it. Meanwhile B2 acquires the
> > > > topic at t+31 and loses its connection at t+32 (and loses its
> > > > session at t+62). At t+62 B3 acquires it, and loses its connection
> > > > at t+63. Now B4 acquires it. B4 crashes.
> > > > Now the original broker B1 reacquires the session and merrily goes
> > > > on operating as if nothing occurred in the meantime (and so could
> > > > B2 and B3).
> > > >
> > > > All fine, as you say, because of lower-level locks in BK and ML to
> > > > prevent catastrophe...
> > > >
> > > > If you want to make the case that bundle ownership does not
> > > > guarantee underlying topic ownership, and topic ownership is
> > > > arbitrated by BK/ML(metadata), then explicitly make that case. Then
> > > > we can debate the merits of that, and see if the code and design
> > > > allow for it. Because as it is, that is not how Pulsar is designed.
> > > > Right now, topic ownership is arbitrated by the bundle lock. This
> > > > is not a change that should casually be slipped into the system.
> > > >
> > > > And my original question still stands - if this session loss is
> > > > such an issue for some use cases, why not raise the session
> > > > timeout? The broker can safely keep the session for longer. That's
> > > > far preferable to running the risk of doing this.
> > > >
> > > > > Thanks,
> > > > > Penghui
> > > > >
> > > > > On Feb 22, 2020, 12:27 +0800, PengHui Li
> > > > > <codelipeng...@gmail.com>, wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > I have drafted a proposal for improving the broker's Zookeeper
> > > > > > session timeout handling. You can find it at
> > > > > >
> > > > > > https://github.com/apache/pulsar/wiki/PIP-57%3A-Improve-Broker%27s-Zookeeper-Session-Timeout-Handling
> > > > > >
> > > > > > I have also copied it into the email thread for easier viewing.
> > > > > > Any suggestions or ideas are welcome; please join the
> > > > > > discussion.
> > > > > >
> > > > > > PIP 57: Improve Broker's Zookeeper Session Timeout Handling
> > > > > >
> > > > > > Motivation
> > > > > >
> > > > > > In Pulsar, brokers use Zookeeper as the configuration store and
> > > > > > for maintaining broker metadata.
> > > > > > We can also call them the Global Zookeeper and the Local
> > > > > > Zookeeper.
> > > > > >
> > > > > > The Global Zookeeper maintains the namespace policies, cluster
> > > > > > metadata, and partitioned topic metadata. To reduce read
> > > > > > operations on Zookeeper, each broker has a cache for the global
> > > > > > Zookeeper. The Global Zookeeper cache updates when znodes
> > > > > > change. Currently, when a session timeout happens on the global
> > > > > > Zookeeper, a new session starts. The broker does not create any
> > > > > > EPHEMERAL znodes on the global Zookeeper.
> > > > > >
> > > > > > The Local Zookeeper maintains the local cluster metadata, such
> > > > > > as broker load data, topic ownership data, managed ledger
> > > > > > metadata, and Bookie rack information. Both broker load data
> > > > > > and topic ownership data are stored as EPHEMERAL znodes on the
> > > > > > Local Zookeeper. Currently, when a session timeout happens on
> > > > > > the Local Zookeeper, the broker shuts itself down.
> > > > > >
> > > > > > Shutting down a broker results in an ownership change for the
> > > > > > topics that the broker owned. However, we have encountered many
> > > > > > problems related to the current session timeout handling, such
> > > > > > as a broker with a long JVM GC pause, or the Local Zookeeper
> > > > > > under high load. The latter in particular may cause all brokers
> > > > > > to shut down.
> > > > > >
> > > > > > So, the purpose of this proposal is to improve session timeout
> > > > > > handling on the Local Zookeeper to avoid unnecessary broker
> > > > > > shutdowns.
> > > > > >
> > > > > > Approach
> > > > > >
> > > > > > As with the Global Zookeeper session timeout handling and the
> > > > > > Zookeeper session timeout handling in BookKeeper, a new session
> > > > > > should start when the present session times out.
> > > > > >
> > > > > > If a new session fails to start, the broker retries several
> > > > > > times. The number of retries depends on the broker
> > > > > > configuration.
> > > > > > If a session still cannot be started after all retries, the
> > > > > > broker still needs to be shut down, since this likely indicates
> > > > > > a problem with the Zookeeper cluster. The user needs to restart
> > > > > > the broker after the zookeeper cluster returns to normal.
> > > > > >
> > > > > > If a new session starts successfully, the situation is slightly
> > > > > > more complicated, so I will describe each scenario separately.
> > > > > >
> > > > > > Topic ownership data handling
> > > > > >
> > > > > > The topic ownership data tracks all namespace bundles owned by
> > > > > > the broker. An EPHEMERAL znode is created in Zookeeper for each
> > > > > > namespace bundle. When a session timeout happens on the local
> > > > > > Zookeeper, all of the EPHEMERAL znodes maintained by this
> > > > > > broker will be deleted automatically. We need some mechanism to
> > > > > > avoid unnecessary ownership transfer of the bundles. Since the
> > > > > > broker caches the owned bundles in memory, it can use the cache
> > > > > > to re-own the bundles.
> > > > > >
> > > > > > First, when the broker tries to re-own a bundle, if the znode
> > > > > > of the bundle exists in Zookeeper and the owner is this broker,
> > > > > > Zookeeper may simply not have deleted the znode yet. The broker
> > > > > > should check whether the ephemeral owner is the current session
> > > > > > ID. If not, the broker should wait for the znode deletion.
> > > > > >
> > > > > > Then the broker tries to own the bundle. If it succeeds, the
> > > > > > bundle is not owned by another broker, and the broker should
> > > > > > check whether to preload the topics under the bundle. If it
> > > > > > fails, the bundle is owned by another broker, and the broker
> > > > > > should unload the bundle.
> > > > > >
> > > > > > Theoretically, this mechanism can guarantee that the ownership
> > > > > > of most bundles will not change across the session timeout.
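[Editor's note] The per-bundle re-own decision described above can be sketched as a small state check. The following is a minimal, hypothetical Python model of that logic; `FakeZk` and every name in it are illustrative stand-ins, not actual Pulsar or ZooKeeper client APIs.

```python
# Hypothetical sketch of the bundle re-own decision after a new ZooKeeper
# session is established. FakeZk is an in-memory stand-in, NOT a real
# ZooKeeper client; all names here are made up for illustration.

class NodeExistsError(Exception):
    """Raised when an ephemeral znode already exists at the path."""

class FakeZk:
    """Tiny in-memory stand-in for a ZooKeeper client session."""
    def __init__(self, session_id):
        self.session_id = session_id
        # path -> (owner_broker, session_id_that_created_the_znode)
        self.znodes = {}

    def create_ephemeral(self, path, owner):
        if path in self.znodes:
            raise NodeExistsError(path)
        self.znodes[path] = (owner, self.session_id)

def re_own_bundle(zk, broker_id, bundle):
    """Decide what to do for one cached bundle after session re-creation.

    Returns "owned"  - this broker holds (or re-acquired) the bundle,
            "wait"   - a stale znode from the expired session still exists,
            "unload" - another broker owns the bundle now (rule 3).
    """
    path = "/namespace/" + bundle
    node = zk.znodes.get(path)
    if node is not None:
        owner, creator_session = node
        if owner == broker_id and creator_session == zk.session_id:
            return "owned"   # znode already belongs to the new session
        if owner == broker_id:
            return "wait"    # ZooKeeper has not yet deleted the old znode
        return "unload"      # ownership changed while we were disconnected
    try:
        zk.create_ephemeral(path, broker_id)  # race for the lock (rule 1)
        return "owned"       # we won: preload topics under the bundle
    except NodeExistsError:
        return "unload"      # another broker won the race (rule 2/3)
```

For example, if the new session has ID 2 and the znode still carries the expired session's ID 1 with this broker as owner, the function returns "wait"; once the znode is gone, a retry acquires the lock and returns "owned".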
> > > > > > Broker load data handling
> > > > > >
> > > > > > The load data is used for namespace bundle load balancing, so
> > > > > > its handling does not need to be overly complicated. The only
> > > > > > effect is that stale load data may interfere with the choice of
> > > > > > a candidate broker for a namespace bundle. Even if the optimal
> > > > > > broker is not selected, the load manager will continue to
> > > > > > relocate namespace bundles.
> > > > > >
> > > > > > So for broker load data handling, we only need to guarantee
> > > > > > that the broker can report its load data successfully.
> > > > > >
> > > > > > Other scenario handling
> > > > > >
> > > > > > There are other users of the local Zookeeper: the BookKeeper
> > > > > > client, managed ledger metadata, bookie rack information, and
> > > > > > schema metadata. None of these create any EPHEMERAL znodes on
> > > > > > the Zookeeper. Pulsar uses the Zookeeper cache for the local
> > > > > > Zookeeper; the cache is invalidated when a session timeout
> > > > > > occurs.
> > > > > >
> > > > > > Configurations
> > > > > >
> > > > > > A new configuration parameter, zookeeperSessionExpiredPolicy,
> > > > > > is added to broker.conf to control the zookeeper session expiry
> > > > > > policy. There are two options: shutdown and reconnect.
> > > > > >
> > > > > > Thanks,
> > > > > > Penghui
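[Editor's note] For reference, the new switch might look like this in broker.conf. The parameter name and its two values come from the proposal itself; the comment text and the default shown here are assumptions, consistent with the thread's statement that the default shutdown behavior is not changed.

```properties
# Policy applied when the local ZooKeeper session expires.
#   shutdown  - current behavior: the broker shuts itself down
#   reconnect - start a new session, re-own cached bundles, keep serving
# (Default shown here is an assumption; the PIP keeps shutdown as the
# existing behavior.)
zookeeperSessionExpiredPolicy=shutdown
```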