Thanks a lot for your explanation.
------------------ Original ------------------
From: Jack Vanlightly <jvanligh...@splunk.com.INVALID>
Date: Tue, Sep 14, 2021 9:11 PM
To: dev <dev@bookkeeper.apache.org>
Subject: Re: AutoRecovery failed replicate ledger , because, it would read lac from failed bookie

1. Any ledgers that are open, and whose current ensemble includes that downed
bookie, will be unrecoverable until it comes back online. If other ledgers
hosted on that bookie are recoverable, then either those ledgers are closed,
or they are open (or in recovery) but their last ensemble does not include
this bookie. The LAC read is only sent to the bookies of the current ensemble
(and a ledger may have multiple ensembles if ensemble changes occurred).

2. Look for "coverageSet.checkCovered()"

Jack

On Tue, Sep 14, 2021 at 3:00 PM zhangao <gaozhangmin...@qq.com.invalid> wrote:

> My ack quorum is 1. Please let me explain my confusion:
> 1. When one bookie is down, as you said, why can some ledgers be replicated
> successfully while others cannot?
> 2. In the code below from PendingReadLacOp, I don't see any code related
> to the ack quorum when reading the LAC.
>
> public void initiate() {
>     for (int i = 0; i < currentEnsemble.size(); i++) {
>         bookieClient.readLac(currentEnsemble.get(i), lh.ledgerId, this, i);
>     }
> }
>
> ------------------ Original Message ------------------
> From: "dev" <jvanligh...@splunk.com.INVALID>
> Sent: Tuesday, Sep 14, 2021 8:49 PM
> To: "dev" <dev@bookkeeper.apache.org>
> Subject: Re: AutoRecovery failed replicate ledger , because, it would read
> lac from failed bookie
>
> An LAC read will fail in this way if Ack Quorum or more bookies respond
> with anything other than OK, NoSuchEntry, or NoSuchLedger.
>
> What is your ack quorum? If it is just 1 (not a good setting), then a
> single bookie being down will make the LAC read fail this way. If your ack
> quorum is 2, then 2 bookies being down will cause it, etc.
>
> Jack
>
> On Tue, Sep 14, 2021 at 1:17 PM zhangao <gaozhangmin...@qq.com.invalid>
> wrote:
>
> > As the title says: when a bookie is lost, ledgers in the open state cannot
> > be replicated, because the LAC is read from the failed bookie.
> > The LAC read from the failed bookie fails because the bookie cannot be
> > connected to.
> >
> > How does BookKeeper auto recovery deal with open ledgers on a failed bookie?
> >
> > I don't know if it's a bug or not.
> >
> > The error log:
> >
> > 12:29:57.072 [main-EventThread] INFO
> > org.apache.bookkeeper.client.DefaultBookieAddressResolver - Cannot resolve
> > x.x.x.x:3181, bookie is unknown
> > org.apache.bookkeeper.client.BKException$BKBookieHandleNotAvailableException:
> > Bookie handle is not available
> >
> > 12:29:57.072 [main-EventThread] ERROR
> > org.apache.bookkeeper.proto.PerChannelBookieClient - Cannot connect to
> > x.x.x.x:3181 as endpoint resolution failed (probably bookie is down) err
> > org.apache.bookkeeper.proto.BookieAddressResolver$BookieIdNotResolvedException:
> > Cannot resolve bookieId x.x.x.x:3181, bookie does not exist or it is not
> > running
> >
> > 12:29:57.078 [BookKeeperClientWorker-OrderedExecutor-29-0] INFO
> > org.apache.bookkeeper.client.PendingReadLacOp - While readLac ledger: 96789
> > did not hear success responses from all of ensemble
> >
> > 12:29:57.078 [ReplicationWorker] INFO
> > org.apache.bookkeeper.replication.ReplicationWorker - BKReadException while
> > rereplicating ledger 96789. Enough Bookies might not have available So, no
> > harm to continue
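
The ack-quorum rule described in this thread can be illustrated with a small, self-contained sketch. It is not the BookKeeper implementation (for that, see the "coverageSet.checkCovered()" call mentioned above). It is a simplified model under stated assumptions: the class name, the Response enum, and the isCovered method are hypothetical, and the round-robin write-quorum layout and the writeQuorum parameter are assumptions that the thread does not discuss. The idea is only that a LAC read cannot succeed once ack-quorum or more bookies in some write quorum answer with anything other than OK, NoSuchEntry, or NoSuchLedger.

```java
import java.util.Arrays;
import java.util.List;

public class LacCoverageSketch {
    // Hypothetical response codes for this sketch; not BookKeeper's BKException codes.
    public enum Response { OK, NO_SUCH_ENTRY, NO_SUCH_LEDGER, BOOKIE_UNAVAILABLE }

    // OK, NoSuchEntry and NoSuchLedger all count as a usable answer.
    public static boolean isUsable(Response r) {
        return r != Response.BOOKIE_UNAVAILABLE;
    }

    // Assumed model: write quorums are consecutive bookies in the ensemble,
    // wrapping around. The read is covered only if no write quorum contains
    // ackQuorum or more unusable responses.
    public static boolean isCovered(List<Response> responses, int writeQuorum, int ackQuorum) {
        int ensembleSize = responses.size();
        for (int start = 0; start < ensembleSize; start++) {
            int unusable = 0;
            for (int j = 0; j < writeQuorum; j++) {
                if (!isUsable(responses.get((start + j) % ensembleSize))) {
                    unusable++;
                }
            }
            if (unusable >= ackQuorum) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Ensemble of three bookies, one of them down.
        List<Response> oneDown = Arrays.asList(
                Response.OK, Response.BOOKIE_UNAVAILABLE, Response.OK);
        // Ack quorum 1: a single unreachable bookie already blocks the read.
        System.out.println("ackQuorum=1 covered: " + isCovered(oneDown, 2, 1));
        // Ack quorum 2: one unreachable bookie is tolerated.
        System.out.println("ackQuorum=2 covered: " + isCovered(oneDown, 3, 2));
    }
}
```

Under this model, an ack quorum of 1 makes every open ledger whose current ensemble contains the downed bookie fail the LAC read. This matches the behaviour reported in the thread.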