poorbarcode opened a new pull request, #24722:
URL: https://github.com/apache/pulsar/pull/24722

   ### Motivation
   
   Our Pulsar cluster hit an issue where the command `pulsar-admin topics stats --get-precise-backlog --get-subscription-backlog-size <topic>` reported a negative backlog.
   
   The topic's internal stats also show that a consumer acknowledged messages that do not exist in `topic.ledgers`:
   - ledgerInfo: `{"ledgerId":1140377,"entries":5,"size":45573,"offloaded":false,"metadata":null,"underReplicated":false}`
   - But the acknowledged range covers more entries than the ledger records: `(1140377:-1..1140377:6]`
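
To make the mismatch concrete, here is a minimal sketch of the arithmetic. It is a simplified model, not Pulsar's actual backlog computation; the class and method names are hypothetical. It assumes the backlog is the entry count recorded in the ledger metadata minus the entries covered by the acknowledged range.

```java
// Simplified model only: real Pulsar backlog accounting is more involved.
final class BacklogSketch {
    /** Number of entries covered by a half-open ack range (from, to] inside one ledger. */
    static long ackedEntries(long fromExclusive, long toInclusive) {
        return toInclusive - fromExclusive;
    }

    /** Backlog modeled as "entries recorded in metadata" minus "entries acknowledged". */
    static long backlog(long entriesInMetadata, long acked) {
        return entriesInMetadata - acked;
    }

    public static void main(String[] args) {
        // (1140377:-1..1140377:6] covers entry IDs 0 through 6.
        long acked = ackedEntries(-1, 6);
        // ledgerInfo reports "entries":5 for ledger 1140377.
        long backlog = backlog(5, acked);
        System.out.println("acked=" + acked + ", backlog=" + backlog);
    }
}
```

Under this model the acknowledged range covers 7 entries while the metadata records only 5, so the computed backlog is -2.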
   
   ---
   
   The issue happened this way:
   
   - Configurations
     - `managedLedgerDefaultEnsembleSize=2`
     - `managedLedgerDefaultWriteQuorum=2`
     - `managedLedgerDefaultAckQuorum=2`
   - Topic state
     - `ledgers`: `[3: {entries: 1}]`
   - Scenarios
     - `bookie-2` is slow
     - ZooKeeper connections are unstable, which can cause a race over topic ownership
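
The configuration above matters because, with an ack quorum of 2, a write is confirmed to the broker only after both bookies respond, so a slow `bookie-2` can leave an entry stored on `bookie-1` but still unconfirmed. A minimal sketch of that rule, using hypothetical names rather than the BookKeeper client API:

```java
// Hypothetical model of the ack-quorum rule; not the BookKeeper client implementation.
final class AckQuorumSketch {
    /** An entry counts as confirmed once at least ackQuorum bookies have responded. */
    static boolean isConfirmed(int responses, int ackQuorum) {
        return responses >= ackQuorum;
    }

    public static void main(String[] args) {
        int ackQuorum = 2; // managedLedgerDefaultAckQuorum=2
        // Only bookie-1 has responded: the entry is on disk there but not confirmed.
        System.out.println("one response confirmed: " + isConfirmed(1, ackQuorum));
        // Both bookies have responded: the entry is confirmed.
        System.out.println("two responses confirmed: " + isConfirmed(2, ackQuorum));
    }
}
```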
   
   | time | `broker-1` | `broker-2` | `bookie-1` | `bookie-2` |
   | --- | --- | --- | --- | --- |
   | 1 | owns the topic | | | |
   | 2 | starts publishing message `3:1` (not completed yet) | | | |
   | 3 | | | receives the write request for `3:1` | receives the write request for `3:1` |
   | 4 | | | finishes writing to disk | finishes writing to disk |
   | 5 | | | responds to `broker-1` | |
   | 2 | starts publishing message `3:2` (not sent to the bookies yet) | | | |
   | 6 | metadata store connection becomes unstable | | | |
   | 7 | | takes ownership of the topic | | |
   | 8 | still assumes it is the owner of the topic | | | |
   | 9 | | starts to close ledger `3` | | |
   | 10 | | | receives the close request for ledger `3` | receives the close request for ledger `3` |
   | 11 | | | fences and closes ledger `3` | |
   | 12 | | | receives the write request for `3:2` | receives the write request for `3:2` |
   | 13 | | | responds with a `fenced error` because the ledger has been closed | |
   | 14 | receives the `fenced error` | | | |
   | 15 | switches to ledger `4` and writes topic metadata with `[3: {entries: 1}]`, because `bookie-2` has not yet responded to the write request for `3:1` | | | |
   | 16 | | | | responds to `broker-1` for the write of `3:1` |
   | 17 | | | | fences and closes ledger `3` |
   | 18 | | closes ledger `3`, which has `2` entries | | |
   | 19 | receives the response for the write of `3:1` | | | |
   
   Highlight: The issue also leads to messages being lost.
   
   You can reproduce the issue with the new test `testConcurrentlyCloseCurrentLedger`.
   
   
   ### Modifications
   
   When a write to the bookies fails with a `fenced error`, re-check the ledger's entry count from the metadata store instead of relying on the locally confirmed count.
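
The idea can be sketched as follows. This is an illustrative model under stated assumptions, not the code in this PR: `MetadataStore`, `recordClose`, and `entriesToRecord` are hypothetical names, and the store is an in-memory map standing in for the real ledger metadata.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only; all names here are hypothetical.
final class FencedRecheckSketch {
    /** In-memory stand-in for the authoritative ledger metadata. */
    static final class MetadataStore {
        private final Map<Long, Long> closedLedgerEntries = new HashMap<>();

        /** Records the final entry count written when a ledger is closed. */
        void recordClose(long ledgerId, long entries) {
            closedLedgerEntries.put(ledgerId, entries);
        }

        /** Returns the recorded entry count, or the fallback if the ledger is not closed. */
        long entriesOrDefault(long ledgerId, long fallback) {
            return closedLedgerEntries.getOrDefault(ledgerId, fallback);
        }
    }

    /** Chooses the entry count to write into the topic's ledger list. */
    static long entriesToRecord(boolean fenced, long ledgerId,
                                long locallyConfirmed, MetadataStore store) {
        if (!fenced) {
            return locallyConfirmed; // normal rollover: the local view is trusted
        }
        // Fenced: another owner may have closed the ledger with more entries.
        return store.entriesOrDefault(ledgerId, locallyConfirmed);
    }

    public static void main(String[] args) {
        MetadataStore store = new MetadataStore();
        store.recordClose(3L, 2L);  // broker-2 closed ledger 3 with 2 entries
        long locallyConfirmed = 1L; // broker-1 only saw 1 confirmed entry

        System.out.println("without re-check: "
                + entriesToRecord(false, 3L, locallyConfirmed, store));
        System.out.println("with re-check: "
                + entriesToRecord(true, 3L, locallyConfirmed, store));
    }
}
```

In the timeline above, `broker-1` would record `[3: {entries: 2}]` instead of `[3: {entries: 1}]`, matching what `broker-2` wrote when it closed the ledger.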
   
   ### Matching PR in forked repository
   
   PR in forked repository: x

