Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

PengHui Li Tue, 26 Jan 2021 16:55:22 -0800

Thanks for all your suggestions. LinLin and I will work on abstracting this
acknowledgment management as an interface.


- Penghui

Sijie Guo <guosi...@gmail.com> wrote on Mon, Jan 25, 2021 at 11:14 AM:

> Rajan,
>
> Thank you for sharing the prototype! That looks great to me.
>
> In order for the community to evolve and experiment with different
> approaches, how about we abstract this acknowledgment management as an
> interface?
>
> If this approach works for you, maybe Lin Lin can focus on abstracting this
> interface. Then we can implement your approach and his approach using this
> interface. Users can decide which implementation to use.
>
> Thanks,
> Sijie
>
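One way to picture the pluggable interface suggested above is the following minimal sketch. Every name in it (`AckStore`, `SingleEntryAckStore`, `MultiEntryAckStore`, `create_ack_store`, the `"single-entry"` and `"multi-entry"` keys) is hypothetical and is not taken from Pulsar; the text encoding is a placeholder chosen only to keep the example short.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Set, Tuple, Type

Position = Tuple[int, int]  # (ledger_id, entry_id)


class AckStore(ABC):
    """Hypothetical pluggable store for individually acknowledged positions."""

    def __init__(self) -> None:
        self._acked: Set[Position] = set()

    def ack(self, position: Position) -> None:
        self._acked.add(position)

    def is_acked(self, position: Position) -> bool:
        return position in self._acked

    @abstractmethod
    def snapshot(self) -> List[bytes]:
        """Return the payloads to persist, one per entry written."""


def _encode(positions: List[Position]) -> bytes:
    # Placeholder encoding: "ledger:entry" pairs joined by commas.
    return ",".join(f"{ledger}:{entry}" for ledger, entry in positions).encode()


class SingleEntryAckStore(AckStore):
    """Writes the whole state as one payload."""

    def snapshot(self) -> List[bytes]:
        return [_encode(sorted(self._acked))]


class MultiEntryAckStore(AckStore):
    """Splits the state into several payloads of bounded size."""

    def __init__(self, max_positions_per_entry: int = 2) -> None:
        super().__init__()
        self._chunk = max_positions_per_entry

    def snapshot(self) -> List[bytes]:
        ordered = sorted(self._acked)
        return [_encode(ordered[i:i + self._chunk])
                for i in range(0, len(ordered), self._chunk)]


# Users pick an implementation by name, e.g. from configuration.
IMPLEMENTATIONS: Dict[str, Type[AckStore]] = {
    "single-entry": SingleEntryAckStore,
    "multi-entry": MultiEntryAckStore,
}


def create_ack_store(name: str) -> AckStore:
    return IMPLEMENTATIONS[name]()


store = create_ack_store("multi-entry")
for pos in [(1, 0), (1, 2), (1, 4)]:
    store.ack(pos)
print(store.snapshot())
```

The point of the sketch is only that callers depend on `AckStore`, so either approach discussed in this thread could sit behind the same calls.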
> On Sat, Jan 23, 2021 at 7:46 PM Rajan Dhabalia <rdhaba...@apache.org>
> wrote:
>
> > Hi,
> >
> > >> Delayed messages or certain user logic can introduce a lot of
> > >> message-holes. We have seen this issue in quite a lot of customers'
> > >> production environment.
> >
> > I agree that we need a larger buffer to store and recover individually
> > deleted messages, and the existing 150K limit might not be enough for many
> > use cases. However, I would also like to highlight issues that happened in
> > the past due to keeping a large number of deleted messages in the broker’s
> > memory; one of them was high GC pauses. Therefore, we introduced
> > ConcurrentOpenLongPairRangeSet
> > <https://github.com/apache/pulsar/pull/3818> to manage deleted messages
> > without actually storing range objects in memory. OpenRangeSet uses a
> > bitset to store ranges in memory, and we can also use it to persist them
> > to disk for recovery. This approach has various advantages: a simple
> > implementation, a large enough range for recovery, and no intermediate
> > conversion from unack-messages to bitset in OpenRangeSet, which saves
> > extra CPU during recovery.
> >
> > I implemented a simple prototype
> > <https://github.com/rdhabalia/pulsar/commit/1f8e5e745e9f1d1429697b5dee1da70545385653>
> > to store deleted messages using a bitset in OpenRangeSet, and we can
> > persist 10M ranges with 5MB data size, which I guess is large enough for
> > any use case. So, we can use this approach to solve the problem without
> > introducing unnecessary complexity in managed-cursor.
> >
> > Thanks,
> > Rajan
> >
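The bitset idea described above can be illustrated with a small sketch. This is not the ConcurrentOpenLongPairRangeSet implementation or its on-disk format; `BitsetAckTracker` and its methods are hypothetical names, and a Python `int` stands in for a real bitset. It only shows why one bit per entry id is compact and why the in-memory form can be persisted and restored without converting to range objects.

```python
from collections import defaultdict
from typing import Dict


class BitsetAckTracker:
    """Tracks individually acknowledged entries as one bitmask per ledger."""

    def __init__(self) -> None:
        # ledger_id -> Python int used as an arbitrarily long bitset;
        # bit i is set when entry i of that ledger has been acknowledged.
        self._bits: Dict[int, int] = defaultdict(int)

    def ack(self, ledger_id: int, entry_id: int) -> None:
        self._bits[ledger_id] |= 1 << entry_id

    def is_acked(self, ledger_id: int, entry_id: int) -> bool:
        return bool((self._bits.get(ledger_id, 0) >> entry_id) & 1)

    def to_bytes(self) -> Dict[int, bytes]:
        """Serialize each ledger's bitset; cost is about 1 bit per entry id."""
        return {
            ledger: bits.to_bytes((bits.bit_length() + 7) // 8, "little")
            for ledger, bits in self._bits.items()
        }

    @classmethod
    def from_bytes(cls, data: Dict[int, bytes]) -> "BitsetAckTracker":
        """Rebuild the tracker directly from the persisted bitsets."""
        tracker = cls()
        for ledger, raw in data.items():
            tracker._bits[ledger] = int.from_bytes(raw, "little")
        return tracker


tracker = BitsetAckTracker()
for entry in range(0, 1000, 2):  # every other entry acked: many "holes"
    tracker.ack(ledger_id=7, entry_id=entry)

snapshot = tracker.to_bytes()
restored = BitsetAckTracker.from_bytes(snapshot)
print(len(snapshot[7]), restored.is_acked(7, 998), restored.is_acked(7, 999))
```

In this toy case 500 separate acknowledgment holes in one ledger serialize to 125 bytes, because the cost depends on the highest entry id rather than on the number of ranges.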
> > On Fri, Jan 22, 2021 at 7:52 PM Sijie Guo <guosi...@gmail.com> wrote:
> >
> > > Joe - Delayed messages or certain user logic can introduce a lot of
> > > message holes. We have seen this issue in quite a lot of customers'
> > > production environments. Hence we need to find a solution to these
> > > problems. If you are skeptical of an implementation like that, how about
> > > we make the cursor implementation pluggable? We can implement this
> > > proposal as one plugin, so it will not impact any existing logic while
> > > allowing people to use a plugin to solve this problem.
> > >
> > > Thanks,
> > > Sijie
> > >
> > > On Fri, Jan 22, 2021 at 5:00 PM Joe Francis <j...@verizonmedia.com.invalid>
> > > wrote:
> > >
> > > > Let me take a step back and explain how I am looking at this from a
> > > > high-level design viewpoint.
> > > >
> > > > Bookkeeper (BK) is like an LSM implementation of a KV store. Writes to
> > > > all keys are appended to a single file; deletes are logical. Compaction
> > > > reclaims space. An index is used to locate entries, track logical
> > > > deletes, and reclaim space.
> > > >
> > > > The index in BK is another LSM. Again, writes are appended, deletes are
> > > > logical, and an index is used to locate entries and to account for
> > > > deletes and compaction to reclaim space (the implementation within
> > > > rocksdb is far more complex, with bloom filters and memtables, but you
> > > > get the idea). BK just uses a sophisticated index (rocksdb) which is
> > > > tiny and cacheable, and rocksdb has within it a sophisticated index
> > > > which is small and cacheable.
> > > >
> > > > So when I look at this proposal, what I see is the same: another
> > > > attempt to build an LSM with a sophisticated index/cache mechanism
> > > > using log-structured storage. So I am quite skeptical that this needs
> > > > to be solved this way, within Pulsar.
> > > >
> > > >
> > > >
> > > > Joe
> > > >
> > > > On Wed, Jan 20, 2021 at 12:30 AM linlin <lin...@apache.org> wrote:
> > > >
> > > > > We can look at ManagedCursorImpl.buildIndividualDeletedMessageRanges.
> > > > >
> > > > > What is saved in the entry is not a bitSet but individual
> > > > > messageRanges, one by one, each containing information such as
> > > > > ledgerId and entryId. The BitSet only exists in memory and is used to
> > > > > quickly determine whether a position already exists.
> > > > > In addition, the position of each ack is stored in the
> > > > > individualDeletedMessages queue. When persisted to the entry, the
> > > > > queue is traversed, and the position information of each ack
> > > > > generates a messageRange.
> > > > > A messageRange contains lowerEndpoint (ledgerId+entryId) and
> > > > > upperEndpoint (ledgerId+entryId): 4 longs, about 256 bits.
> > > > >
> > > > > Assume a more extreme scenario: 300K messages in which every other
> > > > > message is unacknowledged, so that 150K positions are stored in
> > > > > individualDeletedMessages. 150K * 256 / 8 / 1024 / 1024 ≈ 4.6MB.
> > > > > Of course, there are also scenarios where the customer's acks span
> > > > > several ledgers.
> > > > >
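The size estimate above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming each persisted messageRange costs exactly 4 longs of 8 bytes (the figure given in the message) and ignoring any serialization overhead or savings; the function name is hypothetical.

```python
BYTES_PER_LONG = 8
LONGS_PER_RANGE = 4  # lower (ledgerId, entryId) + upper (ledgerId, entryId)


def persisted_ranges_size_mib(num_ranges: int) -> float:
    """Estimated payload size, in MiB, for `num_ranges` message ranges."""
    total_bytes = num_ranges * LONGS_PER_RANGE * BYTES_PER_LONG
    return total_bytes / 1024 / 1024


# 300K messages with every other one unacknowledged -> 150K ranges.
size = persisted_ranges_size_mib(150_000)
print(f"{size:.1f} MiB")  # 4.6 MiB
```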
> > > > >
> > > > > On 2021/01/20 00:38:47, Joe F <j...@gmail.com> wrote:
> > > > > > I have a simpler question. Just storing the message-ids raw will fit
> > > > > > ~300K entries in one ledger entry. With the bitmap changes, we can
> > > > > > store a couple of million within one 5MB ledger entry. So can you
> > > > > > tell us what numbers of unacked messages are creating a problem?
> > > > > > What exactly are the issues you face, and at what numbers of
> > > > > > unacked messages/memory use etc.?
> > > > > >
> > > > > > I have my own concerns about this proposal, but I would like to
> > > > > > understand the problem first.
> > > > > >
> > > > > > Joe
> > > > > >
> > > > > > On Sun, Jan 17, 2021 at 10:16 PM Sijie Guo <gu...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Lin,
> > > > > > >
> > > > > > > Thank you and Penghui for drafting this! We have seen a lot of
> > > > > > > pain points with `managedLedgerMaxUnackedRangesToPersist` when
> > > > > > > enabling delayed messages. Glad that you and Penghui are spending
> > > > > > > time on resolving this!
> > > > > > >
> > > > > > > Overall the proposal looks good. But I have a couple of questions
> > > > > > > about the proposal.
> > > > > > >
> > > > > > > 1. What happens if the broker fails to write the entry marker?
> > > > > > > For example, at t0, the broker flushes dirty pages and
> > > > > > > successfully writes an entry marker. At t1, the broker tries to
> > > > > > > flush dirty pages but fails to write the new entry marker. How
> > > > > > > can you recover the entry marker?
> > > > > > >
> > > > > > > 2. When a broker crashes and recovers the managed ledger, the
> > > > > > > cursor ledger is not writable anymore. Are you going to create a
> > > > > > > new cursor ledger and copy all the entries from the old cursor
> > > > > > > ledger to the new one?
> > > > > > >
> > > > > > > It would be good if you can clarify these two questions.
> > > > > > >
> > > > > > > - Sijie
> > > > > > >
> > > > > > > On Sun, Jan 17, 2021 at 9:48 PM linlin <li...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi, community:
> > > > > > > >     Recently we encountered some problems when using individual
> > > > > > > > acknowledgments, such as: when the number of acknowledgments is
> > > > > > > > large, entry writing fails; a large amount of cache causes OOM,
> > > > > > > > etc.
> > > > > > > > So I drafted a PIP in
> > > > > > > > https://docs.google.com/document/d/1uQtyb8t6X04v2vrSrdGWLFkuCkBcGYZbqK8XsVJ4qkU/edit?usp=sharing
> > > > > > > > , any feedback is welcome.
