> Need to replace (place link) with link.

I've replaced the `Motivation` section with the text you suggested.

> We discussed adding the subscription name which triggered the time limit to
> Topics.getStats().
> Why?

Since we have `pulsar_storage_backlog_eviction_count`,
I think we don't need to expose the subscription name which triggered the
backlog eviction.

> I have to run getStats(getEarliestTimeInBacklog=true) and it's way more
> expensive than the proposal above, since it needs to reach the earliest
> message for *each* subscription.

I don't think we need to optimize away this cost: it is only incurred when
the user explicitly requests it.
If the user does not set `getEarliestTimeInBacklog` to true, there is no
such overhead.
We don't need to add complexity for a call that is made very rarely.

> Also a bit less accurate - you want to get the subscription cached that
> triggered it, using the same number to find it. Earliest backlog is
> accurate but if the configuration flag is off, it's not the same number as
> getStats.

That problem does exist. The backlog may be large when the user receives
the alert but already smaller by the time the endpoint (Topics#getStats) is
requested, because there is a time gap between the two.
However, the alert is only a notification; it is when the user requests the
endpoint that they may take action.
I think it is reasonable to give users the more accurate, current backlog
before they act.
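
To make that last step concrete, here is a rough sketch (not the PIP's
implementation) of how a user could pick out the offending subscriptions from
a topic stats response. It assumes the stats are available as a JSON-like
dict with a `subscriptions` map, that `earliestMsgPublishTimeInBacklog` is in
epoch milliseconds, and that a non-positive limit means "no quota"; the
function name `subscriptions_over_quota` is made up for illustration.

```python
import time


def subscriptions_over_quota(topic_stats, now_ms=None):
    """Return the sorted names of subscriptions exceeding the topic's quota.

    Assumed (hypothetical) layout of topic_stats:
      backlogQuotaSizeBytes / backlogQuotaTimeSeconds at the topic level,
      backlogSize / earliestMsgPublishTimeInBacklog per subscription.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    size_limit = topic_stats.get("backlogQuotaSizeBytes", -1)
    time_limit_s = topic_stats.get("backlogQuotaTimeSeconds", -1)

    offenders = []
    for name, sub in topic_stats.get("subscriptions", {}).items():
        # Size-based check: compare the subscription backlog to the limit.
        over_size = size_limit > 0 and sub.get("backlogSize", 0) > size_limit
        # Time-based check: age of the earliest message still in the backlog.
        earliest_ms = sub.get("earliestMsgPublishTimeInBacklog", 0)
        age_s = (now_ms - earliest_ms) / 1000 if earliest_ms > 0 else 0
        over_time = time_limit_s > 0 and age_s > time_limit_s
        if over_size or over_time:
            offenders.append(name)
    return sorted(offenders)
```

Because both limits come back in the same response as the per-subscription
values, the comparison uses numbers taken at the same moment.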

Thanks,
Tao Jiuming

On Tue, Mar 14, 2023 at 16:51 Asaf Mesika <asaf.mes...@gmail.com> wrote:

> >
> > Pulsar has a feature called backlog quota (place link)
>
> Need to replace (place link) with link.
>
>
>
> >    1. Find the backlog subscriptions
> >    After received the alarm, users could request
> >    Topics#getStats(topicName, true/false, true, true)
> >    <https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1139>
> >    to get the topic stats, and find which subscriptions are in backlog.
> >    Pulsar exposed backlogSize and earliestMsgPublishTimeInBacklog in the
> >    subscription level, and we will expose backlogQuotaSizeBytes and
> >    backlogQuotaTimeSeconds in the topic level, so users could find which
> >    subscriptions in backlog easily.
> >
> We have forgotten the other comment.
> We discussed adding the subscription name which triggered the time limit to
> Topics.getStats().
> Why?
>
> I have to run getStats(getEarliestTimeInBacklog=true) and it's way more
> expensive than the proposal above, since it needs to reach the earliest
> message for *each* subscription.
> Also a bit less accurate - you want to get the subscription cached that
> triggered it, using the same number to find it. Earliest backlog is
> accurate but if the configuration flag is off, it's not the same number as
> getStats.
>
>
> Nice to have (not mandatory) additions:
>
> I would add before
>
> >
> >    1. After readEntryComplete
> >    <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java#L2780>,
> >    cache its result:
> >
> When this configuration flag is set to true, the broker does an I/O call
> by reading the oldest entry to get its write timestamp. Once we have that,
> we'll add caching to that value since we're going to use it for returning
> the age.
>
> I would add before:
>
> > slowestReaderTimeBasedBacklogQuotaCheck
> > <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java#L2817>
> > is a totally in-memory method, we just need to cache the
> >
>
> When this configuration flag is set to false, the check uses an estimate of
> the oldest entry timestamp, by taking the closing time of the ledger that
> contains the message.
>
> On Fri, Mar 10, 2023 at 8:29 AM 太上玄元道君 <dao...@apache.org> wrote:
>
> > I think yes, to avoid missing something, you can take a look if you have
> > time.
> >
> > Thanks,
> > Tao Jiuming
> >
> > On Thu, Mar 9, 2023 at 17:40 Asaf Mesika <asaf.mes...@gmail.com> wrote:
> >
> > > Is the PIP updated with all comments?
> > >
> > > On Thu, Mar 9, 2023 at 8:59 AM 太上玄元道君 <dao...@apache.org> wrote:
> > >
> > > > > backlogQuotaLimitSize
> > > > > should be `backlogQuotaSizeBytes`
> > > >
> > > > > backlogQuotaLimitTime
> > > > > should be `backlogQuotaTimeSeconds`
> > > >
> > > > > So you need to rename the metric.
> > > > > "pulsar_storage_backlog_quota_count" -->
> > > > > `pulsar_storage_backlog_eviction_count`
> > > >
> > > > > the topic's existing subscription.
> > > > > "subscription" --> "subscription*s*"
> > > >
> > > > > Number of backlog quota happends.
> > > > > Number of times backlog evictions happened due to exceeding backlog
> > > > > quota (either time or size).
> > > >
> > > > Accepted. If no further changes are needed, I'll start the vote next
> > > > week.
> > > >
> > > > Thanks,
> > > > Tao Jiuming
> > > >
> > > >
> > > > > On Tue, Mar 7, 2023 at 00:02 Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > >
> > > > > >
> > > > > > Pulsar has a feature called backlog quota (place link).
> > > > >
> > > > > You need to place a link :)
> > > > >
> > > > > > Expose pulsar_storage_backlog_quota_count in the topic level
> > > > >
> > > > > > You already have "pulsar_storage_backlog_size", so why do you need
> > > > > > this metric?
> > > > >
> > > > > backlogQuotaLimitSize
> > > > >
> > > > > should be `backlogQuotaSizeBytes`
> > > > >
> > > > > backlogQuotaLimitTime
> > > > >
> > > > > should be `backlogQuotaTimeSeconds`
> > > > >
> > > > > > What about goal no.4? Expose oldest unacknowledged message
> > > > > > subscription name?
> > > > > >
> > > > > > IMO, metrics are like API - perhaps indicate the change there as well
> > > > >
> > > > > > > Record the event when dropBacklogForSizeLimit
> > > > > > > <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BacklogQuotaManager.java#L121>
> > > > > > > or dropBacklogForTimeLimit
> > > > > > > <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BacklogQuotaManager.java#L194>
> > > > > > > is going to invoked.
> > > > >
> > > > >
> > > > > Oh, now I get it.
> > > > > So you need to rename the metric.
> > > > > "pulsar_storage_backlog_quota_count" -->
> > > > > `pulsar_storage_backlog_eviction_count`
> > > > >
> > > > >
> > > > > > the topic's existing subscription.
> > > > >
> > > > > "subscription" --> "subscription*s*"
> > > > >
> > > > > Number of backlog quota happends.
> > > > >
> > > > > > Number of times backlog evictions happened due to exceeding backlog
> > > > > > quota (either time or size).
> > > > >
> > > > >
> > > > > > >    1. Find the backlog subscriptions
> > > > > > >    After received the alarm, users could request
> > > > > > >    Topics#getStats(topicName, true/false, true, true)
> > > > > > >    <https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1139>
> > > > > > >    to get the topic stats, and find which subscriptions are in
> > > > > > >    backlog. Pulsar exposed backlogSize and
> > > > > > >    earliestMsgPublishTimeInBacklog in the subscription level, and
> > > > > > >    we will expose backlogQuotaLimitSize and backlogQuotaLimitTime
> > > > > > >    in the topic level, so users could find which subscriptions in
> > > > > > >    backlog easily.
> > > > > >
> > > > > > I wrote how it should be done IMO in a previous email.
> > > > >
> > > > >
> > > > > On Mon, Mar 6, 2023 at 1:20 PM 太上玄元道君 <dao...@apache.org> wrote:
> > > > >
> > > > > > > Hi Asaf,
> > > > > > I've updated the PIP, PTAL
> > > > > >
> > > > > > Thanks,
> > > > > > Tao Jiuming
> > > > > >
> > > > > > > On Sun, Mar 5, 2023 at 21:00 Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > > >
> > > > > > > On Thu, Mar 2, 2023 at 12:57 PM 太上玄元道君 <dao...@apache.org>
> > wrote:
> > > > > > >
> > > > > > > > > I  think you should fix this explanation:
> > > > > > > >
> > > > > > > > Thanks! I would like to copy the context you provide to the
> PIP
> > > > > > > motivation,
> > > > > > > > your description is more detailed, so developers don't have
> to
> > go
> > > > > > through
> > > > > > > > the code.
> > > > > > > >
> > > > > > >
> > > > > > > Sure
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > Today the quota is checked periodically, right? So that's
> how
> > > the
> > > > > > > > operator
> > > > > > > > > knows the cost in terms of I/O is limited.
> > > > > > > > > Now you are adding one additional I/O per collection,
> every 1
> > > min
> > > > > by
> > > > > > > > > default. That's a lot perhaps. How long is the check
> interval
> > > > > today?
> > > > > > > >
> > > > > > > > Actually, I don't want to introduce additional costs, I
> thought
> > > we
> > > > > > > > could cache its result, so that it won't introduce additional
> > > > costs.
> > > > > > > > It may be that I did not make it clear in the PIP and caused
> > this
> > > > > > > > misunderstanding, sorry.
> > > > > > > >
> > > > > > >
> > > > > > > Ok, just to verify: You plan to modify the code that runs
> > > > periodically
> > > > > > the
> > > > > > > backlog quota check, so the result will be cached there? This
> way
> > > > when
> > > > > > you
> > > > > > > pull that information from that code every 1min to expose it
> as a
> > > > > metric
> > > > > > it
> > > > > > > will have 0 I/O cost?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > The user today can calculate quota used for size based
> limit,
> > > > since
> > > > > > > there
> > > > > > > > > are two metrics that are exposed today on a topic level: "
> > > > > > > > > pulsar_storage_backlog_quota_limit" and
> > > > > > "pulsar_storage_backlog_size".
> > > > > > > > You
> > > > > > > > > can just divide the two to get a percentage.
> > > > > > > > > For the time-based limit, the only metric exposed today is
> > > quota
> > > > > > > itself ,
> > > > > > > > "
> > > > > > > > > pulsar_storage_backlog_quota_limit_time".
> > > > > > > >
> > > > > > > > I only noticed `pulsar_storage_backlog_size` but missed
> > > > > > > > `pulsar_storage_backlog_quota_limit` and
> > > > > > > > `pulsar_storage_backlog_quota_limit_time`. Many thanks for
> your
> > > > > > reminder.
> > > > > > > >
> > > > > > > >
> > > > > > > > So, in this condition, we already have the following
> > topic-level
> > > > > > metrics:
> > > > > > > > `pulsar_storage_backlog_size`: The total backlog size of the
> > > topics
> > > > > of
> > > > > > > this
> > > > > > > > topic owned by this broker (in bytes).
> > > > > > > > `pulsar_storage_backlog_quota_limit`: The total amount of the
> > > data
> > > > in
> > > > > > > this
> > > > > > > > topic that limits the backlog quota (bytes).
> > > > > > > > `pulsar_storage_backlog_quota_limit_time`: The backlog quota
> > > limit
> > > > in
> > > > > > > > time(seconds). (This metric does not exists in the doc, need
> to
> > > > > > improve)
> > > > > > > >
> > > > > > > >
> > > > > > > > We just need to add a new metric named
> > > > > > > > `pulsar_storage_earliest_msg_publish_time_in_backlog` in the
> > > > > > topic-level
> > > > > > > > that indicates the publish time of the earliest message in
> the
> > > > > backlog.
> > > > > > > > So users could get
> `pulsar_backlog_size_quota_used_percentage`
> > by
> > > > > > divide
> > > > > > > > `pulsar_storage_backlog_size ` and
> > > > > > > >
> > > `pulsar_storage_backlog_quota_limit`(`pulsar_storage_backlog_size`
> > > > /
> > > > > > > > `pulsar_storage_backlog_quota_limit`),
> > > > > > > > and could get `pulsar_backlog_time_quota_used_percentage` by
> > > divide
> > > > > > `now
> > > > > > > -
> > > > > > > > pulsar_storage_earliest_msg_publish_time_in_backlog` and
> > > > > > > > `pulsar_storage_backlog_quota_limit_time` (`now -
> > > > > > > > pulsar_storage_earliest_msg_publish_time_in_backlog` /
> > > > > > > > `pulsar_storage_backlog_quota_limit_time`).
> > > > > > > >
> > > > > > >
> > > > > > > I think there is a problem with the name
> > > > > > > `pulsar_storage_earliest_msg_publish_time_in_backlog` in the
> > > > > topic-level:
> > > > > > > * First, I prefer exposing the age rather than the publish
> time.
> > > > > > > * Second, it's a bit hard to figure out the meaning of the
> > earliest
> > > > msg
> > > > > > in
> > > > > > > the backlog.
> > > > > > >
> > > > > > > Maybe `pulsar_storage_backlog_age_seconds`? In the explanation
> > you
> > > > can
> > > > > > > write: "The age (time passed since it was published) of the
> > > earliest
> > > > > > > unacknowledged message based on the topic's
> > > > > > > existing subscriptions" ?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > The backlog quota time checker runs periodically, so we can
> > cache
> > > > its
> > > > > > > > result, so it won't lead to much costs.
> > > > > > > >
> > > > > > > > Pulsar also exposed subscription-level  `backlogSize` and
> > > > > > > > `earliestMsgPublishTimeInBacklog` in Pulsar-Admin
> > > > > > > > <
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1139
> > > > > > > > >
> > > > > > > > if
> > > > > > > > `subscriptionBacklogSize` and `getEarliestTimeInBacklog` are
> > > true.
> > > > > > > > We can also expose `backlogQuotaLimiteSize` and
> > > > > `backlogQuotaLimitTime`
> > > > > > > of
> > > > > > > > the topic to PulsarAdmin.
> > > > > > > >
> > > > > > >
> > > > > > > What is the relationship you see between Pulsar exposing
> > > > > > > subscriptionBacklogSize and earliestMsgPublishTimeInBacklog in
> > > > > > > subscription, to exposing the backlog quota limits in pulsar
> > admin?
> > > > > > >
> > > > > > > Limits can be exposed to Pulsar Admin, since it has 0 cost
> > > associated
> > > > > > with
> > > > > > > it.
> > > > > > > I think it's a good idea to do that.
> > > > > > > The quota usage can also be exposed to pulsar admin, since we
> > pull
> > > > that
> > > > > > > data from the backlog quota checker cache, so it has 0 cost as
> > > well.
> > > > > > >
> > > > > > > As we said in previous email we can also expose
> > > > > > > `backlogQuotaTimeOldestBacklogAgeSubscriptionName`
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > After users receive the backlog alert from metrics alerting
> > > > systems,
> > > > > > they
> > > > > > > > can get the topic name, then, they can request
> Topics#getStats
> > > > > > > > <
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1139
> > > > > > > > >
> > > > > > > > to
> > > > > > > > get which subscriptions are in the huge backlog.
> > > > > > > >
> > > > > > > >
> > > > > > > I agree users can use PulsarAdmin getStats for topic , with
> > > > > > > getEarliestTimeInBacklog=true to find the oldest subscription
> > > > > responsible
> > > > > > > for exceeding quota, but we can give them that information
> with 0
> > > > cost
> > > > > > > since we already have that subscription name cached (we spent
> the
> > > I/O
> > > > > to
> > > > > > > find out who that subscription is, let's just cache it and
> > provide
> > > > it).
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Tao Jiuming
> > > > > > > >
> > > > > > > > > On Wed, Mar 1, 2023 at 23:42 Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Pulsar has 2 configurations for the backlog eviction
> > > > > > > > > > <
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://pulsar.apache.org/docs/2.11.x/cookbooks-retention-expiry/#backlog-quotas
> > > > > > > > > >
> > > > > > > > > > : backlogQuotaDefaultLimitBytes and
> > > > > backlogQuotaDefaultLimitSecond.
> > > > > > > > > > By default, backlog eviction is disabled, and also, there
> > is
> > > a
> > > > > > field
> > > > > > > > > named
> > > > > > > > > > backlogQuotaMap in TopicPolicies
> > > > > > > > > > <
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/pulsar/blob/master/pulsar-common/src/main/java/org/apache/pulsar/common/policies/data/HierarchyTopicPolicies.java#L45
> > > > > > > > > >
> > > > > > > > > > /NamespaceSpacePolicies
> > > > > > > > > > <
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/common/policies/data/Policies.java#L41
> > > > > > > > >
> > > > > > > > > assists
> > > > > > > > > > in controlling Topic/Namespace level backlog quota.
> > > > > > > > > >
> > > > > > > > > > If topic backlog reaches the threshold of any item,
> backlog
> > > > > > eviction
> > > > > > > > will
> > > > > > > > > > be triggered, Pulsar will move subscription's cursor to
> > skip
> > > > > > > > > unacknowledged
> > > > > > > > > > messages.
> > > > > > > > > >
> > > > > > > > > > Before backlog eviction happens, we don't have a metric
> to
> > > > > monitor
> > > > > > > how
> > > > > > > > > > long that it can reaches the threshold.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I  think you should fix this explanation:
> > > > > > > > >
> > > > > > > > > In Pulsar, a subscription maintains a state of message
> > > > > acknowledged.
> > > > > > A
> > > > > > > > > subscription backlog is the set of messages which are
> > > > > unacknowledged.
> > > > > > > > > A subscription backlog size is the sum of size of
> > > unacknowledged
> > > > > > > messages
> > > > > > > > > (in bytes).
> > > > > > > > > A topic can have many subscriptions.
> > > > > > > > > A topic backlog is defined as the backlog size of the
> > > > subscription
> > > > > > > which
> > > > > > > > > has the oldest unacknowledged message. Since acknowledged
> > > > messages
> > > > > > can
> > > > > > > be
> > > > > > > > > interleaved with unacknowledged messages, calculating the
> > exact
> > > > > size
> > > > > > of
> > > > > > > > > that subscription can be expensive as it requires I/O
> > > operations
> > > > to
> > > > > > > read
> > > > > > > > > from the messages from the ledgers.
> > > > > > > > > For that reason, the topic backlog is actually defined to
> be
> > > the
> > > > > > > > estimated
> > > > > > > > > backlog size of that subscription. It does so by
> summarizing
> > > the
> > > > > size
> > > > > > > of
> > > > > > > > > all the ledgers, starting from the current active one, up
> to
> > > the
> > > > > > ledger
> > > > > > > > > which contains the oldest unacknowledged message (There is
> > > > > actually a
> > > > > > > > > faster way to calculate it, but this is the definition of
> the
> > > > > > > > estimation).
> > > > > > > > >
> > > > > > > > > A topic backlog age is the age of the oldest unacknowledged
> > > > message
> > > > > > (in
> > > > > > > > any
> > > > > > > > > subscription). If that message was written 30 minutes ago,
> > its
> > > > age
> > > > > is
> > > > > > > 30
> > > > > > > > > minutes.
> > > > > > > > >
> > > > > > > > > Pulsar has a feature called backlog quota (place link). It
> > > allows
> > > > > the
> > > > > > > > user
> > > > > > > > > to define a quota - in effect, a limit - which limits the
> > topic
> > > > > > > backlog.
> > > > > > > > > There are two types of quotas:
> > > > > > > > > * Size based: The limit is for the topic backlog size (as
> we
> > > > > defined
> > > > > > > > > above).
> > > > > > > > > * Time based: The limit is for the topic's backlog age (as
> we
> > > > > defined
> > > > > > > > > above).
> > > > > > > > >
> > > > > > > > > > Once a topic backlog exceeds either one of those limits,
> > > > > > > > > > one of the following actions is taken upon messages
> > > > > > > > > > written to the topic:
> > > > > > > > > > * The producer write is placed on hold for a certain
> > > > > > > > > > amount of time before failing.
> > > > > > > > > > * The producer write is failed.
> > > > > > > > > > * The subscriptions' oldest unacknowledged messages are
> > > > > > > > > > acknowledged in order until both the topic backlog size
> > > > > > > > > > and age fall within the limit (quota). This process is
> > > > > > > > > > called backlog eviction (it happens every interval).
> > > > > > > > >
> > > > > > > > > The quotas can be defined as a default value for any topic,
> > by
> > > > > using
> > > > > > > the
> > > > > > > > > following broker configuration keys:
> > > > backlogQuotaDefaultLimitBytes
> > > > > ,
> > > > > > > > > backlogQuotaDefaultLimitSecond. It can also be specified
> > > directly
> > > > > for
> > > > > > > all
> > > > > > > > > topics in a given namespace using the namespace policy, or
> a
> > > > > specific
> > > > > > > > topic
> > > > > > > > > using a topic policy.
> > > > > > > > >
> > > > > > > > > The user today can calculate quota used for size based
> limit,
> > > > since
> > > > > > > there
> > > > > > > > > are two metrics that are exposed today on a topic level: "
> > > > > > > > > pulsar_storage_backlog_quota_limit" and
> > > > > > "pulsar_storage_backlog_size".
> > > > > > > > You
> > > > > > > > > can just divide the two to get a percentage.
> > > > > > > > > For the time-based limit, the only metric exposed today is
> > > quota
> > > > > > itself
> > > > > > > > , "
> > > > > > > > > pulsar_storage_backlog_quota_limit_time".
> > > > > > > > >
> > > > > > > > > ------------
> > > > > > > > >
> > > > > > > > > I would create two metrics:
> > > > > > > > >
> > > > > > > > > `pulsar_backlog_size_quota_used_percentage`
> > > > > > > > > `pulsar_backlog_time_quota_used_percentage`
> > > > > > > > >
> > > > > > > > > You would like to know what triggered the alert, hence two.
> > > > > > > > > It's not the quota percentage, it's the quota used
> > percentage.
> > > > > > > > >
> > > > > > > > > ----------
> > > > > > > > >
> > > > > > > > > It checks if the backlog size exceeds the threshold(
> > > > > > > > > > backlogQuotaDefaultLimitBytes), and it gets the current
> > > backlog
> > > > > > size
> > > > > > > by
> > > > > > > > > > calculating LedgerInfo
> > > > > > > > > > <
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/proto/MLDataFormats.proto#L54
> > > > > > > > > >,
> > > > > > > > > > it will not lead to I/O.
> > > > > > > > >
> > > > > > > > > This is not correct.
> > > > > > > > > It checks against the topic / namespace policy, and if it
> > > doesn't
> > > > > > > exist,
> > > > > > > > it
> > > > > > > > > falls back on the default configuration key mentioned
> above.
> > > > > > > > >
> > > > > > > > > It checks if the backlog time exceeds the threshold(
> > > > > > > > > > backlogQuotaDefaultLimitSecond). If
> > > > > > preciseTimeBasedBacklogQuotaCheck
> > > > > > > > is
> > > > > > > > > > set to be true, it will read an entry from Bookkeeper,
> but
> > > the
> > > > > > > default
> > > > > > > > > > value is false, which means it gets the backlog time by
> > > > > calculating
> > > > > > > > > > LedgerInfo
> > > > > > > > > > <
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/proto/MLDataFormats.proto#L54
> > > > > > > > > >.
> > > > > > > > > > So in general, we don't need to worry about it will lead
> to
> > > > I/O.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm afraid of that.
> > > > > > > > > Today the quota is checked periodically, right? So that's
> how
> > > the
> > > > > > > > operator
> > > > > > > > > knows the cost in terms of I/O is limited.
> > > > > > > > >  Now you are adding one additional I/O per collection,
> every
> > 1
> > > > min
> > > > > by
> > > > > > > > > default. That's a lot perhaps. How long is the check
> interval
> > > > > today?
> > > > > > > > >
> > > > > > > > > Perhaps in the backlog quota check, you can persist the
> check
> > > > > result,
> > > > > > > and
> > > > > > > > > use it? Persist the age that is.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ------
> > > > > > > > >
> > > > > > > > > Regarding "slowest_subscription"
> > > > > > > > > I think the cost is too high, because the subscriptions
> will
> > > keep
> > > > > > > > > alternating, which can generate so many unique time series.
> > > Since
> > > > > > > > > Prometheus flush only every 2 hours, or any there TSDB, it
> > will
> > > > > cost
> > > > > > > you
> > > > > > > > > too much.
> > > > > > > > >
> > > > > > > > > I suggest exposing the name via the topic stats. This way
> > they
> > > > can
> > > > > > > issue
> > > > > > > > a
> > > > > > > > > REST call to grab that subscription name only when the
> alert
> > > > fires.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Asaf
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Feb 28, 2023 at 9:29 AM 太上玄元道君 <dao...@apache.org>
> > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Asaf,
> > > > > > > > > > I've updated the PIP, PTAL
> > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > Tao Jiuming
> > > > > > > > > >
> > > > > > > > > > > On Sun, Feb 26, 2023 at 23:03 Asaf Mesika <asaf.mes...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > Pulsar has 2 configurations for the backlog eviction:
> > > > > > > > > > > > backlogQuotaDefaultLimitBytes and
> > > > > > backlogQuotaDefaultLimitSecond,
> > > > > > > > if
> > > > > > > > > > > > topic backlog reaches the threshold of any item,
> > backlog
> > > > > > eviction
> > > > > > > > > will
> > > > > > > > > > be
> > > > > > > > > > > > triggered.
> > > > > > > > > > >
> > > > > > > > > > > This seems like default values, not the actual values.
> > Can
> > > > you
> > > > > > > please
> > > > > > > > > > > provide an explanation in the PIP and link to read
> more:
> > > > > > > > > > > 1. Where do you define the backlog quota exactly? What
> is
> > > the
> > > > > > > > > granularity
> > > > > > > > > > > (subscription?)
> > > > > > > > > > > 2.  Is the backlog quota on by default? If so, what are
> > the
> > > > > > default
> > > > > > > > > > values?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > *Notes*
> > > > > > > > > > > 1. When the backlog quota limit is defined in Bytes,
> and
> > > you
> > > > > wish
> > > > > > > to
> > > > > > > > > know
> > > > > > > > > > > how close a subscription is to its bytes limit, you
> need
> > to
> > > > > > > calculate
> > > > > > > > > the
> > > > > > > > > > > backlog size in bytes. From my understanding, there is
> an
> > > > > > accurate
> > > > > > > > > > > calculation (which is costly in terms of I/O) and there
> > is
> > > an
> > > > > > > > estimate
> > > > > > > > > of
> > > > > > > > > > > it. I presume you would want to use the estimated one,
> is
> > > > that
> > > > > > > > correct?
> > > > > > > > > > > The backlog quota itself, uses the accurate or the
> > > estimated
> > > > > when
> > > > > > > it
> > > > > > > > > > starts
> > > > > > > > > > > evicting entries (i.e. marking them as acknowledged)?
> > > > > > > > > > >
> > > > > > > > > > > 2. For the backlog limit specifying in time units,
> there
> > is
> > > > no
> > > > > > > > > estimate,
> > > > > > > > > > as
> > > > > > > > > > > it must be calculated all the time (earliest
> > unacknowledged
> > > > > > message
> > > > > > > > > > > distance from now). How do you plan to calculate the
> > > current
> > > > > age
> > > > > > of
> > > > > > > > the
> > > > > > > > > > > earliest message without bearing that I/O cost on each
> > > metric
> > > > > > > > > > calculation?
> > > > > > > > > > >
> > > > > > > > > > > 3. In the Goal section, you specify that your goal is
> to
> > > add
> > > > a
> > > > > > > > > > "proximity"
> > > > > > > > > > > metric.
> > > > > > > > > > > a) You must define that - what is proximity metric
> > exactly?
> > > > > What
> > > > > > > are
> > > > > > > > > its
> > > > > > > > > > > units? How are you planning to calculate it?
> > > > > > > > > > > b) Proximity is not a good term IMO. I personally have
> > > never
> > > > > seen
> > > > > > > > this
> > > > > > > > > > term
> > > > > > > > > > > used in software systems, unless it's in the
> > aviation/space
> > > > > > > industry.
> > > > > > > > > > Once
> > > > > > > > > > > you explain (a) I hope I can help provide alternative
> > > names.
> > > > > > > > > > >
> > > > > > > > > > > 4. Maybe we should provide the used quota percentage
> for
> > > both
> > > > > > > limits,
> > > > > > > > > > > instead of one per both, since it's easier to act upon
> > the
> > > > > alert
> > > > > > > when
> > > > > > > > > you
> > > > > > > > > > > need which one triggered it.
> > > > > > > > > > >
> > > > > > > > > > > 5. I didn't understand the "slowest_subscription" label
> > > used
> > > > > when
> > > > > > > > > > > describing the metric label. Can you please provide an
> > > > > > explanation?
> > > > > > > > > > >
> > > > > > > > > > > 6. I suggest writing a "High Level Design" section, and
> > add
> > > > > > > > everything
> > > > > > > > > > you
> > > > > > > > > > > need to know for this proposal, so I don't need to read
> > the
> > > > > > > > > > > implementation details below (code).
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Asaf
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Feb 22, 2023 at 4:52 PM 太上玄元道君 <
> > dao...@apache.org>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > >
> > > > > > > > > > > > I've started a PIP to discuss: PIP-248 Add backlog
> > > eviction
> > > > > > > metric
> > > > > > > > > > > >
> > > > > > > > > > > > ### Motivation:
> > > > > > > > > > > >
> > > > > > > > > > > > Pulsar has 2 configurations for the backlog eviction:
> > > > > > > > > > > > `backlogQuotaDefaultLimitBytes` and
> > > > > > > > `backlogQuotaDefaultLimitSecond`,
> > > > > > > > > > if
> > > > > > > > > > > > topic backlog reaches the threshold of any item,
> > backlog
> > > > > > eviction
> > > > > > > > > will
> > > > > > > > > > be
> > > > > > > > > > > > triggered.
> > > > > > > > > > > >
> > > > > > > > > > > > Before backlog eviction happens, we don't have a
> metric
> > > to
> > > > > > > monitor
> > > > > > > > > how
> > > > > > > > > > > long
> > > > > > > > > > > > that it can reaches the threshold.
> > > > > > > > > > > >
> > > > > > > > > > > > We can provide a progress bar metric to tell users
> some
> > > > > topics
> > > > > > is
> > > > > > > > > about
> > > > > > > > > > > to
> > > > > > > > > > > > trigger backlog eviction. And users can subscribe the
> > > alert
> > > > > to
> > > > > > > > > schedule
> > > > > > > > > > > > consumers.
> > > > > > > > > > > >
> > > > > > > > > > > > For more details, please read the PIP at
> > > > > > > > > > > > https://github.com/apache/pulsar/issues/19601
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Tao Jiuming
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
