> Pulsar has a feature called backlog quota (place link)

Need to replace (place link) with link.

> 1. Find the backlog subscriptions
> After received the alarm, users could request Topics#getStats(topicName, true/false, true, true)
> <https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1139>
> to get the topic stats, and find which subscriptions are in backlog.
> Pulsar exposed backlogSize and earliestMsgPublishTimeInBacklog in the subscription level, and we will expose backlogQuotaSizeBytes and backlogQuotaTimeSeconds in the topic level, so users could find which subscriptions in backlog easily.

We have forgotten the other comment. We discussed adding the subscription name which triggered the time limit to Topics.getStats().

Why? I have to run getStats(getEarliestTimeInBacklog=true) and it's way more expensive than the proposal above, since it needs to reach the earliest message for *each* subscription. Also a bit less accurate - you want to get the subscription cached that triggered it, using the same number to find it. Earliest backlog is accurate but if the configuration flag is off, it's not the same number as getStats.

Nice to have (not mandatory) additions:

I would add before

> 1. After readEntryComplete
> <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java#L2780>,
> cache its result:

When this configuration flag is set to true, the broker does an I/O call by reading the oldest entry to get its write timestamp. Once we have that, we'll add caching to that value since we're going to use it for returning the age.

I would add before:

> slowestReaderTimeBasedBacklogQuotaCheck
> <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java#L2817>
> is a totally in-memory method, we just need to cache the

When this configuration flag is set to false, the check uses an estimate of the oldest entry timestamp, by taking the closing time of the ledger which the message is contained at.

On Fri, Mar 10, 2023 at 8:29 AM 太上玄元道君 <dao...@apache.org> wrote:

> I think yes, to avoid missing something, you can take a look if you have time.
>
> Thanks,
> Tao Jiuming
>
> Asaf Mesika <asaf.mes...@gmail.com> wrote on Thu, Mar 9, 2023 at 17:40:
>
> > Is the PIP updated with all comments?
> >
> > On Thu, Mar 9, 2023 at 8:59 AM 太上玄元道君 <dao...@apache.org> wrote:
> >
> > > > backlogQuotaLimitSize
> > > > should be `backlogQuotaSizeBytes`
> > > >
> > > > backlogQuotaLimitTime
> > > > should be `backlogQuotaTimeSeconds`
> > > >
> > > > So you need to rename the metric.
> > > > "pulsar_storage_backlog_quota_count" --> `pulsar_storage_backlog_eviction_count`
> > > >
> > > > the topic's existing subscription.
> > > > "subscription" --> "subscription*s*"
> > > >
> > > > Number of backlog quota happends.
> > > > Number of times backlog evictions happened due to exceeding backlog quota (either time or size).
> > >
> > > Accepted, if there is no more need to change, I'll start the vote next week.
> > >
> > > Thanks,
> > > Tao Jiuming
> > >
> > > Asaf Mesika <asaf.mes...@gmail.com> wrote on Tue, Mar 7, 2023 at 00:02:
> > >
> > > > > Pulsar has a feature called backlog quota (place link).
> > > >
> > > > You need to place a link :)
> > > >
> > > > Expose pulsar_storage_backlog_quota_count in the topic level
> > > >
> > > > You already have "pulsar_storage_backlog_size", so why do you need this metric for?
> > > >
> > > > backlogQuotaLimitSize
> > > >
> > > > should be `backlogQuotaSizeBytes`
> > > >
> > > > backlogQuotaLimitTime
> > > >
> > > > should be `backlogQuotaTimeSeconds`
> > > >
> > > > What about goal no.4? Expose oldest unacknowledged message subscription name?
> > > >
> > > > IMO, metrics are like API - perhaps indicate the change there as well
> > > >
> > > > > Record the event when dropBacklogForSizeLimit
> > > > > <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BacklogQuotaManager.java#L121>
> > > > > or dropBacklogForTimeLimit
> > > > > <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BacklogQuotaManager.java#L194>
> > > > > is going to be invoked.
> > > >
> > > > Oh, now I get it.
> > > > So you need to rename the metric.
> > > > "pulsar_storage_backlog_quota_count" --> `pulsar_storage_backlog_eviction_count`
> > > >
> > > > > the topic's existing subscription.
> > > >
> > > > "subscription" --> "subscription*s*"
> > > >
> > > > Number of backlog quota happends.
> > > >
> > > > Number of times backlog evictions happened due to exceeding backlog quota (either time or size).
> > > >
> > > > > 1. Find the backlog subscriptions
> > > > > After received the alarm, users could request Topics#getStats(topicName, true/false, true, true)
> > > > > <https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1139>
> > > > > to get the topic stats, and find which subscriptions are in backlog.
> > > > > Pulsar exposed backlogSize and earliestMsgPublishTimeInBacklog in the subscription level, and we will expose backlogQuotaLimitSize and backlogQuotaLimitTime in the topic level, so users could find which subscriptions in backlog easily.
> > > >
> > > > I wrote how it should be done IMO in a previous email.
> > > >
> > > > On Mon, Mar 6, 2023 at 1:20 PM 太上玄元道君 <dao...@apache.org> wrote:
> > > >
> > > > > Hi Asaf,
> > > > > I've updated the PIP, PTAL
> > > > >
> > > > > Thanks,
> > > > > Tao Jiuming
> > > > >
> > > > > Asaf Mesika <asaf.mes...@gmail.com> wrote on Sun, Mar 5, 2023 at 21:00:
> > > > >
> > > > > > On Thu, Mar 2, 2023 at 12:57 PM 太上玄元道君 <dao...@apache.org> wrote:
> > > > > >
> > > > > > > > I think you should fix this explanation:
> > > > > > >
> > > > > > > Thanks! I would like to copy the context you provide to the PIP motivation, your description is more detailed, so developers don't have to go through the code.
> > > > > >
> > > > > > Sure
> > > > > >
> > > > > > > > Today the quota is checked periodically, right? So that's how the operator knows the cost in terms of I/O is limited.
> > > > > > > > Now you are adding one additional I/O per collection, every 1 min by default. That's a lot perhaps. How long is the check interval today?
> > > > > > >
> > > > > > > Actually, I don't want to introduce additional costs, I thought we could cache its result, so that it won't introduce additional costs.
> > > > > > > It may be that I did not make it clear in the PIP and caused this misunderstanding, sorry.
> > > > > >
> > > > > > Ok, just to verify: You plan to modify the code that runs periodically the backlog quota check, so the result will be cached there?
> > > > > > This way when you pull that information from that code every 1min to expose it as a metric it will have 0 I/O cost?
> > > > > >
> > > > > > > > The user today can calculate quota used for size based limit, since there are two metrics that are exposed today on a topic level: "pulsar_storage_backlog_quota_limit" and "pulsar_storage_backlog_size". You can just divide the two to get a percentage.
> > > > > > > > For the time-based limit, the only metric exposed today is quota itself, "pulsar_storage_backlog_quota_limit_time".
> > > > > > >
> > > > > > > I only noticed `pulsar_storage_backlog_size` but missed `pulsar_storage_backlog_quota_limit` and `pulsar_storage_backlog_quota_limit_time`. Many thanks for your reminder.
> > > > > > >
> > > > > > > So, in this condition, we already have the following topic-level metrics:
> > > > > > > `pulsar_storage_backlog_size`: The total backlog size of the topics of this topic owned by this broker (in bytes).
> > > > > > > `pulsar_storage_backlog_quota_limit`: The total amount of the data in this topic that limits the backlog quota (bytes).
> > > > > > > `pulsar_storage_backlog_quota_limit_time`: The backlog quota limit in time (seconds). (This metric does not exist in the doc, need to improve)
> > > > > > >
> > > > > > > We just need to add a new metric named `pulsar_storage_earliest_msg_publish_time_in_backlog` in the topic-level that indicates the publish time of the earliest message in the backlog.
> > > > > > > So users could get `pulsar_backlog_size_quota_used_percentage` by dividing `pulsar_storage_backlog_size` by `pulsar_storage_backlog_quota_limit` (`pulsar_storage_backlog_size` / `pulsar_storage_backlog_quota_limit`),
> > > > > > > and could get `pulsar_backlog_time_quota_used_percentage` by dividing `now - pulsar_storage_earliest_msg_publish_time_in_backlog` by `pulsar_storage_backlog_quota_limit_time` ((`now - pulsar_storage_earliest_msg_publish_time_in_backlog`) / `pulsar_storage_backlog_quota_limit_time`).
> > > > > >
> > > > > > I think there is a problem with the name `pulsar_storage_earliest_msg_publish_time_in_backlog` in the topic-level:
> > > > > > * First, I prefer exposing the age rather than the publish time.
> > > > > > * Second, it's a bit hard to figure out the meaning of the earliest msg in the backlog.
> > > > > >
> > > > > > Maybe `pulsar_storage_backlog_age_seconds`? In the explanation you can write: "The age (time passed since it was published) of the earliest unacknowledged message based on the topic's existing subscriptions" ?
> > > > > >
> > > > > > > The backlog quota time checker runs periodically, so we can cache its result, so it won't lead to much costs.
> > > > > > >
> > > > > > > Pulsar also exposed subscription-level `backlogSize` and `earliestMsgPublishTimeInBacklog` in Pulsar-Admin
> > > > > > > <https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1139>
> > > > > > > if `subscriptionBacklogSize` and `getEarliestTimeInBacklog` are true.
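
To make the two ratios quoted above concrete, here is a minimal sketch of the arithmetic. It is not Pulsar code: the function names are hypothetical, and treating a missing or non-positive limit as "no quota configured" is an assumption made for the example.

```python
import time


def size_quota_used_percentage(backlog_size_bytes, quota_limit_bytes):
    """Size-quota usage in percent: backlog size divided by the size limit.

    Returns None when no size quota is configured (assumed here to be a
    missing or non-positive limit).
    """
    if quota_limit_bytes is None or quota_limit_bytes <= 0:
        return None
    return 100.0 * backlog_size_bytes / quota_limit_bytes


def time_quota_used_percentage(earliest_publish_time_s, quota_limit_time_s, now_s=None):
    """Time-quota usage in percent: backlog age divided by the time limit.

    The backlog age is `now - earliest publish time`, both in seconds.
    Returns None when no time quota is configured (same assumption as above).
    """
    if quota_limit_time_s is None or quota_limit_time_s <= 0:
        return None
    if now_s is None:
        now_s = time.time()
    age_s = max(0.0, now_s - earliest_publish_time_s)
    return 100.0 * age_s / quota_limit_time_s
```

Keeping the two values separate matches the point made elsewhere in the thread: an alert is easier to act on when you can see which limit triggered it.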
> > > > > > > We can also expose `backlogQuotaLimitSize` and `backlogQuotaLimitTime` of the topic to PulsarAdmin.
> > > > > >
> > > > > > What is the relationship you see between Pulsar exposing subscriptionBacklogSize and earliestMsgPublishTimeInBacklog in subscription, to exposing the backlog quota limits in pulsar admin?
> > > > > >
> > > > > > Limits can be exposed to Pulsar Admin, since it has 0 cost associated with it.
> > > > > > I think it's a good idea to do that.
> > > > > > The quota usage can also be exposed to pulsar admin, since we pull that data from the backlog quota checker cache, so it has 0 cost as well.
> > > > > >
> > > > > > As we said in previous email we can also expose `backlogQuotaTimeOldestBacklogAgeSubscriptionName`
> > > > > >
> > > > > > > After users receive the backlog alert from metrics alerting systems, they can get the topic name, then, they can request Topics#getStats
> > > > > > > <https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1139>
> > > > > > > to get which subscriptions are in the huge backlog.
> > > > > >
> > > > > > I agree users can use PulsarAdmin getStats for topic, with getEarliestTimeInBacklog=true to find the oldest subscription responsible for exceeding quota, but we can give them that information with 0 cost since we already have that subscription name cached (we spent the I/O to find out who that subscription is, let's just cache it and provide it).
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Tao Jiuming
> > > > > > >
> > > > > > > Asaf Mesika <asaf.mes...@gmail.com> wrote on Wed, Mar 1, 2023 at 23:42:
> > > > > > >
> > > > > > > > > Pulsar has 2 configurations for the backlog eviction
> > > > > > > > > <https://pulsar.apache.org/docs/2.11.x/cookbooks-retention-expiry/#backlog-quotas>:
> > > > > > > > > backlogQuotaDefaultLimitBytes and backlogQuotaDefaultLimitSecond.
> > > > > > > > > By default, backlog eviction is disabled, and also, there is a field named backlogQuotaMap in TopicPolicies
> > > > > > > > > <https://github.com/apache/pulsar/blob/master/pulsar-common/src/main/java/org/apache/pulsar/common/policies/data/HierarchyTopicPolicies.java#L45>
> > > > > > > > > /NamespaceSpacePolicies
> > > > > > > > > <https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/common/policies/data/Policies.java#L41>
> > > > > > > > > assists in controlling Topic/Namespace level backlog quota.
> > > > > > > > >
> > > > > > > > > If topic backlog reaches the threshold of any item, backlog eviction will be triggered, Pulsar will move subscription's cursor to skip unacknowledged messages.
> > > > > > > > >
> > > > > > > > > Before backlog eviction happens, we don't have a metric to monitor how long that it can reaches the threshold.
> > > > > > > >
> > > > > > > > I think you should fix this explanation:
> > > > > > > >
> > > > > > > > In Pulsar, a subscription maintains a state of message acknowledged. A subscription backlog is the set of messages which are unacknowledged.
> > > > > > > > A subscription backlog size is the sum of size of unacknowledged messages (in bytes).
> > > > > > > > A topic can have many subscriptions.
> > > > > > > > A topic backlog is defined as the backlog size of the subscription which has the oldest unacknowledged message. Since acknowledged messages can be interleaved with unacknowledged messages, calculating the exact size of that subscription can be expensive as it requires I/O operations to read the messages from the ledgers.
> > > > > > > > For that reason, the topic backlog is actually defined to be the estimated backlog size of that subscription. It does so by summing the size of all the ledgers, starting from the current active one, up to the ledger which contains the oldest unacknowledged message (There is actually a faster way to calculate it, but this is the definition of the estimation).
> > > > > > > >
> > > > > > > > A topic backlog age is the age of the oldest unacknowledged message (in any subscription). If that message was written 30 minutes ago, its age is 30 minutes.
> > > > > > > >
> > > > > > > > Pulsar has a feature called backlog quota (place link).
> > > > > > > > It allows the user to define a quota - in effect, a limit - which limits the topic backlog.
> > > > > > > > There are two types of quotas:
> > > > > > > > * Size based: The limit is for the topic backlog size (as we defined above).
> > > > > > > > * Time based: The limit is for the topic's backlog age (as we defined above).
> > > > > > > >
> > > > > > > > Once a topic backlog exceeds either one of those limits, an action is taken upon messages written to the topic:
> > > > > > > > * The producer write is placed on hold for a certain amount of time before failing.
> > > > > > > > * The producer write is failed
> > > > > > > > * The subscriptions' oldest unacknowledged messages will be acknowledged in order until both the topic backlog size and age fall inside the limit (quota). The process is called backlog eviction (happens every interval)
> > > > > > > >
> > > > > > > > The quotas can be defined as a default value for any topic, by using the following broker configuration keys: backlogQuotaDefaultLimitBytes, backlogQuotaDefaultLimitSecond. It can also be specified directly for all topics in a given namespace using the namespace policy, or a specific topic using a topic policy.
> > > > > > > >
> > > > > > > > The user today can calculate quota used for size based limit, since there are two metrics that are exposed today on a topic level: "pulsar_storage_backlog_quota_limit" and "pulsar_storage_backlog_size". You can just divide the two to get a percentage.
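
The estimated topic backlog size defined in the quoted explanation (sum the sizes of all ledgers from the one holding the oldest unacknowledged message up to the current active ledger) can be sketched as follows. This only illustrates the definition and is not the broker's implementation: the `Ledger` type and the function name are hypothetical, and the ledger list is assumed to be ordered oldest to newest with the active ledger last.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ledger:
    ledger_id: int
    size_bytes: int


def estimated_backlog_size(ledgers: List[Ledger], oldest_unacked_ledger_id: int) -> int:
    """Sum ledger sizes from the ledger holding the oldest unacknowledged
    message through the last (active) ledger.

    `ledgers` is assumed to be ordered oldest to newest. Only ledger
    metadata is used, so no entries have to be read. Acknowledged messages
    inside those ledgers are still counted, which is why this is an
    estimate rather than the exact backlog size.
    """
    total = 0
    counting = False
    for ledger in ledgers:
        if ledger.ledger_id == oldest_unacked_ledger_id:
            counting = True
        if counting:
            total += ledger.size_bytes
    return total
```

For example, with ledgers of 100, 200 and 50 bytes and the oldest unacknowledged message in the second one, the estimate is 250 bytes even if some messages in that range were already acknowledged.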
> > > > > > > > For the time-based limit, the only metric exposed today is quota itself, "pulsar_storage_backlog_quota_limit_time".
> > > > > > > >
> > > > > > > > ------------
> > > > > > > >
> > > > > > > > I would create two metrics:
> > > > > > > >
> > > > > > > > `pulsar_backlog_size_quota_used_percentage`
> > > > > > > > `pulsar_backlog_time_quota_used_percentage`
> > > > > > > >
> > > > > > > > You would like to know what triggered the alert, hence two.
> > > > > > > > It's not the quota percentage, it's the quota used percentage.
> > > > > > > >
> > > > > > > > ----------
> > > > > > > >
> > > > > > > > > It checks if the backlog size exceeds the threshold(backlogQuotaDefaultLimitBytes), and it gets the current backlog size by calculating LedgerInfo
> > > > > > > > > <https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/proto/MLDataFormats.proto#L54>,
> > > > > > > > > it will not lead to I/O.
> > > > > > > >
> > > > > > > > This is not correct.
> > > > > > > > It checks against the topic / namespace policy, and if it doesn't exist, it falls back on the default configuration key mentioned above.
> > > > > > > >
> > > > > > > > > It checks if the backlog time exceeds the threshold(backlogQuotaDefaultLimitSecond).
> > > > > > > > > If preciseTimeBasedBacklogQuotaCheck is set to be true, it will read an entry from Bookkeeper, but the default value is false, which means it gets the backlog time by calculating LedgerInfo
> > > > > > > > > <https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/proto/MLDataFormats.proto#L54>.
> > > > > > > > > So in general, we don't need to worry about it will lead to I/O.
> > > > > > > >
> > > > > > > > I'm afraid of that.
> > > > > > > > Today the quota is checked periodically, right? So that's how the operator knows the cost in terms of I/O is limited.
> > > > > > > > Now you are adding one additional I/O per collection, every 1 min by default. That's a lot perhaps. How long is the check interval today?
> > > > > > > >
> > > > > > > > Perhaps in the backlog quota check, you can persist the check result, and use it? Persist the age that is.
> > > > > > > >
> > > > > > > > ------
> > > > > > > >
> > > > > > > > Regarding "slowest_subscription"
> > > > > > > > I think the cost is too high, because the subscriptions will keep alternating, which can generate so many unique time series. Since Prometheus, or any other TSDB, flushes only every 2 hours, it will cost you too much.
> > > > > > > >
> > > > > > > > I suggest exposing the name via the topic stats. This way they can issue a REST call to grab that subscription name only when the alert fires.
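
The "persist the check result and use it" idea suggested above can be sketched as a small cache that the periodic quota check writes and that both the metrics collection and the topic stats read, so neither of the latter triggers I/O. This is a sketch under assumptions, not the PIP's implementation: the class, method, and field names are all hypothetical.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BacklogAgeSnapshot:
    oldest_backlog_age_seconds: float  # age measured when the check ran
    subscription_name: str             # subscription holding the oldest message
    checked_at_s: float                # time of the check, in seconds


class BacklogQuotaCheckCache:
    """Holds the result of the most recent periodic backlog quota check."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._snapshot: Optional[BacklogAgeSnapshot] = None

    def record_check(self, oldest_publish_time_s: float,
                     subscription_name: str, now_s: float) -> None:
        # Called by the periodic check, which already paid for finding the
        # oldest unacknowledged message; only the result is stored here.
        age_s = max(0.0, now_s - oldest_publish_time_s)
        snapshot = BacklogAgeSnapshot(age_s, subscription_name, now_s)
        with self._lock:
            self._snapshot = snapshot

    def latest(self) -> Optional[BacklogAgeSnapshot]:
        # Called on metrics collection or a stats request: a memory read only.
        with self._lock:
            return self._snapshot
```

With this shape, the age can be exported as a topic-level metric while the subscription name is returned only through topic stats, which avoids adding a high-cardinality label to the time series.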
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Asaf
> > > > > > > >
> > > > > > > > On Tue, Feb 28, 2023 at 9:29 AM 太上玄元道君 <dao...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Hi Asaf,
> > > > > > > > > I've updated the PIP, PTAL
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Tao Jiuming
> > > > > > > > >
> > > > > > > > > Asaf Mesika <asaf.mes...@gmail.com> wrote on Sun, Feb 26, 2023 at 23:03:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > > Pulsar has 2 configurations for the backlog eviction: backlogQuotaDefaultLimitBytes and backlogQuotaDefaultLimitSecond, if topic backlog reaches the threshold of any item, backlog eviction will be triggered.
> > > > > > > > > >
> > > > > > > > > > This seems like default values, not the actual values. Can you please provide an explanation in the PIP and link to read more:
> > > > > > > > > > 1. Where do you define the backlog quota exactly? What is the granularity (subscription?)
> > > > > > > > > > 2. Is the backlog quota on by default? If so, what are the default values?
> > > > > > > > > >
> > > > > > > > > > *Notes*
> > > > > > > > > > 1. When the backlog quota limit is defined in Bytes, and you wish to know how close a subscription is to its bytes limit, you need to calculate the backlog size in bytes.
> > > > > > > > > > From my understanding, there is an accurate calculation (which is costly in terms of I/O) and there is an estimate of it. I presume you would want to use the estimated one, is that correct?
> > > > > > > > > > The backlog quota itself, uses the accurate or the estimated when it starts evicting entries (i.e. marking them as acknowledged)?
> > > > > > > > > >
> > > > > > > > > > 2. For the backlog limit specifying in time units, there is no estimate, as it must be calculated all the time (earliest unacknowledged message distance from now). How do you plan to calculate the current age of the earliest message without bearing that I/O cost on each metric calculation?
> > > > > > > > > >
> > > > > > > > > > 3. In the Goal section, you specify that your goal is to add a "proximity" metric.
> > > > > > > > > > a) You must define that - what is proximity metric exactly? What are its units? How are you planning to calculate it?
> > > > > > > > > > b) Proximity is not a good term IMO. I personally have never seen this term used in software systems, unless it's in the aviation/space industry. Once you explain (a) I hope I can help provide alternative names.
> > > > > > > > > >
> > > > > > > > > > 4. Maybe we should provide the used quota percentage for both limits, instead of one per both, since it's easier to act upon the alert when you need which one triggered it.
> > > > > > > > > >
> > > > > > > > > > 5. I didn't understand the "slowest_subscription" label used when describing the metric label. Can you please provide an explanation?
> > > > > > > > > >
> > > > > > > > > > 6. I suggest writing a "High Level Design" section, and add everything you need to know for this proposal, so I don't need to read the implementation details below (code).
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Asaf
> > > > > > > > > >
> > > > > > > > > > On Wed, Feb 22, 2023 at 4:52 PM 太上玄元道君 <dao...@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > I've started a PIP to discuss: PIP-248 Add backlog eviction metric
> > > > > > > > > > >
> > > > > > > > > > > ### Motivation:
> > > > > > > > > > >
> > > > > > > > > > > Pulsar has 2 configurations for the backlog eviction: `backlogQuotaDefaultLimitBytes` and `backlogQuotaDefaultLimitSecond`, if topic backlog reaches the threshold of any item, backlog eviction will be triggered.
> > > > > > > > > > >
> > > > > > > > > > > Before backlog eviction happens, we don't have a metric to monitor how long that it can reaches the threshold.
> > > > > > > > > > >
> > > > > > > > > > > We can provide a progress bar metric to tell users some topics is about to trigger backlog eviction. And users can subscribe the alert to schedule consumers.
> > > > > > > > > > >
> > > > > > > > > > > For more details, please read the PIP at https://github.com/apache/pulsar/issues/19601
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Tao Jiuming