> Need to replace (place link) with link.

I replaced the `Motivation` section with the text you suggested.

> We discussed adding the subscription name which triggered the time limit to
> Topics.getStats().
> Why?

Since we have `pulsar_storage_backlog_eviction_count`, I think we don't need to
expose the subscription name which triggered the backlog eviction.

> I have to run getStats(getEarliestTimeInBacklog=true) and it's way more
> expensive than the proposal above, since it needs to reach the earliest
> message for *each* subscription.

I don't think we need to avoid this cost: it is incurred only when the user
requests it. If the user does not set `getEarliestTimeInBacklog` to true, there
is no such overhead. We don't need to add complexity for the sake of very few
calls.

> Also a bit less accurate - you want to get the subscription cached that
> triggered it, using the same number to find it. Earliest backlog is
> accurate but if the configuration flag is off, it's not the same number as
> getStats.

Such problems do exist. There may be a large backlog when the user receives the
alert, and the backlog may already have shrunk by the time the endpoint
(Topics#getStats) is requested. There is a time difference between the two.
However, the alarm is only a notification; it is when the user requests the
endpoint that they may take action. I think it is reasonable to give users a
more accurate backlog before they act.

Thanks,
Tao Jiuming

Asaf Mesika <asaf.mes...@gmail.com> wrote on Tue, Mar 14, 2023 at 16:51:

> > Pulsar has a feature called backlog quota (place link)
>
> Need to replace (place link) with link.
>
> > 1. Find the backlog subscriptions
> > After received the alarm, users could request
> > Topics#getStats(topicName, true/false, true, true)
> > <https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1139>
> > to get the topic stats, and find which subscriptions are in backlog.
> > Pulsar exposed backlogSize and earliestMsgPublishTimeInBacklog in the
> > subscription level, and we will expose backlogQuotaSizeBytes and
> > backlogQuotaTimeSeconds in the topic level, so users could find which
> > subscriptions in backlog easily.
>
> We have forgotten the other comment.
> We discussed adding the subscription name which triggered the time limit to
> Topics.getStats().
> Why?
>
> I have to run getStats(getEarliestTimeInBacklog=true) and it's way more
> expensive than the proposal above, since it needs to reach the earliest
> message for *each* subscription.
> Also a bit less accurate - you want to get the subscription cached that
> triggered it, using the same number to find it. Earliest backlog is
> accurate but if the configuration flag is off, it's not the same number as
> getStats.
>
> Nice to have (not mandatory) additions:
>
> I would add before
>
> > 1. After readEntryComplete
> > <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java#L2780>,
> > cache its result:
>
> When this configuration flag is set to true, the broker does an I/O call
> by reading the oldest entry to get its write timestamp. Once we have that,
> we'll add caching to that value since we're going to use it for returning
> the age.
>
> I would add before:
>
> > slowestReaderTimeBasedBacklogQuotaCheck
> > <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentTopic.java#L2817>
> > is a totally in-memory method, we just need to cache the
>
> When this configuration flag is set to false, the check uses an estimate of
> the oldest entry timestamp, by taking the closing time of the ledger which
> the message is contained at.
>
> On Fri, Mar 10, 2023 at 8:29 AM 太上玄元道君 <dao...@apache.org> wrote:
>
> > I think yes, to avoid missing something, you can take a look if you have
> > time.
> >
> > Thanks,
> > Tao Jiuming
> >
> > Asaf Mesika <asaf.mes...@gmail.com> wrote on Thu, Mar 9, 2023 at 17:40:
> >
> > > Is the PIP updated with all comments?
> > >
> > > On Thu, Mar 9, 2023 at 8:59 AM 太上玄元道君 <dao...@apache.org> wrote:
> > >
> > > > > backlogQuotaLimitSize
> > > > > should be `backlogQuotaSizeBytes`
> > > > >
> > > > > backlogQuotaLimitTime
> > > > > should be `backlogQuotaTimeSeconds`
> > > > >
> > > > > So you need to rename the metric.
> > > > > "pulsar_storage_backlog_quota_count" -->
> > > > > `pulsar_storage_backlog_eviction_count`
> > > > >
> > > > > the topic's existing subscription.
> > > > > "subscription" --> "subscription*s*"
> > > > >
> > > > > Number of backlog quota happends.
> > > > > Number of times backlog evictions happened due to exceeding backlog
> > > > > quota (either time or size).
> > > >
> > > > Accepted, if there is no more need to change, I'll start the vote next
> > > > week.
> > > >
> > > > Thanks,
> > > > Tao Jiuming
> > > >
> > > > Asaf Mesika <asaf.mes...@gmail.com> wrote on Tue, Mar 7, 2023 at 00:02:
> > > >
> > > > > > Pulsar has a feature called backlog quota (place link).
> > > > >
> > > > > You need to place a link :)
> > > > >
> > > > > > Expose pulsar_storage_backlog_quota_count in the topic leve
> > > > >
> > > > > You already have "pulsar_storage_backlog_size", so why do you need this
> > > > > metric for?
> > > > >
> > > > > > backlogQuotaLimitSize
> > > > >
> > > > > should be `backlogQuotaSizeBytes`
> > > > >
> > > > > > backlogQuotaLimitTime
> > > > >
> > > > > should be `backlogQuotaTimeSeconds`
> > > > >
> > > > > What about goal no.4? Expose oldest unacknowledged message subscription
> > > > > name?
> > > > >
> > > > > IMO, metrics are like API - perhaps indicate the change there as well
> > > > >
> > > > > > Record the event when dropBacklogForSizeLimit
> > > > > > <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BacklogQuotaManager.java#L121>
> > > > > > or dropBacklogForTimeLimit
> > > > > > <https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BacklogQuotaManager.java#L194>
> > > > > > is going to invoked.
> > > > >
> > > > > Oh, now I get it.
> > > > > So you need to rename the metric.
> > > > > "pulsar_storage_backlog_quota_count" -->
> > > > > `pulsar_storage_backlog_eviction_count`
> > > > >
> > > > > > the topic's existing subscription.
> > > > >
> > > > > "subscription" --> "subscription*s*"
> > > > >
> > > > > > Number of backlog quota happends.
> > > > >
> > > > > Number of times backlog evictions happened due to exceeding backlog
> > > > > quota (either time or size).
> > > > >
> > > > > > 1. Find the backlog subscriptions
> > > > > > After received the alarm, users could request
> > > > > > Topics#getStats(topicName, true/false, true, true)
> > > > > > <https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1139>
> > > > > > to get the topic stats, and find which subscriptions are in backlog.
> > > > > > Pulsar exposed backlogSize and earliestMsgPublishTimeInBacklog in the
> > > > > > subscription level, and we will expose backlogQuotaLimitSize and
> > > > > > backlogQuotaLimitTime in the topic level, so users could find which
> > > > > > subscriptions in backlog easily.
> > > > >
> > > > > I wrote how it should be done IMO in a previous email.
> > > > >
> > > > > On Mon, Mar 6, 2023 at 1:20 PM 太上玄元道君 <dao...@apache.org> wrote:
> > > > >
> > > > > > Hi Asaf,
> > > > > > I've updated the PIP, PTAL
> > > > > >
> > > > > > Thanks,
> > > > > > Tao Jiuming
> > > > > >
> > > > > > Asaf Mesika <asaf.mes...@gmail.com> wrote on Sun, Mar 5, 2023 at 21:00:
> > > > > >
> > > > > > > On Thu, Mar 2, 2023 at 12:57 PM 太上玄元道君 <dao...@apache.org> wrote:
> > > > > > >
> > > > > > > > > I think you should fix this explanation:
> > > > > > > >
> > > > > > > > Thanks! I would like to copy the context you provide to the PIP
> > > > > > > > motivation, your description is more detailed, so developers don't
> > > > > > > > have to go through the code.
> > > > > > >
> > > > > > > Sure
> > > > > > >
> > > > > > > > > Today the quota is checked periodically, right? So that's how the
> > > > > > > > > operator knows the cost in terms of I/O is limited.
> > > > > > > > > Now you are adding one additional I/O per collection, every 1 min
> > > > > > > > > by default. That's a lot perhaps. How long is the check interval
> > > > > > > > > today?
> > > > > > > >
> > > > > > > > Actually, I don't want to introduce additional costs, I thought we
> > > > > > > > could cache its result, so that it won't introduce additional costs.
> > > > > > > > It may be that I did not make it clear in the PIP and caused this
> > > > > > > > misunderstanding, sorry.
> > > > > > >
> > > > > > > Ok, just to verify: You plan to modify the code that runs periodically
> > > > > > > the backlog quota check, so the result will be cached there? This way
> > > > > > > when you pull that information from that code every 1min to expose it
> > > > > > > as a metric it will have 0 I/O cost?
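
The caching approach being verified here could look roughly like the sketch below. It is only an illustration under stated assumptions: `BacklogAgeCache`, `update`, and `ageSeconds` are hypothetical names, not existing Pulsar classes or methods. The periodic backlog quota check stores the publish time it already paid I/O to obtain, and metric collection reads that value from memory.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not Pulsar code: keeps the result of the last periodic
// backlog quota check so that metric collection never triggers I/O.
final class BacklogAgeCache {
    private static final long UNKNOWN = -1L;

    // Publish time (epoch millis) of the oldest unacknowledged message,
    // as found by the most recent quota check.
    private final AtomicLong oldestPublishTimeMillis = new AtomicLong(UNKNOWN);

    // Called by the periodic quota check, which has already paid for the read.
    void update(long publishTimeMillis) {
        oldestPublishTimeMillis.set(publishTimeMillis);
    }

    // Called on every metric collection; reads memory only.
    // Returns -1 when no check has completed yet.
    long ageSeconds(long nowMillis) {
        long publishTime = oldestPublishTimeMillis.get();
        if (publishTime == UNKNOWN) {
            return UNKNOWN;
        }
        return Math.max(0L, (nowMillis - publishTime) / 1000L);
    }
}
```

With this shape the exposed age is only as fresh as the last quota check, which is the trade-off the thread accepts in exchange for zero additional I/O per scrape.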
> > > > > > >
> > > > > > > > > The user today can calculate quota used for size based limit, since
> > > > > > > > > there are two metrics that are exposed today on a topic level:
> > > > > > > > > "pulsar_storage_backlog_quota_limit" and "pulsar_storage_backlog_size".
> > > > > > > > > You can just divide the two to get a percentage.
> > > > > > > > > For the time-based limit, the only metric exposed today is quota
> > > > > > > > > itself, "pulsar_storage_backlog_quota_limit_time".
> > > > > > > >
> > > > > > > > I only noticed `pulsar_storage_backlog_size` but missed
> > > > > > > > `pulsar_storage_backlog_quota_limit` and
> > > > > > > > `pulsar_storage_backlog_quota_limit_time`. Many thanks for your reminder.
> > > > > > > >
> > > > > > > > So, in this condition, we already have the following topic-level metrics:
> > > > > > > > `pulsar_storage_backlog_size`: The total backlog size of the topics of
> > > > > > > > this topic owned by this broker (in bytes).
> > > > > > > > `pulsar_storage_backlog_quota_limit`: The total amount of the data in
> > > > > > > > this topic that limits the backlog quota (bytes).
> > > > > > > > `pulsar_storage_backlog_quota_limit_time`: The backlog quota limit in
> > > > > > > > time(seconds). (This metric does not exists in the doc, need to improve)
> > > > > > > >
> > > > > > > > We just need to add a new metric named
> > > > > > > > `pulsar_storage_earliest_msg_publish_time_in_backlog` in the topic-level
> > > > > > > > that indicates the publish time of the earliest message in the backlog.
> > > > > > > > So users could get `pulsar_backlog_size_quota_used_percentage` by
> > > > > > > > divide `pulsar_storage_backlog_size` and
> > > > > > > > `pulsar_storage_backlog_quota_limit`
> > > > > > > > (`pulsar_storage_backlog_size` / `pulsar_storage_backlog_quota_limit`),
> > > > > > > > and could get `pulsar_backlog_time_quota_used_percentage` by divide
> > > > > > > > `now - pulsar_storage_earliest_msg_publish_time_in_backlog` and
> > > > > > > > `pulsar_storage_backlog_quota_limit_time`
> > > > > > > > (`now - pulsar_storage_earliest_msg_publish_time_in_backlog` /
> > > > > > > > `pulsar_storage_backlog_quota_limit_time`).
> > > > > > >
> > > > > > > I think there is a problem with the name
> > > > > > > `pulsar_storage_earliest_msg_publish_time_in_backlog` in the topic-level:
> > > > > > > * First, I prefer exposing the age rather than the publish time.
> > > > > > > * Second, it's a bit hard to figure out the meaning of the earliest msg
> > > > > > >   in the backlog.
> > > > > > >
> > > > > > > Maybe `pulsar_storage_backlog_age_seconds`? In the explanation you can
> > > > > > > write: "The age (time passed since it was published) of the earliest
> > > > > > > unacknowledged message based on the topic's existing subscriptions" ?
> > > > > > >
> > > > > > > > The backlog quota time checker runs periodically, so we can cache its
> > > > > > > > result, so it won't lead to much costs.
> > > > > > > >
> > > > > > > > Pulsar also exposed subscription-level `backlogSize` and
> > > > > > > > `earliestMsgPublishTimeInBacklog` in Pulsar-Admin
> > > > > > > > <https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1139>
> > > > > > > > if `subscriptionBacklogSize` and `getEarliestTimeInBacklog` are true.
> > > > > > > > We can also expose `backlogQuotaLimiteSize` and `backlogQuotaLimitTime`
> > > > > > > > of the topic to PulsarAdmin.
> > > > > > >
> > > > > > > What is the relationship you see between Pulsar exposing
> > > > > > > subscriptionBacklogSize and earliestMsgPublishTimeInBacklog in
> > > > > > > subscription, to exposing the backlog quota limits in pulsar admin?
> > > > > > >
> > > > > > > Limits can be exposed to Pulsar Admin, since it has 0 cost associated
> > > > > > > with it.
> > > > > > > I think it's a good idea to do that.
> > > > > > > The quota usage can also be exposed to pulsar admin, since we pull that
> > > > > > > data from the backlog quota checker cache, so it has 0 cost as well.
> > > > > > >
> > > > > > > As we said in previous email we can also expose
> > > > > > > `backlogQuotaTimeOldestBacklogAgeSubscriptionName`
> > > > > > >
> > > > > > > > After users receive the backlog alert from metrics alerting systems,
> > > > > > > > they can get the topic name, then, they can request Topics#getStats
> > > > > > > > <https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1139>
> > > > > > > > to get which subscriptions are in the huge backlog.
> > > > > > >
> > > > > > > I agree users can use PulsarAdmin getStats for topic, with
> > > > > > > getEarliestTimeInBacklog=true to find the oldest subscription
> > > > > > > responsible for exceeding quota, but we can give them that information
> > > > > > > with 0 cost since we already have that subscription name cached (we
> > > > > > > spent the I/O to find out who that subscription is, let's just cache it
> > > > > > > and provide it).
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Tao Jiuming
> > > > > > > >
> > > > > > > > Asaf Mesika <asaf.mes...@gmail.com> wrote on Wed, Mar 1, 2023 at 23:42:
> > > > > > > >
> > > > > > > > > > Pulsar has 2 configurations for the backlog eviction
> > > > > > > > > > <https://pulsar.apache.org/docs/2.11.x/cookbooks-retention-expiry/#backlog-quotas>
> > > > > > > > > > : backlogQuotaDefaultLimitBytes and backlogQuotaDefaultLimitSecond.
> > > > > > > > > > By default, backlog eviction is disabled, and also, there is a field
> > > > > > > > > > named backlogQuotaMap in TopicPolicies
> > > > > > > > > > <https://github.com/apache/pulsar/blob/master/pulsar-common/src/main/java/org/apache/pulsar/common/policies/data/HierarchyTopicPolicies.java#L45>
> > > > > > > > > > /NamespaceSpacePolicies
> > > > > > > > > > <https://github.com/apache/pulsar/blob/master/pulsar-client-admin-api/src/main/java/org/apache/pulsar/common/policies/data/Policies.java#L41>
> > > > > > > > > > assists in controlling Topic/Namespace level backlog quota.
> > > > > > > > > >
> > > > > > > > > > If topic backlog reaches the threshold of any item, backlog eviction
> > > > > > > > > > will be triggered, Pulsar will move subscription's cursor to skip
> > > > > > > > > > unacknowledged messages.
> > > > > > > > > >
> > > > > > > > > > Before backlog eviction happens, we don't have a metric to monitor
> > > > > > > > > > how long that it can reaches the threshold.
> > > > > > > > >
> > > > > > > > > I think you should fix this explanation:
> > > > > > > > >
> > > > > > > > > In Pulsar, a subscription maintains a state of message acknowledged. A
> > > > > > > > > subscription backlog is the set of messages which are unacknowledged.
> > > > > > > > > A subscription backlog size is the sum of size of unacknowledged
> > > > > > > > > messages (in bytes).
> > > > > > > > > A topic can have many subscriptions.
> > > > > > > > > A topic backlog is defined as the backlog size of the subscription
> > > > > > > > > which has the oldest unacknowledged message. Since acknowledged
> > > > > > > > > messages can be interleaved with unacknowledged messages, calculating
> > > > > > > > > the exact size of that subscription can be expensive as it requires
> > > > > > > > > I/O operations to read the messages from the ledgers.
> > > > > > > > > For that reason, the topic backlog is actually defined to be the
> > > > > > > > > estimated backlog size of that subscription. It does so by summing
> > > > > > > > > the size of all the ledgers, starting from the current active one, up
> > > > > > > > > to the ledger which contains the oldest unacknowledged message (There
> > > > > > > > > is actually a faster way to calculate it, but this is the definition
> > > > > > > > > of the estimation).
> > > > > > > > >
> > > > > > > > > A topic backlog age is the age of the oldest unacknowledged message
> > > > > > > > > (in any subscription). If that message was written 30 minutes ago, its
> > > > > > > > > age is 30 minutes.
> > > > > > > > >
> > > > > > > > > Pulsar has a feature called backlog quota (place link). It allows the
> > > > > > > > > user to define a quota - in effect, a limit - which limits the topic
> > > > > > > > > backlog.
> > > > > > > > > There are two types of quotas:
> > > > > > > > > * Size based: The limit is for the topic backlog size (as we defined
> > > > > > > > >   above).
> > > > > > > > > * Time based: The limit is for the topic's backlog age (as we defined
> > > > > > > > >   above).
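
The estimated topic backlog size defined above can be illustrated with a small sketch. It assumes ledgers are listed oldest to newest with increasing IDs; `LedgerSize` and `estimatedBacklogBytes` are hypothetical stand-ins for illustration, not Pulsar's `LedgerInfo` or any broker method.

```java
import java.util.List;

// Hypothetical sketch, not Pulsar code: estimates a topic backlog by summing
// whole-ledger sizes instead of reading individual messages.
final class BacklogEstimate {
    // Illustrative stand-in for per-ledger metadata (ID and total size).
    static final class LedgerSize {
        final long ledgerId;
        final long sizeBytes;

        LedgerSize(long ledgerId, long sizeBytes) {
            this.ledgerId = ledgerId;
            this.sizeBytes = sizeBytes;
        }
    }

    // Sums the sizes of every ledger from the one holding the oldest
    // unacknowledged message through the current ledger. Assumes ledger IDs
    // grow over time, so "newer or equal" is an ID comparison.
    static long estimatedBacklogBytes(List<LedgerSize> ledgers, long oldestUnackedLedgerId) {
        long total = 0L;
        for (LedgerSize ledger : ledgers) {
            if (ledger.ledgerId >= oldestUnackedLedgerId) {
                total += ledger.sizeBytes;
            }
        }
        return total;
    }
}
```

Because whole ledgers are counted, the estimate can overstate the real backlog when acknowledged messages are interleaved with unacknowledged ones, which is the accuracy given up to avoid I/O.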
> > > > > > > > >
> > > > > > > > > Once a topic backlog exceeds either one of those limits, an action is
> > > > > > > > > taken upon messages written to the topic:
> > > > > > > > > * The producer write is placed on hold for a certain amount of time
> > > > > > > > >   before failing.
> > > > > > > > > * The producer write is failed
> > > > > > > > > * The subscriptions oldest unacknowledged messages will be acknowledged
> > > > > > > > >   in order until both the topic backlog size or age will fall inside
> > > > > > > > >   the limit (quota). The process is called backlog eviction (happens
> > > > > > > > >   every interval)
> > > > > > > > >
> > > > > > > > > The quotas can be defined as a default value for any topic, by using
> > > > > > > > > the following broker configuration keys: backlogQuotaDefaultLimitBytes,
> > > > > > > > > backlogQuotaDefaultLimitSecond. It can also be specified directly for
> > > > > > > > > all topics in a given namespace using the namespace policy, or a
> > > > > > > > > specific topic using a topic policy.
> > > > > > > > >
> > > > > > > > > The user today can calculate quota used for size based limit, since
> > > > > > > > > there are two metrics that are exposed today on a topic level:
> > > > > > > > > "pulsar_storage_backlog_quota_limit" and "pulsar_storage_backlog_size".
> > > > > > > > > You can just divide the two to get a percentage.
> > > > > > > > > For the time-based limit, the only metric exposed today is quota
> > > > > > > > > itself, "pulsar_storage_backlog_quota_limit_time".
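
The "divide the two" calculation can be written out as a short sketch. The class and method names are hypothetical, and treating a non-positive limit as "no quota configured" is an assumption of this example rather than something stated in the thread. The time-based variant needs the backlog age as an input, which is exactly the value that is not exposed as a metric today.

```java
// Hypothetical sketch, not Pulsar code: quota-used percentages derived from
// topic-level values such as pulsar_storage_backlog_size and
// pulsar_storage_backlog_quota_limit.
final class QuotaUsage {
    // Returns the percentage of the size quota in use, or -1 when the limit
    // is not positive (assumed here to mean that no quota is configured).
    static double sizeQuotaUsedPercent(long backlogSizeBytes, long quotaLimitBytes) {
        if (quotaLimitBytes <= 0) {
            return -1.0;
        }
        return 100.0 * backlogSizeBytes / quotaLimitBytes;
    }

    // Same calculation for the time quota; backlogAgeSeconds is the age of
    // the oldest unacknowledged message.
    static double timeQuotaUsedPercent(long backlogAgeSeconds, long quotaLimitSeconds) {
        if (quotaLimitSeconds <= 0) {
            return -1.0;
        }
        return 100.0 * backlogAgeSeconds / quotaLimitSeconds;
    }
}
```

For example, a 50-byte backlog against a 200-byte limit gives 25 percent, and a 30-minute-old backlog against a one-hour limit gives 50 percent.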
> > > > > > > > >
> > > > > > > > > ------------
> > > > > > > > >
> > > > > > > > > I would create two metrics:
> > > > > > > > >
> > > > > > > > > `pulsar_backlog_size_quota_used_percentage`
> > > > > > > > > `pulsar_backlog_time_quota_used_percentage`
> > > > > > > > >
> > > > > > > > > You would like to know what triggered the alert, hence two.
> > > > > > > > > It's not the quota percentage, it's the quota used percentage.
> > > > > > > > >
> > > > > > > > > ----------
> > > > > > > > >
> > > > > > > > > > It checks if the backlog size exceeds the threshold(
> > > > > > > > > > backlogQuotaDefaultLimitBytes), and it gets the current backlog size
> > > > > > > > > > by calculating LedgerInfo
> > > > > > > > > > <https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/proto/MLDataFormats.proto#L54>,
> > > > > > > > > > it will not lead to I/O.
> > > > > > > > >
> > > > > > > > > This is not correct.
> > > > > > > > > It checks against the topic / namespace policy, and if it doesn't
> > > > > > > > > exist, it falls back on the default configuration key mentioned above.
> > > > > > > > >
> > > > > > > > > > It checks if the backlog time exceeds the threshold(
> > > > > > > > > > backlogQuotaDefaultLimitSecond). If preciseTimeBasedBacklogQuotaCheck
> > > > > > > > > > is set to be true, it will read an entry from Bookkeeper, but the
> > > > > > > > > > default value is false, which means it gets the backlog time by
> > > > > > > > > > calculating LedgerInfo
> > > > > > > > > > <https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/proto/MLDataFormats.proto#L54>.
> > > > > > > > > > So in general, we don't need to worry about it will lead to I/O.
> > > > > > > > >
> > > > > > > > > I'm afraid of that.
> > > > > > > > > Today the quota is checked periodically, right? So that's how the
> > > > > > > > > operator knows the cost in terms of I/O is limited.
> > > > > > > > > Now you are adding one additional I/O per collection, every 1 min by
> > > > > > > > > default. That's a lot perhaps. How long is the check interval today?
> > > > > > > > >
> > > > > > > > > Perhaps in the backlog quota check, you can persist the check result,
> > > > > > > > > and use it? Persist the age that is.
> > > > > > > > >
> > > > > > > > > ------
> > > > > > > > >
> > > > > > > > > Regarding "slowest_subscription"
> > > > > > > > > I think the cost is too high, because the subscriptions will keep
> > > > > > > > > alternating, which can generate so many unique time series. Since
> > > > > > > > > Prometheus flush only every 2 hours, or any other TSDB, it will cost
> > > > > > > > > you too much.
> > > > > > > > >
> > > > > > > > > I suggest exposing the name via the topic stats. This way they can
> > > > > > > > > issue a REST call to grab that subscription name only when the alert
> > > > > > > > > fires.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Asaf
> > > > > > > > >
> > > > > > > > > On Tue, Feb 28, 2023 at 9:29 AM 太上玄元道君 <dao...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Asaf,
> > > > > > > > > > I've updated the PIP, PTAL
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Tao Jiuming
> > > > > > > > > >
> > > > > > > > > > Asaf Mesika <asaf.mes...@gmail.com> wrote on Sun, Feb 26, 2023 at 23:03:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > > Pulsar has 2 configurations for the backlog eviction:
> > > > > > > > > > > > backlogQuotaDefaultLimitBytes and backlogQuotaDefaultLimitSecond,
> > > > > > > > > > > > if topic backlog reaches the threshold of any item, backlog
> > > > > > > > > > > > eviction will be triggered.
> > > > > > > > > > >
> > > > > > > > > > > This seems like default values, not the actual values. Can you
> > > > > > > > > > > please provide an explanation in the PIP and link to read more:
> > > > > > > > > > > 1. Where do you define the backlog quota exactly? What is the
> > > > > > > > > > >    granularity (subscription?)
> > > > > > > > > > > 2. Is the backlog quota on by default? If so, what are the default
> > > > > > > > > > >    values?
> > > > > > > > > > >
> > > > > > > > > > > *Notes*
> > > > > > > > > > > 1. When the backlog quota limit is defined in Bytes, and you wish to
> > > > > > > > > > > know how close a subscription is to its bytes limit, you need to
> > > > > > > > > > > calculate the backlog size in bytes.
> > > > > > > > > > > From my understanding, there is an accurate calculation (which is
> > > > > > > > > > > costly in terms of I/O) and there is an estimate of it. I presume
> > > > > > > > > > > you would want to use the estimated one, is that correct?
> > > > > > > > > > > The backlog quota itself, uses the accurate or the estimated when it
> > > > > > > > > > > starts evicting entries (i.e. marking them as acknowledged)?
> > > > > > > > > > >
> > > > > > > > > > > 2. For the backlog limit specifying in time units, there is no
> > > > > > > > > > > estimate, as it must be calculated all the time (earliest
> > > > > > > > > > > unacknowledged message distance from now). How do you plan to
> > > > > > > > > > > calculate the current age of the earliest message without bearing
> > > > > > > > > > > that I/O cost on each metric calculation?
> > > > > > > > > > >
> > > > > > > > > > > 3. In the Goal section, you specify that your goal is to add a
> > > > > > > > > > > "proximity" metric.
> > > > > > > > > > > a) You must define that - what is proximity metric exactly? What are
> > > > > > > > > > > its units? How are you planning to calculate it?
> > > > > > > > > > > b) Proximity is not a good term IMO. I personally have never seen
> > > > > > > > > > > this term used in software systems, unless it's in the
> > > > > > > > > > > aviation/space industry. Once you explain (a) I hope I can help
> > > > > > > > > > > provide alternative names.
> > > > > > > > > > >
> > > > > > > > > > > 4. Maybe we should provide the used quota percentage for both
> > > > > > > > > > > limits, instead of one per both, since it's easier to act upon the
> > > > > > > > > > > alert when you know which one triggered it.
> > > > > > > > > > >
> > > > > > > > > > > 5. I didn't understand the "slowest_subscription" label used when
> > > > > > > > > > > describing the metric label. Can you please provide an explanation?
> > > > > > > > > > >
> > > > > > > > > > > 6. I suggest writing a "High Level Design" section, and add
> > > > > > > > > > > everything you need to know for this proposal, so I don't need to
> > > > > > > > > > > read the implementation details below (code).
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > >
> > > > > > > > > > > Asaf
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Feb 22, 2023 at 4:52 PM 太上玄元道君 <dao...@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > >
> > > > > > > > > > > > I've started a PIP to discuss: PIP-248 Add backlog eviction metric
> > > > > > > > > > > >
> > > > > > > > > > > > ### Motivation:
> > > > > > > > > > > >
> > > > > > > > > > > > Pulsar has 2 configurations for the backlog eviction:
> > > > > > > > > > > > `backlogQuotaDefaultLimitBytes` and `backlogQuotaDefaultLimitSecond`,
> > > > > > > > > > > > if topic backlog reaches the threshold of any item, backlog
> > > > > > > > > > > > eviction will be triggered.
> > > > > > > > > > > >
> > > > > > > > > > > > Before backlog eviction happens, we don't have a metric to monitor
> > > > > > > > > > > > how long that it can reaches the threshold.
> > > > > > > > > > > >
> > > > > > > > > > > > We can provide a progress bar metric to tell users some topics is
> > > > > > > > > > > > about to trigger backlog eviction. And users can subscribe the
> > > > > > > > > > > > alert to schedule consumers.
> > > > > > > > > > > >
> > > > > > > > > > > > For more details, please read the PIP at
> > > > > > > > > > > > https://github.com/apache/pulsar/issues/19601
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Tao Jiuming