Hi,

I looked at this again today and I was wondering how it would compare against the lag reported by consumers. Could we use the kafka.consumer:type=consumer-fetch-manager-metrics,partition="{partition}",topic="{topic}",client-id="{client-id}" metric that already exists? If not, let's add this option to the rejected alternatives and explain why.
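For comparison, here is a rough, untested sketch of how that per-partition consumer lag could be read over JMX. The JMX URL, client-id, topic and partition values are placeholders, and I'm assuming the per-partition "records-lag" attribute on the MBean named above:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ConsumerLagProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint of the JVM running the consumer.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Per-partition fetch lag reported by the consumer, following the
            // MBean name pattern quoted above (placeholder tag values).
            ObjectName lagMBean = new ObjectName(
                    "kafka.consumer:type=consumer-fetch-manager-metrics,"
                    + "partition=\"0\",topic=\"my-topic\",client-id=\"my-client\"");
            Object recordsLag = mbs.getAttribute(lagMBean, "records-lag");
            System.out.println("records-lag = " + recordsLag);
        }
    }
}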
Thanks,
Mickael

On Mon, Jul 29, 2024 at 4:26 PM Mickael Maison <mickael.mai...@gmail.com> wrote:
>
> Hi,
>
> > My thinking is that any partition could go stale if there are no records
> > being produced into it.
> 1. Why would the value become stale if there are no new records? The
> lag should stay the same, no?
>
> > If enough of such partitions are present and are owned by a single MM task,
> > an OOM could happen.
> 2. We already have a dozen metrics per partition [0]. Why do you
> think adding a few more would cause OutOfMemory errors?
> Each task should only emit metrics for partitions it owns.
>
> > Regarding the scenario where the TTL value is lower than the refresh
> > interval - I believe that this is an edge case that we need to document and
> > prevent against, for example either failing to start on such a combination
> > or resorting to a default value that would satisfy the constraint and
> > logging an error.
> 3. Can you add the behavior you propose to the KIP?
>
> Thanks,
> Mickael
>
> 0: https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorSourceMetrics.java#L71-L106
>
> On Wed, May 22, 2024 at 9:18 PM Elxan Eminov <elxanemino...@gmail.com> wrote:
> >
> > Hey Mickael,
> > Just checking to see if you have any thoughts on this.
> > Thanks!
> >
> > On Thu, 11 Apr 2024 at 15:11, Elxan Eminov <elxanemino...@gmail.com> wrote:
> > >
> > > Hi Mickael!
> > > Any thoughts on this?
> > > Thanks!
> > >
> > > On Wed, 3 Apr 2024 at 13:21, Elxan Eminov <elxanemino...@gmail.com> wrote:
> > >
> > >> Hi Mickael,
> > >> Thanks for your response and apologies for a huge delay in mine.
> > >>
> > >> My thinking is that any partition could go stale if there are no records
> > >> being produced into it. If enough of such partitions are present and are
> > >> owned by a single MM task, an OOM could happen.
> > >>
> > >> Regarding the scenario where the TTL value is lower than the refresh
> > >> interval - I believe that this is an edge case that we need to document and
> > >> prevent against, for example either failing to start on such a combination
> > >> or resorting to a default value that would satisfy the constraint and
> > >> logging an error.
> > >>
> > >> Thanks,
> > >> Elkhan
> > >>
> > >> On Thu, 8 Feb 2024 at 14:17, Mickael Maison <mickael.mai...@gmail.com> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> Thanks for the updates.
> > >>> I'm wondering whether we really need the TTL eviction mechanism. The
> > >>> motivation is to "avoid storing stale LRO entries which can cause an
> > >>> eventual OOM error". How could it contain stale entries? I would
> > >>> expect its cache to only contain entries for partitions assigned to
> > >>> the task that owns it. Also, what is the expected behavior if there's
> > >>> no available LRO in the cache? If we keep this mechanism, what happens
> > >>> if its value is lower than
> > >>> replication.record.lag.metric.refresh.interval?
> > >>>
> > >>> Thanks,
> > >>> Mickael
> > >>>
> > >>> On Mon, Feb 5, 2024 at 5:23 PM Elxan Eminov <elxanemino...@gmail.com> wrote:
> > >>> >
> > >>> > Hi Mickael!
> > >>> > Any further thoughts on this?
> > >>> >
> > >>> > Thanks,
> > >>> > Elkhan
> > >>> >
> > >>> > On Thu, 18 Jan 2024 at 11:53, Mickael Maison <mickael.mai...@gmail.com> wrote:
> > >>> > >
> > >>> > > Hi Elxan,
> > >>> > >
> > >>> > > Thanks for the updates.
> > >>> > >
> > >>> > > We use dots to separate words in configuration names, so I think
> > >>> > > replication.offset.lag.metric.last-replicated-offset.ttl should be
> > >>> > > named replication.offset.lag.metric.last.replicated.offset.ttl instead.
> > >>> > >
> > >>> > > About the names of the metrics, fair enough if you prefer keeping the
> > >>> > > replication prefix. Out of the alternatives you mentioned, I think I
> > >>> > > prefer replication-record-lag. I think the metric and configuration
> > >>> > > names should match too. Let's see what the others think about it.
> > >>> > >
> > >>> > > Thanks,
> > >>> > > Mickael
> > >>> > >
> > >>> > > On Mon, Jan 15, 2024 at 9:50 PM Elxan Eminov <elxanemino...@gmail.com> wrote:
> > >>> > > >
> > >>> > > > Apologies, forgot to reply on your last comment about the metric name.
> > >>> > > > I believe both replication-lag and record-lag are a little too abstract -
> > >>> > > > what do you think about either leaving it as replication-offset-lag or
> > >>> > > > renaming it to replication-record-lag?
> > >>> > > >
> > >>> > > > Thanks
> > >>> > > >
> > >>> > > > On Wed, 10 Jan 2024 at 15:31, Mickael Maison <mickael.mai...@gmail.com> wrote:
> > >>> > > > >
> > >>> > > > > Hi Elxan,
> > >>> > > > >
> > >>> > > > > Thanks for the KIP, it looks like a useful addition.
> > >>> > > > >
> > >>> > > > > Can you add to the KIP the default value you propose for
> > >>> > > > > replication.lag.metric.refresh.interval? In MirrorMaker most interval
> > >>> > > > > configs can be set to -1 to disable them; will that be the case for this
> > >>> > > > > new feature, or will this setting only accept positive values?
> > >>> > > > > I also wonder if replication-lag or record-lag would be clearer names
> > >>> > > > > instead of replication-offset-lag, WDYT?
> > >>> > > > >
> > >>> > > > > Thanks,
> > >>> > > > > Mickael
> > >>> > > > >
> > >>> > > > > On Wed, Jan 3, 2024 at 6:15 PM Elxan Eminov <elxanemino...@gmail.com> wrote:
> > >>> > > > > >
> > >>> > > > > > Hi all,
> > >>> > > > > > Here is the vote thread:
> > >>> > > > > > https://lists.apache.org/thread/ftlnolcrh858dry89sjg06mdcdj9mrqv
> > >>> > > > > >
> > >>> > > > > > Cheers!
> > >>> > > > > >
> > >>> > > > > > On Wed, 27 Dec 2023 at 11:23, Elxan Eminov <elxanemino...@gmail.com> wrote:
> > >>> > > > > >
> > >>> > > > > > > Hi all,
> > >>> > > > > > > I've updated the KIP with the details we discussed in this thread.
> > >>> > > > > > > I'll call a vote after the holidays if everything looks good.
> > >>> > > > > > > Thanks!
> > >>> > > > > > >
> > >>> > > > > > > On Sat, 26 Aug 2023 at 15:49, Elxan Eminov <elxanemino...@gmail.com> wrote:
> > >>> > > > > > >
> > >>> > > > > > >> Relatively minor change with a new metric for MM2:
> > >>> > > > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-971%3A+Expose+replication-offset-lag+MirrorMaker2+metric
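To illustrate the TTL vs refresh-interval constraint discussed further up in the thread, here is a rough sketch of the two options that were mentioned (failing to start on an invalid combination, or falling back to a value that satisfies the constraint and logging an error). The helper and its names are illustrative only, not what the KIP specifies:

import org.apache.kafka.common.config.ConfigException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LagMetricConfigCheck {

    private static final Logger log = LoggerFactory.getLogger(LagMetricConfigCheck.class);

    // Returns the TTL to use for the last-replicated-offset cache, given the
    // lag metric refresh interval. Either fail fast or fall back and log an error.
    static long effectiveTtlMs(long ttlMs, long refreshIntervalMs, boolean failFast) {
        if (ttlMs >= refreshIntervalMs) {
            return ttlMs;
        }
        String msg = "last-replicated-offset TTL (" + ttlMs + " ms) is lower than "
                + "the lag metric refresh interval (" + refreshIntervalMs + " ms)";
        if (failFast) {
            // Option 1: refuse to start on an invalid combination.
            throw new ConfigException(msg);
        }
        // Option 2: fall back to a value that satisfies the constraint and log an error.
        log.error("{}; falling back to the refresh interval as the TTL", msg);
        return refreshIntervalMs;
    }
}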