Hello, we currently see the same WARN logs when the app is processing a low rate of records; as soon as the app starts processing a high volume of records, those WARN logs stop showing up. According to other email threads, this should be fixed by https://issues.apache.org/jira/browse/KAFKA-7190.
We use version 2.2.1, with EOS.

Cheers!
--
Jonathan

On Tue, Jun 11, 2019 at 11:04 AM Pieter Hameete <pieter.hame...@blockbax.com> wrote:

> Hi Guozhang,
>
> Thanks, that clarifies a lot. The segment size or segment ms is not the issue here then; changing these properties does not make the errors go away either 😊
>
> Regarding the proposed solutions:
>
> 1) We use EOS, so the commit interval defaults to 100 ms in our case. We don't want to delay consumers that respect transactions in their consumption of the produced events, so increasing this config is not a good option for us. We have tried temporarily increasing it to 30s, and indeed partitions receiving records at a higher frequency (i.e. 1 per 5 seconds) stopped resulting in errors. Partitions receiving records at a lower frequency (i.e. 1 per minute) were still resulting in errors.
>
> 2) We have a small disparity in keys, so the data is already spread over a small number (3 or 4) of partitions. I don't think reducing the number of partitions to, for example, 4 will make a difference in this case?
>
> My attempt to summarize the issue would be the following:
>
> If exactly-once is enabled and records are produced on a partition less frequently than the commit interval, then the brokers/producers will raise an UNKNOWN_PRODUCER_ID error/warning for each produced record, and the producers need to retry once for every produced record. This happens because commits trigger truncation of the producer ID map, causing producers that did not produce any messages since the last commit to have their ID removed.
>
> In our case we use a Kafka Streams application with exactly-once enabled and the default commit.interval.ms of 100 ms. We don't wish to increase this default, so that consumers stay as close to real time as possible. Even if we scaled down to 1 partition (which is undesirable), I don't think the producers would produce a record to this one partition between every offset commit. Basically we can't resolve this problem unless we had a volume large enough to produce a record on each partition between every offset commit.
>
> Our only options seem to be to move from exactly-once to at-least-once, or to wait for https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer.
>
> Best,
>
> Pieter
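As a minimal sketch of the Streams configuration being discussed (the application id and bootstrap address below are placeholders, not taken from the thread): with processing.guarantee set to exactly_once, Streams lowers the commit.interval.ms default to 100 ms unless it is overridden explicitly.

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class EosConfigSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-rule-engine");   // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");      // placeholder
            // Exactly-once processing; under EOS the commit interval defaults to 100 ms.
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
            // Raising the commit interval reduces commit (and repartition-topic purge)
            // frequency, at the cost of higher latency for transactional consumers:
            // props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30000);
            System.out.println(props);
        }
    }

Increasing the commit interval is the trade-off described in 1) above: fewer commits mean less frequent purging of the repartition topics, but consumers that respect transactions see the data later.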
> -----Original Message-----
> From: Guozhang Wang <wangg...@gmail.com>
> Sent: Monday, 10 June 2019 22:19
> To: users@kafka.apache.org
> Subject: Re: Repeating UNKNOWN_PRODUCER_ID errors for Kafka streams applications
>
> Hi Pieter,
>
> My reasoning is that the most recent segment (called the `active` segment) would not be deleted immediately, since it is still being appended to. I.e. say you have two segments with offset ranges [0, 100) and [100, 180). If the delete-records API is issued to delete any records older than 150, then the first segment would be deleted, but the second (and active) one would not. Instead we would only set the log start offset to 150.
>
> But when I double-checked the code again, I realized that even if we do not delete the segment, we would still "truncate" the maintained producer ID map, and hence what you've seen is correct: although the log cleaner did not delete the segment file yet, the corresponding producer ID would still be removed.
>
> So for now, what I can think of to reduce this issue is that you can either 1) increase your commit interval, and hence reduce the delete-records frequency, or 2) decrease your num.partitions so that the per-partition traffic from a single producer can be reasonably large.
>
>
> Guozhang
>
>
> On Fri, Jun 7, 2019 at 2:09 AM Pieter Hameete <pieter.hame...@blockbax.com> wrote:
>
> > Hi Guozhang,
> >
> > I've checked the log-cleaner.log on the brokers for cleanup of repartition topics for the event-rule-engine application, but I found no reports of frequent cleanup there.
> >
> > Also forgive me if I am wrong, because I don't understand the exact mechanics of the log cleanup processes. But didn't KAFKA-6150 (https://github.com/apache/kafka/pull/4270/) introduce a cleanup mechanism for repartition topics that cleans up the logs more aggressively (i.e. not following the regular LogCleaner cleanup process that is triggered by segment.ms and segment.bytes) by deleting records via the admin client based on committed input topic offsets? This would explain producer IDs not being known if the logs get cleaned up almost instantly via this process.
> >
> > Best,
> >
> > Pieter
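For context, the repartition-topic purging from KAFKA-6150 that is described here goes through the admin delete-records API rather than the LogCleaner. A minimal standalone sketch of that API (Streams calls it internally against the committed offsets; the bootstrap address, partition, and offset below are placeholders):

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.RecordsToDelete;
    import org.apache.kafka.common.TopicPartition;

    public class DeleteRecordsSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
            try (AdminClient admin = AdminClient.create(props)) {
                TopicPartition tp = new TopicPartition(
                        "event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition", 24);
                // Delete everything below the given offset (1234L is a placeholder).
                // This only advances the log start offset; the active segment file stays,
                // but as discussed above the broker-side producer ID state is still truncated.
                Map<TopicPartition, RecordsToDelete> request =
                        Collections.singletonMap(tp, RecordsToDelete.beforeOffset(1234L));
                admin.deleteRecords(request).all().get();
            }
        }
    }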
> > -----Original Message-----
> > From: Guozhang Wang <wangg...@gmail.com>
> > Sent: Thursday, 6 June 2019 18:16
> > To: users@kafka.apache.org
> > Subject: Re: Repeating UNKNOWN_PRODUCER_ID errors for Kafka streams applications
> >
> > Pieter,
> >
> > KIP-360 should be able to fix it, yes, but it won't be completed soon, and the earliest release it may land in is 2.4. In the meantime, my thinking was that increasing the segment size would reduce the frequency at which records are truncated, and hence at which the producer ID is removed.
> >
> > If you look at the broker-side logs, could you check whether the log cleanup threads log that they are truncating the event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition-xx partitions? If that happens too frequently it would cause the producer IDs stored on the brokers to be removed too early, and therefore UNKNOWN_PRODUCER_ID, but overriding the segment size / ms should prevent it from being triggered too often.
> >
> > Guozhang
> >
> >
> > On Thu, Jun 6, 2019 at 5:10 AM Pieter Hameete <pieter.hame...@blockbax.com> wrote:
> >
> > > Hello,
> > >
> > > We also found this earlier email in the archives, which looks very much like what we are experiencing:
> > > http://mail-archives.apache.org/mod_mbox/kafka-users/201811.mbox/%3CCAM0VdefApmc5wBZQaJmQtbcnZ_OOgGv84qCuPoJS-KU4B=e...@mail.gmail.com%3E
> > >
> > > So it seems like:
> > >
> > > * It only happens with EXACTLY_ONCE processing
> > > * It only happens on repartition topics (probably due to the changes in https://issues.apache.org/jira/browse/KAFKA-6150)
> > > * The errors are happening continuously (on each produced message)
> > >
> > > The email I referred to points to https://issues.apache.org/jira/browse/KAFKA-7190 for ongoing work on solving this issue. However, it seems that KAFKA-7190 addresses UNKNOWN_PRODUCER_ID errors with a different cause (low traffic). The hotfix of changing the retention settings on the topic does not seem to resolve the errors in our case.
> > >
> > > Do you think the changes listed under KIP-360 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer) could solve what we are seeing here?
> > >
> > > Best,
> > >
> > > Pieter
> > >
> > > -----Original Message-----
> > > From: Pieter Hameete <pieter.hame...@blockbax.com>
> > > Sent: Thursday, 6 June 2019 13:39
> > > To: users@kafka.apache.org
> > > Subject: RE: Repeating UNKNOWN_PRODUCER_ID errors for Kafka streams applications
> > >
> > > Hi Guozhang, Matthias,
> > >
> > > @Guozhang The brokers are on version 2.2.0-cp2 and the clients are on version 2.2.1. We upgraded yesterday from 2.1.1 to this version to see if it would make a difference, but unfortunately it did not. After restarting the brokers and streams applications, the broker errors and client warnings continued as before.
> > >
> > > @Matthias I've increased segment.bytes to 10 times the default, but this does not change anything.
> > >
> > > For my understanding: why are retention settings important here? The producer that is logging the UNKNOWN_PRODUCER_ID exception produced a message 5 seconds earlier (with the same exception, though...). I remember from one of your comments (Matthias) that the brokers build up the set of known producers from the log. If a message was produced on a partition 5 seconds earlier by the same producer (confirmed via kafkacat), how can the broker throw an UNKNOWN_PRODUCER_ID exception?
> > >
> > > Best,
> > >
> > > Pieter
> > >
> > > -----Original Message-----
> > > From: Guozhang Wang <wangg...@gmail.com>
> > > Sent: Thursday, 6 June 2019 01:39
> > > To: users@kafka.apache.org
> > > Subject: Re: Repeating UNKNOWN_PRODUCER_ID errors for Kafka streams applications
> > >
> > > Hello Pieter,
> > >
> > > It's normal to see the offset increasing by 2: each message on that partition comes from a separate transaction, which Streams EOS leverages, and whenever a txn commits, a commit marker (written as a special message) is appended but not exposed to consumers. Hence that txn marker takes up one offset as well.
> > >
> > > Your overridden configs look good to me, and it's indeed weird that they did not mitigate the issue you observed. Which Kafka version did you use?
> > >
> > >
> > > Guozhang
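A quick way to observe what is described above is to read the partition with a plain consumer and print the offsets: commit markers are never returned to the consumer, but they still occupy an offset, so consecutive records can be 2 apart. A minimal sketch (bootstrap address and partition number are placeholders):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class OffsetGapCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            // read_committed returns only committed data; transaction markers are skipped
            // but still consume an offset, hence the gaps.
            props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition(
                        "event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition", 24);
                consumer.assign(Collections.singletonList(tp));
                consumer.seekToBeginning(Collections.singletonList(tp));
                for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofSeconds(5))) {
                    System.out.println("offset=" + record.offset());
                }
            }
        }
    }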
> > > On Wed, Jun 5, 2019 at 4:43 AM Pieter Hameete <pieter.hame...@blockbax.com> wrote:
> > >
> > > > Hi Guozhang,
> > > >
> > > > An additional finding: it seems to only happen on Kafka Streams repartition topics. We haven't seen this happening for any other topics so far.
> > > >
> > > > Best,
> > > >
> > > > Pieter
> > > >
> > > > -----Original Message-----
> > > > From: Pieter Hameete <pieter.hame...@blockbax.com>
> > > > Sent: Wednesday, 5 June 2019 11:23
> > > > To: users@kafka.apache.org
> > > > Subject: RE: Repeating UNKNOWN_PRODUCER_ID errors for Kafka streams applications
> > > >
> > > > Hi Guozhang,
> > > >
> > > > Thanks for your reply! I noticed my original mail went out twice by accident, sorry for that.
> > > >
> > > > We currently have a small variety of keys, so indeed not all partitions are 'actively used'. The strange thing though is that the errors occur for the partitions that actively receive records every few seconds. I have checked this using kafkacat to consume the specific partitions. Something I noticed was that for each received record the partition offset was 2 higher than that of the previous record, instead of the expected 1. Could that be due to the producers retrying (see the warning logs in my original mail)?
> > > >
> > > > I've done the override for the configs on the repartition topic as follows, on one of the brokers. The values are taken from your KIP-443 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-443%3A+Return+to+default+segment.ms+and+segment.index.bytes+in+Streams+repartition+topics):
> > > >
> > > > kafka-topics --zookeeper localhost:2181 --alter --topic event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition --config segment.index.bytes=10485760
> > > > kafka-topics --zookeeper localhost:2181 --alter --topic event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition --config segment.bytes=52428800
> > > > kafka-topics --zookeeper localhost:2181 --alter --topic event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition --config segment.ms=604800000
> > > > kafka-topics --zookeeper localhost:2181 --alter --topic event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition --config retention.ms=-1
> > > >
> > > > Verifying afterwards:
> > > >
> > > > kafka-topics --zookeeper localhost:2181 --describe --topic event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition
> > > >
> > > > Topic:event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition PartitionCount:32 ReplicationFactor:3 Configs:segment.bytes=52428800,retention.ms=-1,segment.index.bytes=10485760,segment.ms=604800000,cleanup.policy=delete
> > > >
> > > > Is there anything that seems off to you? Or something else I can investigate further? We'd really like to nail this issue down, especially because the cause seems different from the 'low traffic' cause in JIRA issue KAFKA-7190, as the partitions for which the errors are thrown are receiving data.
> > > >
> > > > Best,
> > > >
> > > > Pieter
> > > >
> > > > -----Original Message-----
> > > > From: Guozhang Wang <wangg...@gmail.com>
> > > > Sent: Wednesday, 5 June 2019 02:23
> > > > To: users@kafka.apache.org
> > > > Subject: Re: Repeating UNKNOWN_PRODUCER_ID errors for Kafka streams applications
> > > >
> > > > Hello Pieter,
> > > >
> > > > If you only have one record every few seconds, that may be too little given you have at least 25 partitions (I saw you have an xxx--repartition-24 partition), which means a single partition may not see any records for a long time; in that case you may need to override the configs to very large values. On the other hand, reducing your num.partitions may also help increase the traffic per partition.
> > > >
> > > > Also, could you show me how you overrode the configs on the repartition topics?
> > > >
> > > >
> > > > Guozhang
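The kafka-topics --describe output above can also be checked programmatically. A minimal AdminClient sketch (bootstrap address is a placeholder) that additionally shows whether each value is a per-topic override (DYNAMIC_TOPIC_CONFIG) or a broker/default value:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.Config;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class DescribeTopicConfigs {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC,
                        "event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition");
                Config config = admin.describeConfigs(Collections.singletonList(topic))
                        .all().get().get(topic);
                for (ConfigEntry entry : config.entries()) {
                    // source() tells per-topic overrides apart from defaults.
                    System.out.println(entry.name() + "=" + entry.value() + " (" + entry.source() + ")");
                }
            }
        }
    }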
> > > > On Tue, Jun 4, 2019 at 2:10 AM Pieter Hameete <pieter.hame...@blockbax.com> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > Our Kafka Streams applications are showing the following warning every few seconds (on each of our 3 brokers, and on each of the 2 instances of the streams application):
> > > > >
> > > > > [Producer clientId=event-rule-engine-dd71ae9b-523c-425d-a7c0-c62993315b30-StreamThread-1-1_24-producer, transactionalId=event-rule-engine-1_24] Resetting sequence number of batch with current sequence 1 for partition event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition-24 to 0
> > > > >
> > > > > Followed by:
> > > > >
> > > > > [Producer clientId=event-rule-engine-dd71ae9b-523c-425d-a7c0-c62993315b30-StreamThread-1-1_24-producer, transactionalId=event-rule-engine-1_24] Got error produce response with correlation id 5902 on topic-partition event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition-24, retrying (2147483646 attempts left). Error: UNKNOWN_PRODUCER_ID
> > > > >
> > > > > The brokers are showing errors that look related:
> > > > >
> > > > > Error processing append operation on partition event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition-24 (kafka.server.ReplicaManager)
> > > > > org.apache.kafka.common.errors.UnknownProducerIdException: Found no record of producerId=72 on the broker. It is possible that the last message with the producerId=72 has been removed due to hitting the retention limit.
> > > > >
> > > > > We would expect the UNKNOWN_PRODUCER_ID error to occur once: after a retry the record would be published on the partition and the producer ID would be known again. However, this error keeps occurring every few seconds. This is roughly the same rate at which records are produced on the input topic's partitions, so it seems to occur for (nearly) every input record.
> > > > >
> > > > > The following JIRA issue looks related: https://issues.apache.org/jira/browse/KAFKA-7190. However, that issue mentions 'little traffic', and I am not sure whether a message every few seconds is regarded as little traffic. Matthias mentions in the issue that a workaround seems to be to increase the topic configs `segment.bytes`, `segment.index.bytes`, and `segment.ms` for the corresponding repartition topics. We've tried manually overriding these configs for a relevant topic with the config values from the linked pull request (https://github.com/apache/kafka/pull/6511), but this did not make the errors disappear.
> > > > >
> > > > > Could anyone help us figure out what is happening here, and why the proposed fix for the above JIRA issue is not working in this case?
> > > > >
> > > > > Best,
> > > > >
> > > > > Pieter
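For completeness, the manual per-topic overrides mentioned above can also be applied with the AdminClient instead of the kafka-topics commands quoted earlier in the thread. A minimal sketch, assuming a client and brokers new enough for incrementalAlterConfigs (2.3+; on the 2.2 brokers discussed here the CLI route is the one to use), with the bootstrap address as a placeholder:

    import java.util.Arrays;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class OverrideRepartitionTopicConfigs {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC,
                        "event-rule-engine-KSTREAM-REDUCE-STATE-STORE-0000000015-repartition");
                // Same values as the kafka-topics overrides shown earlier in the thread.
                Collection<AlterConfigOp> ops = Arrays.asList(
                        new AlterConfigOp(new ConfigEntry("segment.index.bytes", "10485760"), AlterConfigOp.OpType.SET),
                        new AlterConfigOp(new ConfigEntry("segment.bytes", "52428800"), AlterConfigOp.OpType.SET),
                        new AlterConfigOp(new ConfigEntry("segment.ms", "604800000"), AlterConfigOp.OpType.SET),
                        new AlterConfigOp(new ConfigEntry("retention.ms", "-1"), AlterConfigOp.OpType.SET));
                admin.incrementalAlterConfigs(Collections.singletonMap(topic, ops)).all().get();
            }
        }
    }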
> > > > --
> > > > -- Guozhang
> > >
> > > --
> > > -- Guozhang
> >
> > --
> > -- Guozhang
>
> --
> -- Guozhang

--
Santilli Jonathan