Thank you David and Liam for your excellent responses. Checking in the consumer will be extremely difficult. However, I am still unclear about what the record-error-rate|total metric for a producer means. Is the metric incremented only when a record ultimately fails to reach the topic, or also whenever there is a transient/retriable error while trying to send it?
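To make the distinction I am asking about concrete, here is a toy sketch (my own illustration, not Kafka client code) of the two possible accountings for a record that hits two retriable errors before eventually succeeding:

```python
# Hypothetical sketch of the two possible meanings of record-error-total.

def count_errors_per_attempt(attempts):
    """Interpretation A: every failed send attempt, including retried
    transient errors, bumps the counter."""
    return sum(1 for outcome in attempts if outcome != "ok")

def count_errors_final_only(attempts):
    """Interpretation B: only a record whose delivery ultimately fails
    (retries exhausted, or a fatal error) bumps the counter."""
    return 0 if attempts and attempts[-1] == "ok" else 1

# A record that hit two retriable errors but was eventually delivered:
attempts = ["retriable", "retriable", "ok"]
print(count_errors_per_attempt(attempts))  # -> 2
print(count_errors_final_only(attempts))   # -> 0
```

Under interpretation A the metric would count 2 for this record; under interpretation B it would count 0. That is exactly the difference I am trying to pin down.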
I am posting below the producer properties that I am using:

acks = -1
batch.size = 16384
bootstrap.servers = [##MASKED##]
buffer.memory = 23622320128
client.dns.lookup = use_all_dns_ips
client.id = producer-1
compression.type = none
connections.max.idle.ms = 540000
delivery.timeout.ms = 2880000
enable.idempotence = true
interceptor.classes = []
internal.auto.downgrade.txn.commit = false
key.serializer = class org.apache.kafka.common.serialization.StringSerializer
linger.ms = 0
max.block.ms = 1440000
max.in.flight.requests.per.connection = 5
max.request.size = 1048576
metadata.max.age.ms = 7200000
metadata.max.idle.ms = 7200000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retries = 2147483647
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
security.providers = null
send.buffer.bytes = 131072
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
ssl.endpoint.identification.algorithm = https
ssl.engine.factory.class = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLSv1.3
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
transaction.timeout.ms = 60000
transactional.id = null
value.serializer = class io.vertx.kafka.client.serialization.JsonObjectSerializer

Regards,
Neeraj

On Monday, 4 April, 2022, 03:19:08 pm GMT+10, David Finnie <david.fin...@infrasoft.com.au> wrote:

Hi Neeraj,

I don't know what might be causing the first Produce error. Is the OUT_OF_ORDER_SEQUENCE_NUMBER the first Produce error? From the error that you included (Invalid sequence number for new epoch) it would seem that the broker doesn't (yet) know about the Producer's epoch - possibly because it is still catching up after you restarted it? Note that the first sequence number for a new epoch must be 0, so if the broker thinks that it is a new epoch, but the sequence number is 3, it will cause this error. I can explain more about the relationship of Producer ID, Producer Epoch and Sequence Number if you want.

With 5 in-flight requests per connection, if any Produce is rejected, all other in-flight Produce requests will be rejected with OUT_OF_ORDER_SEQUENCE_NUMBER, because the first rejected Produce request's sequence number range never got stored, so all subsequent in-flight Produce requests are out of sequence. (e.g. if a Produce with sequence number 2 is rejected, and further Produce requests have already been sent with sequence numbers 3, 4 and 5, then after sequence number 2 is rejected the broker is still expecting the next sequence number to be 2, and so will reject sequence numbers 3, 4 and 5 with this error.)

When a Produce request is rejected, the Producer code first waits for responses to arrive for all in-flight Produce requests, and does not send any new Produce requests until all responses are received. The reason is that it doesn't know whether the other requests will be rejected with simply OUT_OF_ORDER_SEQUENCE_NUMBER, or for some other reason.
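The cascade David describes can be modelled in a few lines (an illustrative simulation, not actual broker code): once the broker rejects the batch it was expecting, its expected sequence number stops advancing, so every later in-flight batch comes back OUT_OF_ORDER_SEQUENCE_NUMBER.

```python
def broker_responses(expected_seq, inflight_seqs, failing_seq):
    """Toy model of the broker's per-partition idempotence check.

    Batches arrive in send order. The batch failing_seq is rejected
    for some unrelated reason, so its sequence range is never stored
    and expected_seq never advances past it.
    """
    responses = []
    for seq in inflight_seqs:
        if seq == failing_seq:
            responses.append("REJECTED")  # the original error
        elif seq == expected_seq:
            responses.append("OK")
            expected_seq += 1
        else:
            responses.append("OUT_OF_ORDER_SEQUENCE_NUMBER")
    return responses

# Broker expects sequence 2; batch 2 fails; 3, 4 and 5 are already in flight:
print(broker_responses(2, [2, 3, 4, 5], failing_seq=2))
# -> ['REJECTED', 'OUT_OF_ORDER_SEQUENCE_NUMBER',
#     'OUT_OF_ORDER_SEQUENCE_NUMBER', 'OUT_OF_ORDER_SEQUENCE_NUMBER']
```

With max.in.flight.requests.per.connection = 5, one transient rejection can therefore produce up to four OUT_OF_ORDER_SEQUENCE_NUMBER responses before the producer can recover.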
It needs to wait so that it knows which record batches to retry. It is most likely that the poor performance you are seeing is due to the need to wait for all in-flight Produce requests to receive responses, so there is less throughput achievable during such error-recovery periods. Additionally, any in-flight Produce requests that are deemed retriable need to be re-sent, so their network traffic is doubled.

Does this poor performance last for a long time? I would have thought that it should be just a minor hiccup, because the Producer will increment the epoch and reset sequence numbers from 0. That should then allow normal traffic to resume for that Producer ID on that partition.

Regarding whether the records made it to the topic - they should have. The log messages indicate that the Producer is retrying the records, and incrementing the epoch and resequencing is part of that process. Of course, you should probably check by setting up a consumer to ensure that all messages made it to the topic, if that is feasible.

David Finnie
Infrasoft Pty Limited

On 4/04/2022 12:42, Neeraj Vaidya wrote:
> Hi Liam,
> Thanks for getting back.
>
> 1) Producer settings (I am guessing these are the ones you are interested in):
> enable.idempotence=true
> max.in.flight.requests.per.connection=5
>
> 2) Sample broker logs corresponding to the timestamp in the application logs of the Producer:
>
> [2022-04-03 15:56:39,587] ERROR [ReplicaManager broker=5] Error processing append operation on partition input-topic-114 (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.OutOfOrderSequenceException: Invalid sequence number for new epoch at offset 967756 in partition input-topic-114: 158 (request epoch), 3 (seq. number)
>
> Do the producer errors indicate that these messages never made it to the Kafka topic at all?
>
> Regards,
> Neeraj
>
> On Monday, 4 April, 2022, 12:23:30 pm GMT+10, Liam Clarke-Hutchinson <lclar...@redhat.com> wrote:
>
> Hi Neeraj,
>
> First off, what are your producer settings?
> Secondly, do you have broker logs for the leaders of some of your affected topics on hand at all?
>
> Cheers,
>
> Liam Clarke-Hutchinson
>
> On Mon, 4 Apr 2022 at 14:04, Neeraj Vaidya <neeraj.vai...@yahoo.co.in.invalid> wrote:
>
>> Hi All,
>> For one of the Kafka producers that I have, I see that the Producer Record Error rate is non-zero, i.e. out of the expected 3000 messages per second which I expect to be producing to the topic, this metric shows a rate of about 200.
>> Does this indicate that the records failed to be sent to the Kafka topic, or does this metric show up even for each retry in the Producer.Send operation?
>>
>> Notes:
>> 1) I have distributed 8 brokers equally across 2 sites. Using rack-awareness, I am making Kafka place replicas equally across both sites. My min.isr = 2 and replication factor = 4, which puts 2 replicas in each site.
>> 2) The scenario I am testing is that of shutting down one site's set of 4 brokers (out of 8) for an extended period of time and then bringing them back up after, say, 2 hours. This causes the follower replicas on those brokers to try to catch up with the leader replicas on the other brokers. The error rate that I am referring to shows up under this scenario of restarting the brokers. It does not show up when only the other set of 4 brokers is running.
>>
>> To be specific, here are the errors that I see in the Kafka producer log file:
>>
>> 2022-04-03 15:56:39.613 WARN --- [-thread | producer-1] o.a.k.c.p.i.Sender : [Producer clientId=producer-1] Got error produce response with correlation id 16512434 on topic-partition input-topic-114, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER
>> 2022-04-03 15:56:39.613 WARN --- [-thread | producer-1] o.a.k.c.p.i.Sender : [Producer clientId=producer-1] Got error produce response with correlation id 16512434 on topic-partition input-topic-58, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER
>> 2022-04-03 15:56:39.613 INFO --- [-thread | producer-1] o.a.k.c.p.i.TransactionManager : [Producer clientId=producer-1] ProducerId set to 2040 with epoch 159
>> 2022-04-03 15:56:39.613 INFO --- [-thread | producer-1] o.a.k.c.p.i.ProducerBatch : Resetting sequence number of batch with current sequence 3 for partition input-topic-114 to 0
>> 2022-04-03 15:56:39.613 INFO --- [-thread | producer-1] o.a.k.c.p.i.ProducerBatch : Resetting sequence number of batch with current sequence 5 for partition input-topic-114 to 2
>> 2022-04-03 15:56:39.613 INFO --- [-thread | producer-1] o.a.k.c.p.i.ProducerBatch : Resetting sequence number of batch with current sequence 6 for partition input-topic-114 to 3
>> 2022-04-03 15:56:39.613 INFO --- [-thread | producer-1] o.a.k.c.p.i.ProducerBatch : Resetting sequence number of batch with current sequence 1 for partition input-topic-58 to 0
>> 2022-04-03 15:56:39.739 WARN --- [-thread | producer-1] o.a.k.c.p.i.Sender : [Producer clientId=producer-1] Got error produce response with correlation id 16512436 on topic-partition input-topic-82, retrying (2147483646 attempts left). Error: OUT_OF_ORDER_SEQUENCE_NUMBER
>>
>> Regards,
>> Neeraj
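The "Resetting sequence number" lines in the log above follow a simple rebasing rule - on the new epoch the producer shifts each retried batch's sequence down so that the earliest in-flight batch starts at 0, preserving the gaps between batches (a sketch of my reading of the log, not the actual Sender/ProducerBatch code):

```python
def rebase_sequences(inflight_seqs):
    """After an epoch bump, rebase the retried batches so the earliest
    in-flight sequence becomes 0, keeping their relative spacing."""
    base = min(inflight_seqs)
    return [seq - base for seq in inflight_seqs]

# input-topic-114 had batches with sequences 3, 5 and 6 in flight:
print(rebase_sequences([3, 5, 6]))  # -> [0, 2, 3], matching the log lines
# input-topic-58 had a single in-flight batch with sequence 1:
print(rebase_sequences([1]))        # -> [0]
```

The gaps (3 to 5, 5 to 6) presumably correspond to batches that were already acknowledged, which is why the rebased sequences are not strictly consecutive.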