> Do you see the Samza job hanging after that?

The job does not hang.

> Are the checkpointed offset metrics incrementing in this case?

We do get incremented offsets.

> Not clear on your claim: "logs stop at that point". No logs are written after the WARN lines?

My apologies for the confusion. I see no further log messages related to the warning. I see all of our normal processing logs. I'm assuming this means the retry worked.

> What's your Samza configuration?

job.coordinator.factory=org.apache.samza.standalone.PassthroughJobCoordinatorFactory
job.coordinator.replication.factor=1
job.default.system=kafka
systems.kafka.producer.bootstrap.servers=<removed>.confluent.cloud:9092
task.name.grouper.factory=org.apache.samza.container.grouper.task.GroupByContainerIdsFactory
systems.kafka.producer.ssl.endpoint.identification.algorithm=https
systems.kafka.producer.sasl.mechanism=PLAIN
systems.kafka.producer.request.timeout.ms=20000
systems.kafka.producer.retry.backoff.ms=500
systems.kafka.producer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="" password="";
systems.kafka.producer.security.protocol=SASL_SSL
systems.kafka.consumer.ssl.endpoint.identification.algorithm=https
systems.kafka.consumer.sasl.mechanism=PLAIN
systems.kafka.consumer.request.timeout.ms=20000
systems.kafka.consumer.retry.backoff.ms=500
systems.kafka.consumer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="" password="";
systems.kafka.consumer.security.protocol=SASL_SSL
processor.id=0
# checkpointing
task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
task.checkpoint.replication.factor=3
task.commit.ms=-1

> Is the Samza container still running after you see those WARN logs?

Yes.

I am thinking this is a timeout issue. We've never seen the issue before. The warning first appeared after testing Confluent's Cloud Kafka offering. We had no issues when running our own Kafka clusters in AWS.
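On the timeout question, a rough sketch may help show which of the settings above reach the Kafka producer that writes the checkpoint topic. It assumes, per my reading of the Samza configuration reference, that keys under systems.kafka.producer.* are handed to the Kafka producer with the prefix stripped. The helper names (parse_properties, producer_overrides) and the trimmed SAMPLE text are hypothetical, made up for illustration, and are not part of Samza.

```python
# Sketch only: list the producer overrides found in a Samza-style properties text.
# SAMPLE is a shortened, illustrative copy of the configuration in this thread.
SAMPLE = """\
job.default.system=kafka
systems.kafka.producer.request.timeout.ms=20000
systems.kafka.producer.retry.backoff.ms=500
systems.kafka.consumer.request.timeout.ms=20000
# checkpointing
task.commit.ms=-1
"""

PRODUCER_PREFIX = "systems.kafka.producer."


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values that contain '=' stay intact.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def producer_overrides(props, prefix=PRODUCER_PREFIX):
    """Return the settings under the prefix, keyed by the name after the prefix."""
    return {k[len(prefix):]: v for k, v in props.items() if k.startswith(prefix)}


overrides = producer_overrides(parse_properties(SAMPLE))
for name, value in sorted(overrides.items()):
    print(f"{name} = {value}")
```

If that assumption holds, request.timeout.ms and retry.backoff.ms are the two timing-related producer settings currently in play, and any other Kafka producer setting would be added under the same prefix.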
Jeremiah Adams
Software Engineer
www.helixeducation.com
Blog | Twitter | Facebook | LinkedIn

________________________________________
From: Yi Pan <nickpa...@gmail.com>
Sent: Wednesday, March 11, 2020 5:48 PM
To: dev@samza.apache.org
Subject: Re: Got Error Produce Response with Correlation Id.

Hi, Jeremiah,

Sorry to reply late. This WARN message indicates that the producer failed to flush to the checkpoint topic and will retry. Do you see the Samza job hanging after that? Are the checkpointed offset metrics incrementing in this case? Not clear on your claim: "logs stop at that point". No logs are written after the WARN lines? What's your Samza configuration? Is the Samza container still running after you see those WARN logs?

Thanks!

-Yi

On Wed, Mar 11, 2020 at 2:39 PM Jeremiah Adams <jad...@helixeducation.com.invalid> wrote:

> Can anyone take a look at the message below? We are trying to gauge our
> risk before moving forward.
>
> Jeremiah Adams
> Software Engineer
>
> ________________________________________
> From: Jeremiah Adams <jad...@helixeducation.com.INVALID>
> Sent: Wednesday, March 4, 2020 2:28 PM
> To: dev@samza.apache.org
> Subject: Got Error Produce Response with Correlation Id.
>
> Hello devs,
>
> I've got a warning showing up in the logs while testing our new Confluent
> Cloud config. Can anyone tell me how concerned I should be about this
> warning? Is there a setting to control timeouts?
>
> Also, logs stop at that point, so I can't tell if the "metadata update"
> was complete.
>
> 2020-03-04 21:17:51 Sender [WARN] [Producer clientId=kafka_producer-application_submission-1] Got error produce response with correlation id 144 on topic-partition __samza_checkpoint_ver_1_for_application-submission_1-0, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
> 2020-03-04 21:17:51 Sender [WARN] [Producer clientId=kafka_producer-application_submission-1] Received invalid metadata error in produce request on partition __samza_checkpoint_ver_1_for_application-submission_1-0 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now
>
> Jeremiah Adams
> Software Engineer