Hi, Jeremiah,

From what you have answered, this looks to me like a transient error (probably a timeout caused by transient network errors, as you mentioned), and your job was able to retry, recover, and make progress.
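To illustrate why I read the WARN line as a retry rather than a failure, here is a minimal Python sketch that pulls the retry counter out of the first log line you posted. It assumes the producer's retry budget is Integer.MAX_VALUE (2147483647); that is my assumption, not something confirmed from your config. The names `parse_retry_warning`, `retries_used`, and `ASSUMED_RETRY_BUDGET` are made up for this example.

```python
import re

# Assumption: the producer's retry budget is Integer.MAX_VALUE (2**31 - 1).
ASSUMED_RETRY_BUDGET = 2**31 - 1

# First WARN line from the original report, joined onto one line.
WARN_LINE = (
    "2020-03-04 21:17:51 Sender [WARN] [Producer "
    "clientId=kafka_producer-application_submission-1] Got error produce "
    "response with correlation id 144 on topic-partition "
    "__samza_checkpoint_ver_1_for_application-submission_1-0, retrying "
    "(2147483646 attempts left). Error: NETWORK_EXCEPTION"
)

# Captures the topic-partition, the remaining attempts, and the error code.
PATTERN = re.compile(
    r"on topic-partition (?P<tp>\S+), retrying "
    r"\((?P<left>\d+) attempts left\)\. Error: (?P<error>\w+)"
)


def parse_retry_warning(line):
    """Return (topic_partition, attempts_left, error), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group("tp"), int(match.group("left")), match.group("error")


def retries_used(attempts_left, budget=ASSUMED_RETRY_BUDGET):
    """Retries consumed so far, under the assumed budget."""
    return budget - attempts_left


if __name__ == "__main__":
    topic_partition, attempts_left, error = parse_retry_warning(WARN_LINE)
    print(topic_partition, error, retries_used(attempts_left))
```

Under that assumption, `retries_used(2147483646)` returns 1, which is consistent with the producer scheduling its first retry for that batch on the checkpoint topic. Running the same check over a longer log window would show whether the counter keeps dropping for the same batch.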
Just one thing to confirm: I see that you configured task.commit.ms=-1, and you mentioned that your checkpointed offset metric does increment over time. Are you calling commit in your user code?

Thanks!

-Yi

On Fri, Mar 13, 2020 at 9:46 AM Jeremiah Adams <jad...@helixeducation.com.invalid> wrote:

> Do you see the Samza job hanging after that?
> The job does not hang.
>
> Is the checkpointed offset metrics incrementing in this case?
> We do get incremented offsets.
>
> Not clear on your claiming: "logs stop at that point". No logs are written
> after the WARN lines?
> My apologies for the confusion - I see no lag messages related to the
> warning. I see all of our normal processing logs. I'm assuming this means
> the retry worked.
>
> What's your Samza configuration?
>
> job.coordinator.factory=org.apache.samza.standalone.PassthroughJobCoordinatorFactory
> job.coordinator.replication.factor=1
> job.default.system=kafka
> systems.kafka.producer.bootstrap.servers=<removed>.confluent.cloud:9092
> task.name.grouper.factory=org.apache.samza.container.grouper.task.GroupByContainerIdsFactory
> systems.kafka.producer.ssl.endpoint.identification.algorithm=https
> systems.kafka.producer.sasl.mechanism=PLAIN
> systems.kafka.producer.request.timeout.ms=20000
> systems.kafka.producer.retry.backoff.ms=500
> systems.kafka.producer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="" password="";
> systems.kafka.producer.security.protocol=SASL_SSL
> systems.kafka.consumer.ssl.endpoint.identification.algorithm=https
> systems.kafka.consumer.sasl.mechanism=PLAIN
> systems.kafka.consumer.request.timeout.ms=20000
> systems.kafka.consumer.retry.backoff.ms=500
> systems.kafka.consumer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="" password="";
> systems.kafka.consumer.security.protocol=SASL_SSL
> processor.id=0
>
> # checkpointing
>
> task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
> task.checkpoint.system=kafka
> task.checkpoint.replication.factor=3
> task.commit.ms=-1
>
> Is the Samza container still running after you see those WARN logs?
> Yes.
>
> I am thinking this is a timeout issue. We've never seen the issue before.
> The warning first appeared after testing Confluent's Cloud kafka offering.
> We had no issues when running our own kafka clusters in aws.
>
> Jeremiah Adams
> Software Engineer
> www.helixeducation.com
> Blog | Twitter | Facebook | LinkedIn
>
> ________________________________________
> From: Yi Pan <nickpa...@gmail.com>
> Sent: Wednesday, March 11, 2020 5:48 PM
> To: dev@samza.apache.org
> Subject: Re: Got Error Produce Respons with Correlation Id.
>
> Hi, Jeremiah,
>
> Sorry to reply late. This WARN message indicates that producer failed to
> flush to checkpoint topic and would retry. Do you see the Samza job hanging
> after that? Is the checkpointed offset metrics incrementing in this case?
> Not clear on your claiming: "logs stop at that point". No logs are written
> after the WARN lines? What's your Samza configuration? Is the Samza
> container still running after you see those WARN logs?
>
> Thanks!
>
> -Yi
>
> On Wed, Mar 11, 2020 at 2:39 PM Jeremiah Adams <jad...@helixeducation.com.invalid> wrote:
>
> > Can anyone take a look at the message below? We are trying to gauge our
> > risk before moving forward.
> >
> > Jeremiah Adams
> > Software Engineer
> >
> > ________________________________________
> > From: Jeremiah Adams <jad...@helixeducation.com.INVALID>
> > Sent: Wednesday, March 4, 2020 2:28 PM
> > To: dev@samza.apache.org
> > Subject: Got Error Produce Response iwth Correlation Id.
> >
> > Hello devs,
> >
> > I've got a warning showing up in the logs while testing our new Confluent
> > Cloud config. Can anyone tell me how concerned I should be about this
> > warning? Is there a setting to control timeouts?
> >
> > Also, logs stop at that point, so I can't tell if the "metadata update"
> > was complete.
> >
> > 2020-03-04 21:17:51 Sender [WARN] [Producer
> > clientId=kafka_producer-application_submission-1] Got error produce
> > response with correlation id 144 on topic-partition
> > __samza_checkpoint_ver_1_for_application-submission_1-0, retrying
> > (2147483646 attempts left). Error: NETWORK_EXCEPTION
> > 2020-03-04 21:17:51 Sender [WARN] [Producer
> > clientId=kafka_producer-application_submission-1] Received invalid metadata
> > error in produce request on partition
> > __samza_checkpoint_ver_1_for_application-submission_1-0 due to
> > org.apache.kafka.common.errors.NetworkException: The server disconnected
> > before a response was received.. Going to request metadata update now
> >
> > Jeremiah Adams
> > Software Engineer