Cool! Sounds good to me! Happy to help!

-Yi
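
On the earlier question in this thread about a setting to control timeouts: a minimal sketch of the producer settings involved, assuming the job keeps passing Kafka client options through the systems.kafka.producer.* prefix as the quoted config already does. The first two values are the ones from the quoted config. The delivery.timeout.ms line is an assumption added for illustration: it only exists in Kafka clients 2.1 and later, and 120000 is simply the Kafka client default, not a value tested against Confluent Cloud.

```properties
# Already present in the job config quoted in this thread:
# how long the producer waits for a broker response before treating
# the request as failed, and the pause between retry attempts.
systems.kafka.producer.request.timeout.ms=20000
systems.kafka.producer.retry.backoff.ms=500

# Assumption (not in the original config): upper bound on the total time
# for one send, including retries. Requires a Kafka client of 2.1 or later.
# 120000 is the Kafka client default and is shown only as a starting point.
systems.kafka.producer.delivery.timeout.ms=120000
```

The WARN line quoted below reports 2147483646 attempts left, so retries are effectively unbounded in this job. A transient NETWORK_EXCEPTION on the checkpoint topic is therefore retried rather than dropped, which matches the recovery described in the thread.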

On Fri, Mar 13, 2020 at 1:13 PM Jeremiah Adams <jad...@helixeducation.com.invalid> wrote:

> Yes, I explicitly commit via code for this job as an effort to ensure only
> once processing.
>
> Thanks for taking the time to look into our concerns.
>
> Jeremiah Adams
> Software Engineer
> www.helixeducation.com
> Blog | Twitter | Facebook | LinkedIn
>
> ________________________________________
> From: Yi Pan <nickpa...@gmail.com>
> Sent: Friday, March 13, 2020 1:27 PM
> To: dev@samza.apache.org
> Subject: Re: Got Error Produce Respons with Correlation Id.
>
> Hi, Jeremiah,
>
> From what you have answered, it looks to me as a transient error (probably
> timeout due to some transient network errors as you mentioned) and your job
> was able to retry/recover and make progress.
>
> Just one thing to confirm: I saw your configured task.commit.ms=-1, and you
> have mentioned that your checkpointed offset metrics DOES increment over
> time. Are you calling commit in your user code?
>
> Thanks!
>
> -Yi
>
> On Fri, Mar 13, 2020 at 9:46 AM Jeremiah Adams
> <jad...@helixeducation.com.invalid> wrote:
>
> > Do you see the Samza job hanging after that?
> > The job does not hang.
> >
> > Is the checkpointed offset metrics incrementing in this case?
> > We do get incremented offsets.
> >
> > Not clear on your claiming: "logs stop at that point". No logs are written
> > after the WARN lines?
> > My apologies for the confusion - I see no lag messages related to the
> > warning. I see all of our normal processing logs. I'm assuming this means
> > the retry worked.
> >
> > What's your Samza configuration?
> >
> > job.coordinator.factory=org.apache.samza.standalone.PassthroughJobCoordinatorFactory
> > job.coordinator.replication.factor=1
> > job.default.system=kafka
> > systems.kafka.producer.bootstrap.servers=<removed>.confluent.cloud:9092
> > task.name.grouper.factory=org.apache.samza.container.grouper.task.GroupByContainerIdsFactory
> > systems.kafka.producer.ssl.endpoint.identification.algorithm=https
> > systems.kafka.producer.sasl.mechanism=PLAIN
> > systems.kafka.producer.request.timeout.ms=20000
> > systems.kafka.producer.retry.backoff.ms=500
> > systems.kafka.producer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="" password="";
> > systems.kafka.producer.security.protocol=SASL_SSL
> > systems.kafka.consumer.ssl.endpoint.identification.algorithm=https
> > systems.kafka.consumer.sasl.mechanism=PLAIN
> > systems.kafka.consumer.request.timeout.ms=20000
> > systems.kafka.consumer.retry.backoff.ms=500
> > systems.kafka.consumer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="" password="";
> > systems.kafka.consumer.security.protocol=SASL_SSL
> > processor.id=0
> >
> > # checkpointing
> > task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
> > task.checkpoint.system=kafka
> > task.checkpoint.replication.factor=3
> > task.commit.ms=-1
> >
> > Is the Samza container still running after you see those WARN logs?
> > Yes.
> >
> > I am thinking this is a timeout issue. We've never seen the issue before.
> > The warning first appeared after testing Confluent's Cloud kafka offering.
> > We had no issues when running our own kafka clusters in aws.
> >
> > Jeremiah Adams
> > Software Engineer
> >
> > ________________________________________
> > From: Yi Pan <nickpa...@gmail.com>
> > Sent: Wednesday, March 11, 2020 5:48 PM
> > To: dev@samza.apache.org
> > Subject: Re: Got Error Produce Respons with Correlation Id.
> >
> > Hi, Jeremiah,
> >
> > Sorry to reply late. This WARN message indicates that producer failed to
> > flush to checkpoint topic and would retry. Do you see the Samza job hanging
> > after that? Is the checkpointed offset metrics incrementing in this case?
> > Not clear on your claiming: "logs stop at that point". No logs are written
> > after the WARN lines? What's your Samza configuration? Is the Samza
> > container still running after you see those WARN logs?
> >
> > Thanks!
> >
> > -Yi
> >
> > On Wed, Mar 11, 2020 at 2:39 PM Jeremiah Adams
> > <jad...@helixeducation.com.invalid> wrote:
> >
> > > Can anyone take a look at the message below? We are trying to gauge our
> > > risk before moving forward.
> > >
> > > Jeremiah Adams
> > > Software Engineer
> > >
> > > ________________________________________
> > > From: Jeremiah Adams <jad...@helixeducation.com.INVALID>
> > > Sent: Wednesday, March 4, 2020 2:28 PM
> > > To: dev@samza.apache.org
> > > Subject: Got Error Produce Response iwth Correlation Id.
> > >
> > > Hello devs,
> > >
> > > I've got a warning showing up in the logs while testing our new Confluent
> > > Cloud config. Can anyone tell me how concerned I should be about this
> > > warning? Is there a setting to control timeouts?
> > >
> > > Also, logs stop at that point, so I can't tell if the "metatdata update"
> > > was complete.
> > >
> > > 2020-03-04 21:17:51 Sender [WARN] [Producer clientId=kafka_producer-application_submission-1] Got error produce response with correlation id 144 on topic-partition __samza_checkpoint_ver_1_for_application-submission_1-0, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION
> > > 2020-03-04 21:17:51 Sender [WARN] [Producer clientId=kafka_producer-application_submission-1] Received invalid metadata error in produce request on partition __samza_checkpoint_ver_1_for_application-submission_1-0 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now
> > >
> > > Jeremiah Adams
> > > Software Engineer