Hi, Jeremiah,

From what you have answered, it looks to me like a transient error (probably
a timeout due to transient network errors, as you mentioned), and your job
was able to retry, recover, and make progress.

Just one thing to confirm: I saw that you configured task.commit.ms=-1, and you
mentioned that your checkpointed offset metric does increment over
time. Are you calling commit in your user code?
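
To make the question concrete, here is a minimal sketch in plain Java (no Samza dependency) that reads the checkpoint settings quoted below and reports whether periodic commits appear to be switched off. The class name `CommitConfigCheck` and its helper methods are hypothetical, and the sketch assumes, as the question above implies, that a negative task.commit.ms disables the periodic commit. The 60000 fallback is likewise an assumption made for the sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of Samza.
public class CommitConfigCheck {

    // Parse properties-style config text such as the block quoted in this thread.
    static Properties parse(String text) {
        Properties config = new Properties();
        try {
            config.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return config;
    }

    // Assumption: a negative task.commit.ms means no periodic commit.
    // The fallback of 60000 ms is an assumption used when the key is absent.
    static boolean periodicCommitDisabled(Properties config) {
        long commitMs = Long.parseLong(
                config.getProperty("task.commit.ms", "60000").trim());
        return commitMs < 0;
    }

    public static void main(String[] args) {
        Properties config = parse(
                "task.checkpoint.system=kafka\n"
              + "task.checkpoint.replication.factor=3\n"
              + "task.commit.ms=-1\n");

        if (periodicCommitDisabled(config)) {
            // In a Samza task, an explicit commit is requested from process() via
            // coordinator.commit(TaskCoordinator.RequestScope.CURRENT_TASK).
            System.out.println("periodic commit disabled; expect explicit commits in user code");
        } else {
            System.out.println("periodic commit enabled");
        }
    }
}
```

If the check reports that periodic commits are disabled while the checkpointed offsets still advance, an explicit commit call somewhere in the task code would be the most likely explanation.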

Thanks!

-Yi

On Fri, Mar 13, 2020 at 9:46 AM Jeremiah Adams
<jad...@helixeducation.com.invalid> wrote:

>  Do you see the Samza job hanging after that?
> The job does not hang.
>
>
> Is the checkpointed offset metric incrementing in this case?
> We do get incremented offsets.
>
> I'm not clear on your claim that "logs stop at that point". Are no logs
> written after the WARN lines?
> My apologies for the confusion - I see no lag messages related to the
> warning. I see all of our normal processing logs. I'm assuming this means
> the retry worked.
>
>
> What's your Samza configuration?
>
> job.coordinator.factory=org.apache.samza.standalone.PassthroughJobCoordinatorFactory
> job.coordinator.replication.factor=1
> job.default.system=kafka
> systems.kafka.producer.bootstrap.servers=<removed>.confluent.cloud:9092
>
> task.name.grouper.factory=org.apache.samza.container.grouper.task.GroupByContainerIdsFactory
> systems.kafka.producer.ssl.endpoint.identification.algorithm=https
> systems.kafka.producer.sasl.mechanism=PLAIN
> systems.kafka.producer.request.timeout.ms=20000
> systems.kafka.producer.retry.backoff.ms=500
> systems.kafka.producer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="" password="";
> systems.kafka.producer.security.protocol=SASL_SSL
> systems.kafka.consumer.ssl.endpoint.identification.algorithm=https
> systems.kafka.consumer.sasl.mechanism=PLAIN
> systems.kafka.consumer.request.timeout.ms=20000
> systems.kafka.consumer.retry.backoff.ms=500
> systems.kafka.consumer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="" password="";
> systems.kafka.consumer.security.protocol=SASL_SSL
> processor.id=0
>
> # checkpointing
>
> task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
> task.checkpoint.system=kafka
> task.checkpoint.replication.factor=3
> task.commit.ms=-1
>
> Is the Samza container still running after you see those WARN logs?
> Yes.
>
>
> I think this is a timeout issue. We've never seen it before.
> The warning first appeared after testing Confluent's Cloud Kafka offering.
> We had no issues when running our own Kafka clusters in AWS.
>
>
> Jeremiah Adams
> Software Engineer
> www.helixeducation.com
> Blog | Twitter | Facebook | LinkedIn
>
> ________________________________________
> From: Yi Pan <nickpa...@gmail.com>
> Sent: Wednesday, March 11, 2020 5:48 PM
> To: dev@samza.apache.org
> Subject: Re: Got Error Produce Response with Correlation Id.
>
> Hi, Jeremiah,
>
> Sorry for the late reply. This WARN message indicates that the producer
> failed to flush to the checkpoint topic and will retry. Do you see the Samza
> job hanging after that? Is the checkpointed offset metric incrementing in
> this case? I'm not clear on your claim that "logs stop at that point". Are
> no logs written after the WARN lines? What's your Samza configuration? Is
> the Samza container still running after you see those WARN logs?
>
> Thanks!
>
> -Yi
>
> On Wed, Mar 11, 2020 at 2:39 PM Jeremiah Adams
> <jad...@helixeducation.com.invalid> wrote:
>
> > Can anyone take a look at the message below? We are trying to gauge our
> > risk before moving forward.
> >
> >
> > Jeremiah Adams
> > Software Engineer
> >
> > Blog | Twitter | Facebook | LinkedIn
> >
> > ________________________________________
> > From: Jeremiah Adams <jad...@helixeducation.com.INVALID>
> > Sent: Wednesday, March 4, 2020 2:28 PM
> > To: dev@samza.apache.org
> > Subject: Got Error Produce Response with Correlation Id.
> >
> > Hello devs,
> >
> >
> > I've got a warning showing up in the logs while testing our new Confluent
> > Cloud config.  Can anyone tell me how concerned I should be about this
> > warning? Is there a setting to control timeouts?
> >
> >
> > Also, logs stop at that point, so I can't tell if the "metadata update"
> > was complete.
> >
> >
> >
> > 2020-03-04 21:17:51 Sender [WARN] [Producer
> > clientId=kafka_producer-application_submission-1] Got error produce
> > response with correlation id 144 on topic-partition
> > __samza_checkpoint_ver_1_for_application-submission_1-0, retrying
> > (2147483646 attempts left). Error: NETWORK_EXCEPTION
> > 2020-03-04 21:17:51 Sender [WARN] [Producer
> > clientId=kafka_producer-application_submission-1] Received invalid metadata
> > error in produce request on partition
> > __samza_checkpoint_ver_1_for_application-submission_1-0 due to
> > org.apache.kafka.common.errors.NetworkException: The server disconnected
> > before a response was received.. Going to request metadata update now
> >
> >
> > Jeremiah Adams
> > Software Engineer
> >
> > Blog | Twitter | Facebook | LinkedIn
> >
>
