Cool! Sounds good to me! Happy to help!

-Yi

On Fri, Mar 13, 2020 at 1:13 PM Jeremiah Adams
<jad...@helixeducation.com.invalid> wrote:

> Yes, I explicitly commit via code for this job in an effort to ensure
> only-once processing.
>
> Thanks for taking the time to look into our concerns.
>
> Jeremiah Adams
> Software Engineer
> www.helixeducation.com
> Blog | Twitter | Facebook | LinkedIn
>
> ________________________________________
> From: Yi Pan <nickpa...@gmail.com>
> Sent: Friday, March 13, 2020 1:27 PM
> To: dev@samza.apache.org
> Subject: Re: Got Error Produce Response with Correlation Id.
>
> Hi, Jeremiah,
>
> From what you have answered, it looks to me like a transient error (probably
> a timeout due to some transient network errors, as you mentioned), and your
> job was able to retry/recover and make progress.
>
> Just one thing to confirm: I saw that you configured task.commit.ms=-1, and
> you have mentioned that your checkpointed offset metric DOES increment over
> time. Are you calling commit in your user code?
>
> Thanks!
>
> -Yi
>
> On Fri, Mar 13, 2020 at 9:46 AM Jeremiah Adams
> <jad...@helixeducation.com.invalid> wrote:
>
> > Do you see the Samza job hanging after that?
> > The job does not hang.
> >
> >
> > Is the checkpointed offset metric incrementing in this case?
> > We do get incremented offsets.
> >
> > Not clear on your claim: "logs stop at that point". No logs are written
> > after the WARN lines?
> > My apologies for the confusion - I see no lag messages related to the
> > warning. I see all of our normal processing logs. I'm assuming this means
> > the retry worked.
> >
> >
> > What's your Samza configuration?
> >
> > job.coordinator.factory=org.apache.samza.standalone.PassthroughJobCoordinatorFactory
> > job.coordinator.replication.factor=1
> > job.default.system=kafka
> > systems.kafka.producer.bootstrap.servers=<removed>.confluent.cloud:9092
> > task.name.grouper.factory=org.apache.samza.container.grouper.task.GroupByContainerIdsFactory
> > systems.kafka.producer.ssl.endpoint.identification.algorithm=https
> > systems.kafka.producer.sasl.mechanism=PLAIN
> > systems.kafka.producer.request.timeout.ms=20000
> > systems.kafka.producer.retry.backoff.ms=500
> > systems.kafka.producer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="" password="";
> > systems.kafka.producer.security.protocol=SASL_SSL
> > systems.kafka.consumer.ssl.endpoint.identification.algorithm=https
> > systems.kafka.consumer.sasl.mechanism=PLAIN
> > systems.kafka.consumer.request.timeout.ms=20000
> > systems.kafka.consumer.retry.backoff.ms=500
> > systems.kafka.consumer.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="" password="";
> > systems.kafka.consumer.security.protocol=SASL_SSL
> > processor.id=0
> >
> > # checkpointing
> > task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
> > task.checkpoint.system=kafka
> > task.checkpoint.replication.factor=3
> > task.commit.ms=-1
> >
> > Is the Samza container still running after you see those WARN logs?
> > Yes.
> >
> >
> > I am thinking this is a timeout issue. We've never seen the issue before.
> > The warning first appeared after testing Confluent's Cloud Kafka offering.
> > We had no issues when running our own Kafka clusters in AWS.
> >
> >
> > Jeremiah Adams
> > Software Engineer
> > https://url.emailprotection.link/?bM9S-3pRw1lv8pYfwa-TwdjElP4W2K6b9vP5Crz22L_YcgsRJ-13h-OgPZSwFtU7GSNTDi1z-jdaRvWESRhtTVA~~
> > Blog | Twitter | Facebook | LinkedIn
> >
> > ________________________________________
> > From: Yi Pan <nickpa...@gmail.com>
> > Sent: Wednesday, March 11, 2020 5:48 PM
> > To: dev@samza.apache.org
> > Subject: Re: Got Error Produce Response with Correlation Id.
> >
> > Hi, Jeremiah,
> >
> > Sorry for the late reply. This WARN message indicates that the producer
> > failed to flush to the checkpoint topic and will retry. Do you see the
> > Samza job hanging after that? Is the checkpointed offset metric
> > incrementing in this case? Not clear on your claim: "logs stop at that
> > point". No logs are written after the WARN lines? What's your Samza
> > configuration? Is the Samza container still running after you see those
> > WARN logs?
> >
> > Thanks!
> >
> > -Yi
> >
> > On Wed, Mar 11, 2020 at 2:39 PM Jeremiah Adams
> > <jad...@helixeducation.com.invalid> wrote:
> >
> > > Can anyone take a look at the message below? We are trying to gauge our
> > > risk before moving forward.
> > >
> > >
> > > Jeremiah Adams
> > > Software Engineer
> > > https://url.emailprotection.link/?bM9S-3pRw1lv8pYfwa-TwdjElP4W2K6b9vP5Crz22L_YcgsRJ-13h-OgPZSwFtU7GSNTDi1z-jdaRvWESRhtTVA~~
> > > Blog | Twitter | Facebook | LinkedIn
> > >
> > > ________________________________________
> > > From: Jeremiah Adams <jad...@helixeducation.com.INVALID>
> > > Sent: Wednesday, March 4, 2020 2:28 PM
> > > To: dev@samza.apache.org
> > > Subject: Got Error Produce Response with Correlation Id.
> > >
> > > Hello devs,
> > >
> > >
> > > I've got a warning showing up in the logs while testing our new Confluent
> > > Cloud config. Can anyone tell me how concerned I should be about this
> > > warning? Is there a setting to control timeouts?
> > >
> > >
> > > Also, logs stop at that point, so I can't tell if the "metadata update"
> > > was complete.
> > >
> > >
> > >
> > > 2020-03-04 21:17:51 Sender [WARN] [Producer
> > > clientId=kafka_producer-application_submission-1] Got error produce
> > > response with correlation id 144 on topic-partition
> > > __samza_checkpoint_ver_1_for_application-submission_1-0, retrying
> > > (2147483646 attempts left). Error: NETWORK_EXCEPTION
> > > 2020-03-04 21:17:51 Sender [WARN] [Producer
> > > clientId=kafka_producer-application_submission-1] Received invalid metadata
> > > error in produce request on partition
> > > __samza_checkpoint_ver_1_for_application-submission_1-0 due to
> > > org.apache.kafka.common.errors.NetworkException: The server disconnected
> > > before a response was received.. Going to request metadata update now
> > >
> > >
> > > Jeremiah Adams
> > > Software Engineer
> > > https://url.emailprotection.link/?bM9S-3pRw1lv8pYfwa-TwdjElP4W2K6b9vP5Crz22L_YcgsRJ-13h-OgPZSwFtU7GSNTDi1z-jdaRvWESRhtTVA~~
> > > Blog | Twitter | Facebook | LinkedIn
>
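
On the WARN line quoted above: 2147483646 is Integer.MAX_VALUE - 1, which
suggests the producer's retry limit was effectively unbounded and that this
was the first retry of that checkpoint write. The sketch below is a
stand-alone illustration of that retry-and-count-down behaviour, in line with
the reading in this thread that the error was transient and the retry
succeeded. It is not Kafka's or Samza's code: RetrySketch, Send,
TransientNetworkException and sendWithRetries are hypothetical names made up
for the example, and the 10 ms backoff is arbitrary.

```java
/** Stand-alone illustration only; all names here are hypothetical, not Kafka or Samza APIs. */
final class RetrySketch {

    /** Simulates a retriable failure such as the NETWORK_EXCEPTION in the log. */
    static final class TransientNetworkException extends RuntimeException {
        TransientNetworkException(String message) {
            super(message);
        }
    }

    /** One attempt to send; throws TransientNetworkException on a transient failure. */
    interface Send {
        void attempt();
    }

    /**
     * Calls send until it succeeds or maxRetries retries have been used.
     * Returns the total number of attempts made (first try plus retries).
     */
    static int sendWithRetries(Send send, int maxRetries, long backoffMs) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                send.attempt();
                return attempts;
            } catch (TransientNetworkException e) {
                int retriesUsed = attempts - 1;
                if (retriesUsed >= maxRetries) {
                    throw e; // retry budget exhausted: surface the error
                }
                System.out.printf("WARN retrying (%d attempts left). Error: %s%n",
                        maxRetries - retriesUsed - 1, e.getMessage());
                sleepQuietly(backoffMs);
            }
        }
    }

    /** Pauses between attempts; restores the interrupt flag if interrupted. */
    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", ie);
        }
    }

    public static void main(String[] args) {
        int[] failuresLeft = {2}; // fail twice, then succeed
        int attempts = sendWithRetries(() -> {
            if (failuresLeft[0]-- > 0) {
                throw new TransientNetworkException("NETWORK_EXCEPTION");
            }
        }, Integer.MAX_VALUE, 10L);
        System.out.println("succeeded after " + attempts + " attempts");
    }
}
```

Run as a single file with `java RetrySketch.java` (Java 11 or later). With two
simulated failures it logs two retry warnings and then reports success on the
third attempt. In the real job, the closest knobs should be the
systems.kafka.producer.request.timeout.ms and
systems.kafka.producer.retry.backoff.ms values already present in the
configuration quoted above.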
