Hi Will,

Can you check the description in SAMZA-1822 to see whether this is exactly the problem you encountered? We just submitted the fix today.
Thanks!

On Tue, Aug 21, 2018 at 9:12 AM, Jagadish Venkatraman <jagadish1...@gmail.com> wrote:

> Hi Will,
>
> Is the topic in question your change-log topic or the checkpoint-topic or
> one of your inputs? (My understanding from reading this is it's your
> checkpoint.)
>
> Can you please attach some more surrounding logs?
>
> Thanks,
> Jagadish
>
>
> On Mon, Aug 20, 2018 at 6:16 AM, Will Schneider <wschnei...@tripadvisor.com> wrote:
>
> > Hello all,
> >
> > We've recently been experiencing some Kafka/Samza issues we're not quite
> > sure how to tackle. We've exhausted all our internal expertise and were
> > hoping that someone on the mailing lists might have seen this before and
> > knows what might cause it:
> >
> > KafkaSystemConsumer [WARN] While refreshing brokers for
> > [Store_LogParser_RedactedMetadata_RedactedEnvironment,35]:
> > org.apache.kafka.common.errors.OffsetOutOfRangeException:
> > The requested offset is not within the range of offsets maintained by the
> > server.. Retrying.
> >
> > ^ (The above repeats indefinitely until we intervene; an offset-range
> >   check along these lines is sketched after the quoted thread.)
> >
> > A bit about our use case:
> >
> > - Versions:
> >   - Kafka 1.0.1 (CDH distribution 3.1.0-1.3.1.0.p0.35)
> >   - Samza 0.14.1
> >   - Hadoop 2.6.0-cdh5.12.1
> > - We've seen some manifestation of this error in 4 different environments
> >   with minor differences in configuration, but all running the same
> >   versions of the software:
> >   - Distributed Samza on YARN (~10-node YARN environment, 3-7 node Kafka
> >     environment)
> >   - Non-distributed virtual test environment (Samza on YARN, but with no
> >     network in between)
> > - We have not found a reliable way to reproduce this error.
> > - The issue typically presents at process startup. It usually doesn't make
> >   a difference whether the application was down for 5 minutes or 5 days
> >   before that startup.
> > - The LogParser application experiencing this issue is reading and parsing
> >   a set of log files and supplementing them with metadata stored in the
> >   Store topic in question and cached locally in RocksDB.
> > - The LogParser application has 40-60 running tasks and partitions,
> >   depending on configuration.
> > - There is no discernible pattern for where the error presents itself:
> >   - It is not consistent WRT which YARN node hosts the tasks with the
> >     issue.
> >   - It is not consistent WRT which Kafka node hosts the partitions
> >     relevant to the issue.
> >   - The pattern does not persist with affected nodes upon consecutive
> >     appearances of the error.
> >   - This leads us to believe the bug is probably endemic to the whole
> >     cluster and not the result of a random hardware issue.
> > - Offsets for the LogParser application are maintained in a Samza topic
> >   called something like:
> >   - __samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1
> > - Upon startup, checkpoints are refreshed from that topic, and we'll see
> >   something in the log similar to:
> >   - kafka.KafkaCheckpointManager [INFO] Read 6000 from topic:
> >     __samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1.
> >     Current offset: 5999
> > - On more than one occasion, we have attempted to repair the job by
> >   killing individual YARN containers and letting Samza retry them.
> >   - This will occasionally work. More frequently, it will get the
> >     partition stuck in a loop trying to read from the __samza_checkpoint
> >     topic forever; we're suspicious that the retry loop above is storing
> >     offsets one or many times, causing the topic to fill up considerably.
> > - We are aware of only two workarounds:
> >   1. Fully clearing out the data disks on the Kafka servers and
> >      rebuilding the topics always seems to work, at least for a time.
> >   2. We can use a setting like
> >      streams.Store_LogParser_RedactedMetadata_RedactedEnvironment.samza.reset.offset=true,
> >      which will necessarily ignore the checkpoint topic and not bother to
> >      validate any offset on the Store. (A minimal config sketch follows
> >      the quoted thread.)
> >      - This works, but requires us to do a lengthy metadata refresh
> >        immediately after startup, which is less than ideal.
> > - We have also seen this on rare occasions on other, smaller Samza tiers.
> >   - In those cases, the common thread appears to be that the tier was
> >     left down for a period of time longer than the Kafka retention
> >     timeout, and got stuck in the loop upon restart. Attempts at
> >     reproducing it this way have been unsuccessful.
> >   - It's worth adding that in this case, adding the samza.reset.offset
> >     parameter to the configuration did not seem to have the intended
> >     effect.
> >
> > On another possibly related note, one of our clusters periodically throws
> > an error like this, but usually recovers without intervention:
> >
> > KafkaSystemAdmin [WARN] Exception while trying to get offset for
> > SystemStreamPartition [kafka, Store_LogParser_RedactedMetadata_RedactedEnvironment,
> > 32]: org.apache.kafka.common.errors.NotLeaderForPartitionException: This
> > server is not the leader for that topic-partition.. Retrying.
> >
> > - We've seen this error message crop up when we've had issues with the
> >   network in our datacenter, but we're not aware of any such issue at the
> >   times when we're experiencing the bigger issue. We're not sure whether
> >   that might be related or not.
> >
> > Has anyone seen these errors before? Is there a known workaround or fix
> > for them?
> >
> > Thanks for your help!
> >
> > Attached is a copy of the Samza configuration for the job in question, in
> > case it contains more valuable information I may have missed.
> >
> > -Will Schneider
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
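For anyone hitting the same OffsetOutOfRangeException loop, below is a minimal
standalone diagnostic sketch (not part of the thread) that compares a
checkpointed offset against the range of offsets the brokers currently retain
for one partition, using the plain Kafka 1.0 Java consumer API. The bootstrap
server, the hard-coded checkpointedOffset, and the class name are placeholders;
the topic and partition number are taken from the log line quoted above.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class OffsetRangeCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-broker:9092");  // placeholder
            props.put("key.deserializer", ByteArrayDeserializer.class.getName());
            props.put("value.deserializer", ByteArrayDeserializer.class.getName());

            TopicPartition tp = new TopicPartition(
                    "Store_LogParser_RedactedMetadata_RedactedEnvironment", 35);
            long checkpointedOffset = 123456L;  // placeholder: value read from the checkpoint topic

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                List<TopicPartition> tps = Collections.singletonList(tp);
                // Earliest retained offset and current log-end offset for the partition
                Map<TopicPartition, Long> earliest = consumer.beginningOffsets(tps);
                Map<TopicPartition, Long> latest = consumer.endOffsets(tps);

                long lo = earliest.get(tp);
                long hi = latest.get(tp);
                System.out.printf("%s retained offset range: [%d, %d]%n", tp, lo, hi);

                // Fetching below the earliest retained offset or beyond the log end
                // is what produces OffsetOutOfRangeException on the consumer side.
                if (checkpointedOffset < lo || checkpointedOffset > hi) {
                    System.out.println("Checkpointed offset " + checkpointedOffset
                            + " is OUTSIDE the retained range.");
                } else {
                    System.out.println("Checkpointed offset " + checkpointedOffset
                            + " is within the retained range.");
                }
            }
        }
    }

If the checkpointed offset reported by the loop falls outside that range, the
symptom is consistent with the retention-timeout scenario described in the
thread rather than with a broker-side bug.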
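For completeness, here is a minimal sketch of the second workaround mentioned
above, using the stream-scoped overrides available in Samza 0.14. The
samza.offset.default line is an assumption added here, not something stated in
the thread; adjust or drop it depending on whether the store should be rebuilt
from the beginning of the retained log ("oldest") or only from new messages
("upcoming").

    # Sketch only: ignore the checkpointed offset for this stream on startup
    streams.Store_LogParser_RedactedMetadata_RedactedEnvironment.samza.reset.offset=true
    # Assumed companion setting: where to start once the checkpoint is ignored
    streams.Store_LogParser_RedactedMetadata_RedactedEnvironment.samza.offset.default=oldest

As the original post notes, this skips checkpoint validation entirely and
forces a lengthy metadata refresh after startup, so it is a mitigation rather
than a fix.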