Hi Will,

Can you check the description in SAMZA-1822 to see whether this is exactly the problem you encountered? We just submitted the fix today.
Thanks!

On Tue, Aug 21, 2018 at 9:12 AM, Jagadish Venkatraman <jagadish1...@gmail.com> wrote:

> Hi Will,
>
> Is the topic in question your change-log topic or the checkpoint-topic or
> one of your inputs? (My understanding from reading this is it's your
> checkpoint.)
>
> Can you please attach some more surrounding logs?
>
> Thanks,
> Jagadish
>
>
> On Mon, Aug 20, 2018 at 6:16 AM, Will Schneider <wschnei...@tripadvisor.com> wrote:
>
> > Hello all,
> >
> > We've recently been experiencing some Kafka/Samza issues we're not quite
> > sure how to tackle. We've exhausted all our internal expertise and were
> > hoping that someone on the mailing lists might have seen this before and
> > knows what might cause it:
> >
> > KafkaSystemConsumer [WARN] While refreshing brokers for
> > [Store_LogParser_RedactedMetadata_RedactedEnvironment,35]:
> > org.apache.kafka.common.errors.OffsetOutOfRangeException:
> > The requested offset is not within the range of offsets maintained by the
> > server.. Retrying.
> >
> > ^ (The above repeats indefinitely until we intervene; an offset-range
> >   check along these lines is sketched after the quoted thread.)
> >
> > A bit about our use case:
> >
> > - Versions:
> >   - Kafka 1.0.1 (CDH distribution 3.1.0-1.3.1.0.p0.35)
> >   - Samza 0.14.1
> >   - Hadoop 2.6.0-cdh5.12.1
> > - We've seen some manifestation of this error in 4 different environments
> >   with minor differences in configuration, but all running the same
> >   versions of the software:
> >   - Distributed Samza on YARN (~10-node YARN environment, 3-7 node Kafka
> >     environment)
> >   - Non-distributed virtual test environment (Samza on YARN, but with no
> >     network in between)
> > - We have not found a reliable way to reproduce this error.
> > - The issue typically presents at process startup. It usually doesn't make
> >   a difference whether the application was down for 5 minutes or 5 days
> >   before that startup.
> > - The LogParser application experiencing this issue is reading and parsing
> >   a set of log files and supplementing them with metadata stored in the
> >   Store topic in question and cached locally in RocksDB.
> > - The LogParser application has 40-60 running tasks and partitions,
> >   depending on configuration.
> > - There is no discernible pattern for where the error presents itself:
> >   - It is not consistent WRT which YARN node hosts the tasks with the
> >     issue.
> >   - It is not consistent WRT which Kafka node hosts the partitions
> >     relevant to the issue.
> >   - The pattern does not persist with affected nodes upon consecutive
> >     appearances of the error.
> >   - This leads us to believe the bug is probably endemic to the whole
> >     cluster and not the result of a random hardware issue.
> > - Offsets for the LogParser application are maintained in a Samza topic
> >   called something like:
> >   - __samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1
> > - Upon startup, checkpoints are refreshed from that topic, and we'll see
> >   something in the log similar to:
> >   - kafka.KafkaCheckpointManager [INFO] Read 6000 from topic:
> >     __samza_checkpoint_ver_1_for_LogParser-RedactedEnvironment_1.
> >     Current offset: 5999
> > - On more than one occasion, we have attempted to repair the job by
> >   killing individual YARN containers and letting Samza retry them.
> >   - This will occasionally work. More frequently, it will get the
> >     partition stuck in a loop trying to read from the __samza_checkpoint
> >     topic forever; we're suspicious that the retry loop above is storing
> >     offsets one or many times, causing the topic to fill up considerably.
> > - We are aware of only two workarounds:
> >   1. Fully clearing out the data disks on the Kafka servers and
> >      rebuilding the topics always seems to work, at least for a time.
> >   2. We can use a setting like
> >      streams.Store_LogParser_RedactedMetadata_RedactedEnvironment.samza.reset.offset=true,
> >      which will necessarily ignore the checkpoint topic and not bother to
> >      validate any offset on the Store. (A minimal config sketch follows
> >      the quoted thread.)
> >      - This works, but requires us to do a lengthy metadata refresh
> >        immediately after startup, which is less than ideal.
> > - We have also seen this on rare occasions on other, smaller Samza tiers.
> >   - In those cases, the common thread appears to be that the tier was
> >     left down for a period of time longer than the Kafka retention
> >     timeout, and got stuck in the loop upon restart. Attempts at
> >     reproducing it this way have been unsuccessful.
> >   - It's worth adding that in this case, adding the samza.reset.offset
> >     parameter to the configuration did not seem to have the intended
> >     effect.
> >
> > On another possibly related note, one of our clusters periodically throws
> > an error like this, but usually recovers without intervention:
> >
> > KafkaSystemAdmin [WARN] Exception while trying to get offset for
> > SystemStreamPartition [kafka, Store_LogParser_RedactedMetadata_RedactedEnvironment,
> > 32]: org.apache.kafka.common.errors.NotLeaderForPartitionException: This
> > server is not the leader for that topic-partition.. Retrying.
> >
> > - We've seen this error message crop up when we've had issues with the
> >   network in our datacenter, but we're not aware of any such issue at the
> >   times when we're experiencing the bigger issue. We're not sure whether
> >   that might be related or not.
> >
> > Has anyone seen these errors before? Is there a known workaround or fix
> > for them?
> >
> > Thanks for your help!
> >
> > Attached is a copy of the Samza configuration for the job in question, in
> > case it contains more valuable information I may have missed.
> >
> > -Will Schneider
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
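For anyone hitting the same OffsetOutOfRangeException loop, below is a minimal
standalone diagnostic sketch (not part of the thread) that compares a
checkpointed offset against the range of offsets the brokers currently retain
for one partition, using the plain Kafka 1.0 Java consumer API. The bootstrap
server, the hard-coded checkpointedOffset, and the class name are placeholders;
the topic and partition number are taken from the log line quoted above.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class OffsetRangeCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka-broker:9092");  // placeholder
            props.put("key.deserializer", ByteArrayDeserializer.class.getName());
            props.put("value.deserializer", ByteArrayDeserializer.class.getName());

            TopicPartition tp = new TopicPartition(
                    "Store_LogParser_RedactedMetadata_RedactedEnvironment", 35);
            long checkpointedOffset = 123456L;  // placeholder: value read from the checkpoint topic

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                List<TopicPartition> tps = Collections.singletonList(tp);
                // Earliest retained offset and current log-end offset for the partition
                Map<TopicPartition, Long> earliest = consumer.beginningOffsets(tps);
                Map<TopicPartition, Long> latest = consumer.endOffsets(tps);

                long lo = earliest.get(tp);
                long hi = latest.get(tp);
                System.out.printf("%s retained offset range: [%d, %d]%n", tp, lo, hi);

                // Fetching below the earliest retained offset or beyond the log end
                // is what produces OffsetOutOfRangeException on the consumer side.
                if (checkpointedOffset < lo || checkpointedOffset > hi) {
                    System.out.println("Checkpointed offset " + checkpointedOffset
                            + " is OUTSIDE the retained range.");
                } else {
                    System.out.println("Checkpointed offset " + checkpointedOffset
                            + " is within the retained range.");
                }
            }
        }
    }

If the checkpointed offset reported by the loop falls outside that range, the
symptom is consistent with the retention-timeout scenario described in the
thread rather than with a broker-side bug.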
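For completeness, here is a minimal sketch of the second workaround mentioned
above, using the stream-scoped overrides available in Samza 0.14. The
samza.offset.default line is an assumption added here, not something stated in
the thread; adjust or drop it depending on whether the store should be rebuilt
from the beginning of the retained log ("oldest") or only from new messages
("upcoming").

    # Sketch only: ignore the checkpointed offset for this stream on startup
    streams.Store_LogParser_RedactedMetadata_RedactedEnvironment.samza.reset.offset=true
    # Assumed companion setting: where to start once the checkpoint is ignored
    streams.Store_LogParser_RedactedMetadata_RedactedEnvironment.samza.offset.default=oldest

As the original post notes, this skips checkpoint validation entirely and
forces a lengthy metadata refresh after startup, so it is a mitigation rather
than a fix.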