Hi Lydian,
2.41.0 is quite old; can you please try the current version to see whether
this issue is still present? There were a lot of changes between 2.41.0
and 2.59.0.
Jan
On 9/17/24 17:49, Lydian Lee wrote:
Hi,
We are using Beam Python SDK with Flink Runner, the Beam version is
2.41.0 and the Flink version is 1.15.4.
We have a pipeline with 2 stages:
1. read from Kafka and apply a fixed window of 1 minute
2. aggregate the data for the past minute, reshuffle so that we have a
lower partition count, and write the results to S3.
We disabled enable.auto.commit, enabled commit_offset_in_finalize, and
set auto.offset.reset to "latest".
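For reference, this is roughly how the source is wired up (a sketch only:
the broker address, group id, topic name, and runner options are
placeholders, and the aggregation/sink steps are elided):

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    opts = PipelineOptions(runner='FlinkRunner')  # placeholder runner options
    with beam.Pipeline(options=opts) as p:
        _ = (
            p
            # Stage 1: read from Kafka; with commit_offset_in_finalize=True,
            # offsets are committed back to Kafka on bundle finalization
            # rather than auto-committed by the consumer.
            | ReadFromKafka(
                consumer_config={
                    'bootstrap.servers': 'broker:9092',  # placeholder
                    'group.id': 'my-group',              # placeholder
                    'enable.auto.commit': 'false',
                    'auto.offset.reset': 'latest',
                },
                topics=['my-topic'],                     # placeholder
                commit_offset_in_finalize=True)
            | beam.WindowInto(window.FixedWindows(60))   # 1-minute fixed windows
            # Stage 2: aggregate, then reshuffle to reduce the partition
            # count before writing to S3 (aggregation and sink elided).
            | beam.GroupByKey()
            | beam.Reshuffle())
```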
According to the logs, the data is definitely being consumed from
Kafka, because there are many entries like
```
Resetting offset for topic XXXX-<PARTITION> to offset <OFFSET>
```
and those partition/offset pairs do match the missing records. However,
the records never show up in the final S3 output.
My current hypothesis is that the reshuffle might be the cause. For
example, suppose that for the past minute Kafka partition 1 held records
at offsets 1, 2, and 3. After the reshuffle they might be distributed,
for example, as:
- partition A: 1, 3
- partition B: 2
If partition A completes successfully but partition B fails, then A's
success commits its offsets to Kafka, advancing the committed offset
to 3. When the failed work is retried, consumption resumes after the
committed offset, so offset 2 is skipped. However, I am not sure exactly
how the offset commit works or how it interacts with checkpoints. If my
hypothesis were correct, I would expect many more missing records, yet
this seems to happen only rarely. Can anyone help identify potential
root causes? Thanks
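To make the suspected interleaving concrete, here is a tiny stand-alone
simulation (plain Python, nothing Beam-specific; the bundle assignment
and the "commit the highest finished offset" rule are my assumptions,
not observed behavior):

```python
# Toy model of the suspected commit/retry race.
# Offsets 1..3 from one Kafka partition are reshuffled into two bundles.
bundles = {'A': [1, 3], 'B': [2]}  # assumed post-shuffle assignment

committed_offset = 0
delivered = set()

# Bundle A finishes and (hypothetically) commits its highest offset,
# even though bundle B, which holds offset 2, has failed.
delivered.update(bundles['A'])
committed_offset = max(bundles['A'])  # committed offset is now 3

# On retry, consumption resumes after the committed offset,
# so offset 2 is never re-read.
retried = [o for o in bundles['B'] if o > committed_offset]
delivered.update(retried)

missing = {1, 2, 3} - delivered
print(missing)  # prints {2}: offset 2 is lost under this model
```

Under this model any out-of-order commit loses the intermediate offsets,
which is why I would expect the loss to be far more frequent than what
we actually observe.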