We are encountering an issue where a Spark Structured Streaming read job from
an Iceberg table gets stuck after maintenance jobs (rewrite_data_files and
rewrite_manifests) have been run in parallel on the same table:
https://github.com/apache/iceberg/issues/10117
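
For context, here is a minimal sketch of our setup (the catalog, table, and
checkpoint names are placeholders, not our real ones):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class StuckStreamRepro {
  public static void main(String[] args) throws Exception {
    SparkSession spark =
        SparkSession.builder().appName("iceberg-stream-repro").getOrCreate();

    // Continuous read of new appends from the Iceberg table.
    Dataset<Row> stream = spark.readStream()
        .format("iceberg")
        .load("catalog.db.events");

    stream.writeStream()
        .format("console")
        .trigger(Trigger.ProcessingTime("30 seconds"))
        .option("checkpointLocation", "/tmp/checkpoints/events")
        .start();

    // In our case these run from a separate maintenance job, in parallel
    // with the stream; after they commit their `replace` snapshots, the
    // stream stops making progress.
    spark.sql("CALL catalog.system.rewrite_data_files(table => 'db.events')");
    spark.sql("CALL catalog.system.rewrite_manifests(table => 'db.events')");

    spark.streams().awaitAnyTermination();
  }
}
```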


I'm trying to understand what creates the end `StreamingOffset` from the
latest metadata.json file.
My theory is that either `SparkMicroBatchStream.latestOffset()` is not
returning the correct latest offset for the table from the latest
metadata.json, or `SparkMicroBatchStream.planFiles()` is not returning any
files when it encounters a `replace` snapshot.
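
My rough mental model of the per-snapshot filtering is sketched below. This
is an assumption about the shape of the logic, not the actual
`SparkMicroBatchStream` source; only the `org.apache.iceberg.DataOperations`
constants are real Iceberg identifiers:

```java
import org.apache.iceberg.DataOperations;
import org.apache.iceberg.Snapshot;

final class SnapshotFilterSketch {
  // Decide whether a snapshot contributes files to the micro-batch.
  static boolean shouldProcess(
      Snapshot snapshot, boolean skipDelete, boolean skipOverwrite) {
    switch (snapshot.operation()) {
      case DataOperations.APPEND:
        // Only appends carry new data files for the stream to read.
        return true;
      case DataOperations.REPLACE:
        // rewrite_data_files / rewrite_manifests commit `replace` snapshots;
        // they only reshuffle existing data, so they yield no files.
        return false;
      case DataOperations.DELETE:
        if (!skipDelete) {
          throw new UnsupportedOperationException(
              "Cannot stream past a delete snapshot");
        }
        return false;
      case DataOperations.OVERWRITE:
        if (!skipOverwrite) {
          throw new UnsupportedOperationException(
              "Cannot stream past an overwrite snapshot");
        }
        return false;
      default:
        throw new UnsupportedOperationException(
            "Unknown operation: " + snapshot.operation());
    }
  }
}
```

If something like this returns false for the `replace` snapshot while
`latestOffset()` fails to advance past it (or advances inconsistently), that
mismatch could explain the stuck stream.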

Are these two methods the right places that determine which Iceberg data
files/partitions `MicroBatchExecution` will process on each trigger?
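
For reference, this is how I currently picture the per-trigger flow, as a
simplified sketch against Spark's `MicroBatchStream` connector interface (the
real `MicroBatchExecution` loop also handles checkpointing, rate limits, and
no-data triggers):

```java
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.streaming.MicroBatchStream;
import org.apache.spark.sql.connector.read.streaming.Offset;

final class MicroBatchLoopSketch {
  // Simplified model of what the driver does once per trigger.
  static Offset runOneTrigger(MicroBatchStream stream, Offset start) {
    // The end StreamingOffset for this batch comes from latestOffset().
    Offset end = stream.latestOffset();
    // Iceberg's planFiles() feeds the partitions returned here.
    InputPartition[] splits = stream.planInputPartitions(start, end);
    // ... launch tasks over `splits`, then mark the batch durable ...
    stream.commit(end);
    return end; // becomes the start offset of the next batch
  }
}
```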
