Hi Chetas,

See my answers below:


On Tue, Apr 16, 2024, 06:39 Chetas Joshi <chetas.jo...@gmail.com> wrote:

> Hello,
>
> I am running a batch flink job to read an iceberg table. I want to
> understand a few things.
>
> 1. How does the FlinkSplitPlanner decide which fileScanTasks (I think one
> task corresponds to one data file) need to be clubbed together within a
> single split and when to create a new split?
>

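Split planning is controlled by the generic read properties for Iceberg
tables:
https://iceberg.apache.org/docs/nightly/configuration/#read-properties

The most interesting ones for you are:
- read.split.target-size: target size when combining data input splits
- read.split.metadata-target-size: the same target, for metadata input splits
- read.split.planning-lookback: number of bins to consider when combining
  input splits
- read.split.open-file-cost: estimated cost to open a file, used as a
  minimum weight when combining splits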
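
To make the interaction between these properties concrete, here is a minimal
sketch of the bin-packing idea in plain Java. It is not the Iceberg
implementation: the class and method names are made up for illustration, it
ignores delete files, and it assumes lookback >= 1. Files are first cut into
chunks of at most the target size, each chunk is weighted as
max(size, open-file-cost), and chunks are packed into at most "lookback" open
bins, where each finished bin is one split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Iceberg.
final class SplitPlanningSketch {

  // Cut every file into chunks of at most targetSize bytes.
  static List<Long> cutIntoChunks(long[] fileSizes, long targetSize) {
    List<Long> chunks = new ArrayList<>();
    for (long size : fileSizes) {
      long remaining = size;
      while (remaining > targetSize) {
        chunks.add(targetSize);
        remaining -= targetSize;
      }
      if (remaining > 0) {
        chunks.add(remaining);
      }
    }
    return chunks;
  }

  // Pack the chunks into bins; each returned inner list stands for one split.
  static List<List<Long>> plan(
      long[] fileSizes, long targetSize, int lookback, long openFileCost) {
    List<List<Long>> closed = new ArrayList<>();
    List<List<Long>> open = new ArrayList<>();
    List<Long> openWeights = new ArrayList<>();

    for (long chunk : cutIntoChunks(fileSizes, targetSize)) {
      // Small chunks still "cost" at least openFileCost.
      long weight = Math.max(chunk, openFileCost);

      // Look for the first open bin that still has room.
      int target = -1;
      for (int i = 0; i < open.size(); i++) {
        if (openWeights.get(i) + weight <= targetSize) {
          target = i;
          break;
        }
      }

      if (target < 0) {
        // No room anywhere: if the lookback window is full, close the heaviest bin.
        if (open.size() == lookback) {
          int largest = 0;
          for (int i = 1; i < open.size(); i++) {
            if (openWeights.get(i) > openWeights.get(largest)) {
              largest = i;
            }
          }
          closed.add(open.remove(largest));
          openWeights.remove(largest);
        }
        open.add(new ArrayList<>());
        openWeights.add(0L);
        target = open.size() - 1;
      }

      open.get(target).add(chunk);
      openWeights.set(target, openWeights.get(target) + weight);
    }

    closed.addAll(open);
    return closed;
  }

  public static void main(String[] args) {
    // Sizes are in arbitrary units (think MB): one large file and three small ones.
    long[] files = {300, 10, 10, 10};
    System.out.println(plan(files, 128, 10, 4));

    // Same small files, but a high open-file cost limits how many share a split.
    long[] small = {1, 1, 1, 1};
    System.out.println(plan(small, 128, 10, 64));
  }
}
```

In this sketch the first call yields three splits (two full chunks of the
large file, plus its tail combined with the three small files), and the second
call yields two splits of two files each, because every tiny file is weighted
as 64. A larger lookback lets the packer fill earlier bins more tightly, at
the price of reordering files across splits.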

> 2. When the number of task slots is limited, what is the sequence in which
> the splits are assigned to the task slots?
> For example,  if there are 4 task slots available but the number of splits
> (source parallelism) to be read is 8, which 4 splits will be sent to the
> task slots first? Where in the codebase does this logic exist?
>

As a general rule, there is no predefined order between the splits, and
because the splits are read in parallel, the order of the records is not
defined either.

It is a fairly low-level API and might be removed in the future, but you can
define your own comparator to order the splits:
https://github.com/apache/iceberg/blob/fbcd142c5dc1ec99792ef8edc1378e3a027fecf7/flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/source/IcebergSource.java#L248

Or you can use the fileSequenceNumber comparator to order the splits based
on the commit order:
https://github.com/apache/iceberg/blob/fbcd142c5dc1ec99792ef8edc1378e3a027fecf7/flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/source/split/SplitComparators.java#L34
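
To illustrate what a split comparator changes for your "8 splits, 4 slots"
example, here is a small plain-Java sketch. It does not use the Iceberg
classes: Split and assignmentOrder are hypothetical stand-ins, and it assumes
(as far as I understand the source) that splits are handed out one at a time
as readers ask for them. With a comparator in place, pending splits behave
like a priority queue, so the first four requests receive the four smallest
splits according to the comparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; these are not the Iceberg classes.
final class OrderedAssignmentSketch {

  // Stand-in for a split: an id plus the sequence number of its data file.
  static final class Split {
    final String id;
    final long fileSequenceNumber;

    Split(String id, long fileSequenceNumber) {
      this.id = id;
      this.fileSequenceNumber = fileSequenceNumber;
    }
  }

  // Returns the split ids in the order they would be handed to requesting readers.
  static List<String> assignmentOrder(List<Split> planned, Comparator<Split> comparator) {
    PriorityQueue<Split> pending = new PriorityQueue<>(comparator);
    pending.addAll(planned);
    List<String> order = new ArrayList<>();
    while (!pending.isEmpty()) {
      order.add(pending.poll().id);
    }
    return order;
  }

  public static void main(String[] args) {
    long[] sequenceNumbers = {5, 2, 8, 1, 7, 3, 6, 4};
    List<Split> planned = new ArrayList<>();
    for (long seq : sequenceNumbers) {
      planned.add(new Split("seq-" + seq, seq));
    }

    // Order by commit order, in the spirit of a file sequence number comparator.
    Comparator<Split> bySequence = Comparator.comparingLong(s -> s.fileSequenceNumber);
    List<String> order = assignmentOrder(planned, bySequence);

    int slots = 4;
    System.out.println("first wave: " + order.subList(0, slots));
    System.out.println("handed out as readers free up: " + order.subList(slots, order.size()));
  }
}
```

Note that this only fixes the order in which splits are handed out. The four
readers still run in parallel, so records from the first four splits are
interleaved downstream.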

If you have file statistics collected for the table, you can also use the
watermark settings to get a rough ordering during the reads:

https://iceberg.apache.org/docs/1.5.0/flink-queries/#emitting-watermarks


> Would appreciate any docs, pointers to the codebase that could help me
> understand the above.
>
> Thanks
> Chetas
>
