Hi Chetas,

See my answers below:

On Tue, Apr 16, 2024, 06:39 Chetas Joshi <chetas.jo...@gmail.com> wrote:

> Hello,
>
> I am running a batch flink job to read an iceberg table. I want to
> understand a few things.
>
> 1. How does the FlinkSplitPlanner decide which fileScanTasks (I think one
> task corresponds to one data file) need to be clubbed together within a
> single split and when to create a new split?
>

You can take a look at the generic read properties for Iceberg tables:
https://iceberg.apache.org/docs/nightly/configuration/#read-properties

The most interesting ones for you are:
- read.split.target-size
- read.split.metadata-target-size
- read.split.planning-lookback
- read.split.open-file-cost

> 2. When the number of task slots is limited, what is the sequence in which
> the splits are assigned to the task slots?
> For example, if there are 4 task slots available but the number of splits
> (source parallelism) to be read is 8, which 4 splits will be sent to the
> task slots first? Where in the codebase does this logic exist?
>

As a general rule, there is no pre-defined order between the splits, and
because of the parallelism, the order of the records is not defined.

There is a somewhat low-level API, which might be removed in the future,
that lets you define your own comparator to order the splits:
https://github.com/apache/iceberg/blob/fbcd142c5dc1ec99792ef8edc1378e3a027fecf7/flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/source/IcebergSource.java#L248

Or you can use the fileSequenceNumber comparator to order the splits based
on the commit order:
https://github.com/apache/iceberg/blob/fbcd142c5dc1ec99792ef8edc1378e3a027fecf7/flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/source/split/SplitComparators.java#L34

If you have file statistics collected for the table, you can experiment
with the watermark settings to get some ordering during the reads:
https://iceberg.apache.org/docs/1.5.0/flink-queries/#emitting-watermarks

> Would appreciate any docs, pointers to the codebase that could help me
> understand the above.
>
> Thanks
> Chetas
>
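
P.S. To make the answer to question 1 a bit more concrete, here is a rough,
simplified model of how read.split.target-size and read.split.open-file-cost
interact when file scan tasks are combined into splits. It is only a sketch
and not the actual Iceberg code: the class and method names are hypothetical,
lookback (read.split.planning-lookback) is ignored, and every file is assumed
to be no larger than the target size. The idea, as I understand it, is that
each task is weighted by roughly max(file size, open file cost), and a split
is closed once the next task would push its weight past the target.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of combining file scan tasks into splits.
// NOT Iceberg's implementation: names are hypothetical, there is no lookback,
// and files larger than the target size are not handled.
final class SplitPackingSketch {

    // Packs file sizes (in bytes) into splits in input order. A split is closed
    // when adding the next file would push its total weight past targetSize.
    static List<List<Long>> pack(List<Long> fileSizes, long targetSize, long openFileCost) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentWeight = 0L;
        for (long size : fileSizes) {
            // Small files are weighted as at least openFileCost, so that a split
            // does not end up with a huge number of tiny files.
            long weight = Math.max(size, openFileCost);
            if (!current.isEmpty() && currentWeight + weight > targetSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentWeight = 0L;
            }
            current.add(size);
            currentWeight += weight;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<Long> sizes = List.of(100 * mb, 20 * mb, 10 * mb, mb, mb, mb);
        // Target size of 128 MB and open file cost of 4 MB, chosen for illustration.
        List<List<Long>> splits = pack(sizes, 128 * mb, 4 * mb);
        System.out.println("splits: " + splits.size());
        for (List<Long> split : splits) {
            System.out.println("  files in split: " + split.size());
        }
    }
}
```

With these inputs the model produces two splits: the 100 MB and 20 MB files
together, and the remaining four files together. Raising the open file cost
makes small files "heavier", so fewer of them end up in the same split.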