Hello Iceberg devs,

We are leaning towards having one row group per file, and would love to know if there are any additional considerations that we may have missed.
*Here's my understanding of how/why Parquet historically needed to hold multiple row groups - or at least the major reason:*

1. HDFS had a single NameNode. This created a bottleneck for the operations the NameNode handled (i.e., maintaining the file address table that resolves file names to locations). So, naturally, the HDFS world gravitated towards very large file sizes - in the GBs.
2. In that world, to keep file scans efficient, row groups were introduced - so that stats could be maintained within a given file, letting readers push predicates down inside the file to optimize or avoid a full file scan where applicable. Row groups were also typically sized to the HDFS block size to keep reads/seeks efficient.

*Here's why I feel this additional row group concept is now redundant:*

In the new world, where the storage layer is housed in cloud blob stores, the bottleneck on the file address table is no longer present - behind the scenes it is typically a distributed hash table.
==> So, modelling a very large file is NOT a requirement anymore.
==> This concept of a file having multiple row groups is not really useful.
==> We might very well simply create one row group per file.
==> And of course, we will still need to create reasonably big files (for example, 256 MB), depending on the overall data in a given table, to let the columnar/RLE goodness kick in.

Added advantages of this are:

1. Breaking a very large file into pieces to upload to and download from blob stores requires state maintenance at both the client and the service, which makes it complex and error-prone.
2. Having only file-level stats also puts the Iceberg metadata layer to very good use w.r.t. file pruning.

For the above reasons, we are leaning towards creating one row group per file when we create the Iceberg table. Would love to know your thoughts!

Sreeram
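P.S. For concreteness, here's a minimal sketch of what we have in mind, assuming an already-configured Iceberg Catalog (the table name, schema, and helper method are hypothetical placeholders, not a definitive recipe). The idea is to set the Parquet row-group size equal to the target file size, so the writer rolls to a new data file before it would ever start a second row group. Both thresholds are approximate in practice, since the Parquet writer only checks its buffered size periodically.

```java
import java.util.Map;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.types.Types;

public class OneRowGroupPerFile {

  // Creates a table whose data files target ~256 MB, with the Parquet
  // row-group size set to the same value - so each data file should end
  // up holding a single row group.
  public static void createTable(Catalog catalog) {
    // Hypothetical two-column schema, just for illustration.
    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "payload", Types.StringType.get()));

    long targetSizeBytes = 256L * 1024 * 1024; // 256 MB, as proposed above

    Map<String, String> props = Map.of(
        // Roll to a new data file once ~256 MB has been written ...
        "write.target-file-size-bytes", String.valueOf(targetSizeBytes),
        // ... and don't flush a row group before that same threshold,
        // so the file rolls before a second row group can begin.
        "write.parquet.row-group-size-bytes", String.valueOf(targetSizeBytes));

    catalog.createTable(
        TableIdentifier.parse("db.events"), // hypothetical table name
        schema,
        PartitionSpec.unpartitioned(),
        props);
  }
}
```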