Hello Iceberg devs,

We are leaning towards having one row group per file, and we would love to
know if there are any additional considerations that we may have missed.

*Here's my understanding of how/why Parquet historically needed to hold
multiple row groups - or at least the major reason:*

   1. HDFS had a single NameNode. This created a bottleneck for the
   operations the NameNode handled (i.e., maintaining the file address
   table that resolves file names to locations). So, naturally, the HDFS
   world favored very large files - in the GBs.
   2. In that world, to keep file scans efficient, row groups were
   introduced so that stats could be maintained within a given file; those
   stats let readers push predicates down inside the file and avoid a full
   file scan where applicable (see the sketch after this list). Row groups
   were also sized to the HDFS block size to keep reads/seeks efficient.
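
For context, here is a minimal sketch of that mechanism using parquet-mr's
ParquetFileReader (the file path and column index are placeholders): each
row group appears as a "block" in the footer, carrying per-column min/max
stats that a scanner can consult before reading any data pages.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class RowGroupStats {
      public static void main(String[] args) throws IOException {
        // Open just the footer of a Parquet file ("data.parquet" is a placeholder).
        try (ParquetFileReader reader = ParquetFileReader.open(
            HadoopInputFile.fromPath(new Path("data.parquet"), new Configuration()))) {
          // Each "block" in the footer is one row group; its column chunks
          // carry min/max stats that let a scanner skip row groups that
          // cannot match a pushed-down predicate.
          for (BlockMetaData block : reader.getFooter().getBlocks()) {
            System.out.println("rows=" + block.getRowCount()
                + ", col[0] stats=" + block.getColumns().get(0).getStatistics());
          }
        }
      }
    }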

*Here's why I feel this additional RowGroup concept is redundant:*
In the new world, where the storage layer lives in cloud blob stores, the
bottleneck on the file address table is no longer present, because behind
the scenes it is typically a distributed hash table.
==> So, modelling a very large file is no longer a requirement.
==> The concept of a file having multiple row groups is not really useful.
==> We might very well simply create one row group per file.
==> Of course, we will still need to create reasonably big files (for
example, 256 MB), depending on the overall data in a given table, to let
the columnar/RLE goodness kick in.
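
Concretely, here is a sketch of how we are thinking of configuring this,
assuming Iceberg's documented write properties (write.target-file-size-bytes
and write.parquet.row-group-size-bytes); the 256 MB value is just our
illustrative target:

    import org.apache.iceberg.Table;

    public class OneRowGroupPerFile {
      // Sketch: make the Parquet row-group size equal to the target
      // data-file size so each ~256 MB data file is written as a single
      // row group. "table" is assumed to be an already-loaded Table handle.
      public static void configure(Table table) {
        String size = String.valueOf(256L * 1024 * 1024); // 256 MB
        table.updateProperties()
            .set("write.target-file-size-bytes", size)
            .set("write.parquet.row-group-size-bytes", size)
            .commit();
      }
    }

Setting the row-group size at or above the target file size should make the
writer roll over to a new file before it ever starts a second row group.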

Added advantages of this are:

   1. Breaking a very large file into pieces to upload to and download
   from file stores needs state maintenance at both client and service,
   which makes it complex and error-prone.
   2. Having only file-level stats also puts the Iceberg metadata layer
   to very good use w.r.t. file pruning (see the sketch after this list).
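
To illustrate point 2 (the column name "ts" is hypothetical), this is
roughly how the metadata layer prunes whole files during scan planning,
using the per-file stats Iceberg already keeps in its manifests:

    import java.io.IOException;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.expressions.Expressions;
    import org.apache.iceberg.io.CloseableIterable;

    public class FilePruningSketch {
      public static void plan(Table table) throws IOException {
        // Iceberg evaluates the filter against the per-file column stats
        // kept in manifests, so files that cannot match never reach the
        // Parquet reader at all.
        try (CloseableIterable<FileScanTask> tasks = table.newScan()
            .filter(Expressions.greaterThanOrEqual("ts", "2024-01-01T00:00:00"))
            .planFiles()) {
          for (FileScanTask task : tasks) {
            System.out.println("will scan: " + task.file().path());
          }
        }
      }
    }

With one row group per file, that file-level pruning is effectively the
same granularity as row-group pruning, with no second layer of stats.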

For the above reasons, we are leaning towards creating one row group per
file when creating the Iceberg table.

Would love to know your thoughts!
Sreeram
