This is amazing, @Ryan Blue <b...@tabular.io> & Dan, thank you for the detailed explanation.
On Fri, Jul 2, 2021 at 5:15 PM Ryan Blue <b...@tabular.io> wrote:

> Dan's email gives a lot of great background on why row groups exist and how they are still useful. I'd add that there are a few considerations for choosing a row group size that can affect the choice of how many row groups to target in a file. The first consideration is what Dan pointed out: larger row groups lower potential parallelism. Also, the amount of memory needed to write a file is proportional to the row group size, and the memory needed to read a file is similar, although column projection can help on the read side. Larger row groups use more memory. Large row groups also require larger dictionaries to ensure columns are completely dictionary-encoded.
>
> In short, larger row groups lead to lower parallelism and higher memory consumption, and might fall back to plain encoding. On the other hand, you do need row groups to be large enough to get all the columnar encoding benefits that you can. You tend to get diminishing returns when increasing the row group size, so I recommend setting the row group size to the smallest value beyond which further increases stop producing significantly smaller files. For some datasets that's 16 MB, and for others it's 128 MB or higher.
>
> When considering a file size, you probably don't want to go as low as 16 MB because that's a lot more data for Iceberg to manage. I would probably set file sizes at 1-2x the target split size, or larger if you want to keep table metadata smaller. Setting the file size and row group size independently might yield files that are 1 row group, but I don't see a big advantage to purposely setting them the same. There are advantages to having small row groups (more parallelism, better skipping in tasks) or large files (less table-level metadata and distributed skipping).
> For most small tables, this probably doesn't matter too much, but for large tables, using Parquet stats to skip in tasks (distributed) rather than keeping lots of metadata at the table level is a good capability.
>
> On Fri, Jul 2, 2021 at 4:23 PM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Hey Sreeram,
>>
>> I feel like some of your points about why there are row groups are valid, but there are some really good reasons why you might want to have multiple row groups in a file (and I can share some situations where it can be valuable).
>>
>> If you think historically about how distributed processing worked, it was breaking files up (rather than processing one big file) to achieve parallelism. However, even in that case, you may want to either combine or split files to achieve some desired amount of parallelism. A lot of files are splittable either arbitrarily or at rather granular levels (e.g. text files or Sequence files). However, columnar formats don't have that same luxury due to the physical layout, and boundaries become the product of a batch (or group) of records. This is where the row group concept comes from. While it does help in many ways to align with HDFS blocks, a row group can be considered the smallest indivisible (non-splittable) unit within a Parquet file. This means you cannot efficiently split a row group, but if you have multiple row groups per file, you can maintain large files while keeping flexibility in how you combine or split them to achieve the desired parallelism.
>>
>> This has a number of applications. For example, if you want fewer files overall (e.g., due to the size of a dataset) you could have 128 MB row groups but 1 GB files. This would allow you to split that file, and individual tasks would only be required to process the size of the row group. Alternatively, if you made the entire file a row group, a task would be required to process the full 1 GB.
>> This will have an impact on how you can leverage parallelism to improve overall wall-clock performance. Iceberg even improves upon traditional split planning because it tracks the row group boundaries in Iceberg metadata for more efficient planning.
>>
>> As for the two advantages you propose, I'm not really sure the first holds true, since the Parquet footer holds all of the necessary offsets to load the correct row groups and column offsets. I don't feel like the complexity really changes significantly, and to be efficient, you want to selectively load only the ranges you need.
>>
>> The second point about pruning also has advantages for multiple row groups, because the stats are stored in the footer as well and can be processed across all the row groups, so you can prune out specific row groups within the file (one row group per file would actually be redundant with the stats that Iceberg keeps).
>>
>> There are actually a number of real-world cases where we have reduced the row group size down to 16 MB so that we can increase the parallelism and take more advantage of pruning at multiple levels to get answers faster. At the same time, you wouldn't necessarily want to reduce the file size to achieve the same result, because that would produce significantly more files.
>>
>> Just some things to consider before you head down that path,
>> -Dan
>>
>> On Thu, Jul 1, 2021 at 2:38 PM Sreeram Garlapati <gsreeramku...@gmail.com> wrote:
>>
>>> Hello Iceberg devs,
>>>
>>> We are leaning towards having 1 row group per file. We would love to know if there are any additional considerations that we potentially might have missed.
>>>
>>> *Here's my understanding of how/why Parquet historically needed to hold multiple row groups - more like the major reason:*
>>>
>>> 1. HDFS had a single name node.
>>> This created a bottleneck for the operation handled by the name node (i.e., maintaining that file address table) w.r.t. resolving a file name to its location. So, naturally, the HDFS world created very large file sizes - in GBs.
>>> 2. So, in that world, to keep file scans efficient, row groups were introduced - so that stats can be maintained within a given file, which helps push predicates down inside a given file to optimize/avoid a full file scan, where applicable. Row groups are also configured to the size of the HDFS block to keep reads/seeks efficient.
>>>
>>> *Here's why I feel this additional row group concept is redundant:*
>>> In the new world, where the storage layer is housed in cloud blob stores, this bottleneck on file address tables is no longer present - as, behind the scenes, it is typically a distributed hash table.
>>> ==> So, modelling a very large file is NOT a requirement anymore.
>>> ==> This concept of a file having multiple row groups is not really useful.
>>> ==> We might very well simply create 1 row group per file.
>>> ==> & of course, we will still need to create reasonably big file sizes (for ex: 256 MB), depending on the overall data in a given table, to let the columnar/RLE goodness kick in.
>>>
>>> Added advantages of this are:
>>>
>>> 1. Breaking down a very large file into pieces to upload to and download from file stores needs state maintenance at the client and service, which makes it complex & error-prone.
>>> 2. Having only file-level stats also puts the Iceberg metadata layer to very good use w.r.t. file pruning.
>>>
>>> Due to the above reasons, we are leaning towards creating 1 row group per file when we are creating the Iceberg table.
>>>
>>> Would love to know your thoughts!
>>> Sreeram
>>>
>>
>
> --
> Ryan Blue
> Tabular