As Ryan & Dan said, the trade-offs are roughly:

bigger parquet row groups & orc stripes:
* better compression
* fewer read operations
* lower file metadata overhead
* fewer files to manage
smaller row groups/stripes:
* better parallelism
* lower memory usage

Some of the worst-performing tables that I've seen are ones where the user
configured the table with too many partitions or buckets.

.. Owen

On Tue, Jul 6, 2021 at 11:45 AM Sreeram Garlapati <gsreeramku...@gmail.com>
wrote:

> This is amazing @Ryan Blue <b...@tabular.io> & Dan, thank you for the
> detailed explanation.
>
> On Fri, Jul 2, 2021 at 5:15 PM Ryan Blue <b...@tabular.io> wrote:
>
>> Dan's email gives a lot of great background on why row groups exist and
>> how they are still useful. I'd add that there are a few considerations
>> for choosing a row group size that can affect how many row groups to
>> target in a file. The first consideration is what Dan pointed out:
>> larger row groups lower potential parallelism. Also, the amount of
>> memory needed to write a file is proportional to the row group size,
>> and the memory needed to read a file is similar, although column
>> projection can help on the read side. Larger row groups use more
>> memory, and they also require larger dictionaries to keep columns
>> completely dictionary-encoded.
>>
>> In short, larger row groups lead to lower parallelism, higher memory
>> consumption, and a greater chance of falling back to plain encoding. On
>> the other hand, row groups do need to be large enough to get all the
>> columnar encoding benefits that you can. Increasing the row group size
>> yields diminishing returns, so I recommend setting it to the smallest
>> value beyond which larger groups stop producing significantly smaller
>> files. For some datasets that's 16 MB; for others it's 128 MB or
>> higher.
>>
>> When considering a file size, you probably don't want to go as low as
>> 16 MB because that's a lot more data for Iceberg to manage. I would
>> probably set file sizes at 1-2x the target split size, or larger if you
>> want to keep table metadata smaller. Setting the file size and row
>> group size independently might yield files with a single row group, but
>> I don't see a big advantage to purposely setting them the same. There
>> are advantages to having small row groups (more parallelism, better
>> skipping in tasks) or large files (less table-level metadata and
>> distributed skipping).
>>
>> For most small tables, this probably doesn't matter too much, but for
>> large tables, using Parquet stats to skip in tasks (distributed) rather
>> than keeping lots of metadata at the table level is a good capability.
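>>
>> To make that concrete, here is a rough sketch of how those knobs map to
>> Iceberg table properties (untested; db.events is a made-up table name,
>> the byte values are just the examples above, and it assumes a Spark
>> session with an Iceberg catalog configured):
>>
>>   spark.sql("""
>>       ALTER TABLE db.events SET TBLPROPERTIES (
>>           -- smallest row group size that still compresses well
>>           'write.parquet.row-group-size-bytes' = '16777216',   -- 16 MB
>>           -- data files at ~2x the read split size
>>           'write.target-file-size-bytes' = '268435456',        -- 256 MB
>>           'read.split.target-size' = '134217728'               -- 128 MB
>>       )
>>   """)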
>>
>> On Fri, Jul 2, 2021 at 4:23 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> Hey Sreeram,
>>>
>>> I feel like some of your points about why there are row groups are
>>> valid, but there are some really good reasons why you might want to
>>> have multiple row groups in a file (and I can share some situations
>>> where it can be valuable).
>>>
>>> If you think historically about how distributed processing worked, it
>>> was breaking files up (rather than processing one big file) to achieve
>>> parallelism. Even in that case, you may want to either combine or
>>> split files to reach some desired amount of parallelism. A lot of file
>>> formats are splittable either arbitrarily or at rather granular
>>> boundaries (e.g., text files or Sequence files). Columnar formats
>>> don't have that same luxury: because of the physical layout,
>>> boundaries become the product of a batch (or group) of records. This
>>> is where the row group concept comes from.
>>>
>>> While it does help in many ways to align with HDFS blocks, a row group
>>> can be considered the smallest indivisible (non-splittable) unit
>>> within a parquet file. This means you cannot efficiently split a row
>>> group, but if you have multiple row groups per file, you can keep
>>> large files while retaining the flexibility to combine or split them
>>> to achieve the desired parallelism.
>>>
>>> This has a number of applications. For example, if you want fewer
>>> files overall (e.g., due to the size of a dataset), you could have
>>> 128 MB row groups but 1 GB files. This would allow you to split that
>>> file so that individual tasks only need to process a row group's worth
>>> of data. Alternatively, if you made the entire file a single row
>>> group, a task would be required to process the full 1 GB. This has an
>>> impact on how you can leverage parallelism to improve overall
>>> wall-clock performance. Iceberg even improves upon traditional split
>>> planning because it tracks the row group boundaries in Iceberg
>>> metadata for more efficient planning.
>>>
>>> As for the two advantages you propose, I'm not sure the first holds
>>> true, since the parquet footer holds all of the offsets necessary to
>>> load the correct row groups and columns. I don't feel like the
>>> complexity really changes significantly, and to be efficient you want
>>> to selectively load only the byte ranges you need.
>>>
>>> The second point about pruning also favors multiple row groups,
>>> because the stats are stored in the footer as well and can be
>>> evaluated across all the row groups, letting you prune specific row
>>> groups within the file (with one row group per file, those stats would
>>> actually be redundant with the stats that Iceberg keeps).
>>>
>>> There are actually a number of real-world cases where we have reduced
>>> the row group size down to 16 MB so that we can increase parallelism
>>> and take more advantage of pruning at multiple levels to get answers
>>> faster. At the same time, you wouldn't necessarily want to reduce the
>>> file size to achieve the same result, because it would produce
>>> significantly more files.
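>>>
>>> If it helps to see the mechanics, here's a small pyarrow sketch
>>> (untested; the file name and column names are made up) of what
>>> multiple row groups per file buy you:
>>>
>>>   import pyarrow.parquet as pq
>>>
>>>   pf = pq.ParquetFile("part-00000.parquet")  # e.g. a 1 GB data file
>>>   print(pf.num_row_groups)                   # e.g. 8 groups of ~128 MB
>>>
>>>   # A task can read just its assigned group; if the whole file were
>>>   # one row group, the full 1 GB would be the minimum unit of work.
>>>   table = pf.read_row_group(0, columns=["id", "event_time"])
>>>
>>>   # Footer stats let a reader skip row groups without touching their
>>>   # pages:
>>>   meta = pf.metadata
>>>   for i in range(meta.num_row_groups):
>>>       stats = meta.row_group(i).column(0).statistics
>>>       if stats is not None and stats.has_min_max:
>>>           print(i, stats.min, stats.max)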
>>>
>>> Just some things to consider before you head down that path,
>>> -Dan
>>>
>>> On Thu, Jul 1, 2021 at 2:38 PM Sreeram Garlapati <
>>> gsreeramku...@gmail.com> wrote:
>>>
>>>> Hello Iceberg devs,
>>>>
>>>> We are leaning towards having 1 RowGroup per File. We would love to
>>>> know if there are any additional considerations that we potentially
>>>> would have missed.
>>>>
>>>> *Here's my understanding of how/why Parquet historically needed to
>>>> hold multiple Row Groups - more like the major reason:*
>>>>
>>>>    1. HDFS had a single name node. This created a bottleneck for the
>>>>    operations handled by the name node (i.e., maintaining the file
>>>>    address table) w.r.t. resolving file names to locations. So,
>>>>    naturally, the HDFS world created very large files - in GBs.
>>>>    2. So, in that world, to keep file scans efficient, RowGroups were
>>>>    introduced - so that stats can be maintained within a given file,
>>>>    which helps push predicates down inside the file to optimize or
>>>>    avoid a full file scan, where applicable. RowGroups are also
>>>>    typically configured to match the HDFS block size to keep
>>>>    reads/seeks efficient.
>>>>
>>>> *Here's why I feel this additional RowGroup concept is redundant:*
>>>> In the new world, where the storage layer is housed in cloud blob
>>>> stores, this bottleneck on file address tables is no longer present,
>>>> as behind the scenes it is typically a distributed hash table.
>>>> ==> So, modelling a very large file is NOT a requirement anymore.
>>>> ==> This concept of a file having multiple RowGroups is not really
>>>> useful.
>>>> ==> We might very well simply create 1 RowGroup per File.
>>>> ==> & of course, we will still need to create reasonably big files
>>>> (for ex: 256 MB), depending on the overall data in a given table, to
>>>> let the columnar/RLE goodness kick in.
>>>>
>>>> Added advantages of this are:
>>>>
>>>>    1. Breaking down a very large file into pieces to upload to and
>>>>    download from file stores needs state maintenance at the client
>>>>    and service, which makes it complex & error-prone.
>>>>    2. Having only file-level stats also puts the Iceberg metadata
>>>>    layer to very good use w.r.t. file pruning.
>>>>
>>>> Due to the above reasons, we are leaning towards creating 1 RowGroup
>>>> per File when we are creating the Iceberg table.
>>>>
>>>> Would love to know your thoughts!
>>>> Sreeram
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>
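
Sreeram, if you want to see how the 1-RowGroup-per-File layout compares,
here's a quick pyarrow sketch (untested; the sizes and file names are
made up) that writes both layouts and counts the resulting row groups:

  import pyarrow as pa
  import pyarrow.parquet as pq

  table = pa.table({"id": pa.array(range(1_000_000))})

  # many row groups per file: ~100k rows each
  pq.write_table(table, "multi_group.parquet", row_group_size=100_000)

  # one row group per file: make the group as large as the table itself
  pq.write_table(table, "single_group.parquet", row_group_size=len(table))

  for path in ("multi_group.parquet", "single_group.parquet"):
      print(path, pq.ParquetFile(path).num_row_groups)

With the single-group layout you give up the in-file splitting and
pruning that Dan described; with the multi-group layout you keep both
while still writing large files.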