This is amazing @Ryan Blue <b...@tabular.io> & Dan, thank you both for the
detailed explanations.

On Fri, Jul 2, 2021 at 5:15 PM Ryan Blue <b...@tabular.io> wrote:

> Dan's email gives a lot of great background on why row groups exist and
> how they are still useful. I'd add that there are a few considerations for
> choosing a row group size that can affect the choice of how many row groups
> to target in a file. The first consideration is what Dan pointed out:
> larger row groups lower potential parallelism. The second is that the
> amount of memory needed to write a file is proportional to the row group
> size, and the memory needed to read a file is similar, although column
> projection can help on the read side. The third is that larger row groups
> require larger dictionaries to keep columns completely dictionary-encoded.
>
> In short, larger row groups lead to lower parallelism and higher memory
> consumption, and they might fall back to plain encoding. On the other hand,
> you do need row groups to be large enough to get all the columnar encoding
> benefits that you can. Increasing the row group size gives diminishing
> returns, so I recommend setting the row group size to the smallest value
> beyond which further increases stop producing significantly smaller files.
> For some datasets that's 16 MB; for others it's 128 MB or higher.
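>
> As a rough sketch of how to find that value (using pyarrow, with made-up
> data and row counts; note that pyarrow sizes row groups in rows rather
> than bytes), you can write the same table at a few row group sizes and
> compare the resulting file sizes:
>
>     import os
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     # made-up data; substitute a representative sample of your dataset
>     table = pa.table({"id": list(range(1_000_000)),
>                       "category": ["a", "b", "c", "d"] * 250_000})
>
>     # pyarrow's row_group_size is a row count, not a byte size
>     for rows_per_group in (10_000, 100_000, 1_000_000):
>         path = f"test_{rows_per_group}.parquet"
>         pq.write_table(table, path, row_group_size=rows_per_group)
>         print(rows_per_group, os.path.getsize(path))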
>
> When considering a file size, you probably don't want to go as low as
> 16 MB, because that means a lot more files for Iceberg to manage. I would
> probably set file sizes at 1-2x the target split size, or larger if you
> want to keep table metadata smaller. Setting the file size and row group
> size independently might yield files that are a single row group, but I
> don't see a big advantage in purposely setting them the same. There are
> advantages to having small row groups (more parallelism, better skipping
> in tasks) and to having large files (less table-level metadata and
> distributed skipping).
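>
> For example (a sketch assuming Iceberg's documented write properties and a
> Spark session with an Iceberg catalog configured; the table name is
> hypothetical), the two can be set independently:
>
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.getOrCreate()
>
>     # target ~512 MB data files that contain 16 MB Parquet row groups
>     spark.sql("""
>         ALTER TABLE db.events SET TBLPROPERTIES (
>             'write.target-file-size-bytes' = '536870912',
>             'write.parquet.row-group-size-bytes' = '16777216'
>         )
>     """)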
>
> For most small tables this probably doesn't matter too much, but for large
> tables, using Parquet stats to skip within tasks (distributed) rather than
> keeping lots of metadata at the table level is a good capability.
>
> On Fri, Jul 2, 2021 at 4:23 PM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Hey Sreeram,
>>
>> I feel like some of your points about why row groups exist are valid, but
>> there are some really good reasons why you might want to have multiple row
>> groups in a file (and I can share some situations where that can be
>> valuable).
>>
>> If you think historically about how distributed processing worked, it was
>> breaking files up (rather than processing one big file) to achieve
>> parallelism.  Even in that case, you may want to either combine or split
>> files to reach some desired amount of parallelism.  A lot of formats are
>> splittable either arbitrarily or at rather granular levels (e.g., text
>> files or SequenceFiles).  Columnar formats don't have that same luxury:
>> because of the physical layout, the boundaries become the product of a
>> batch (or group) of records.  This is where the row group concept comes
>> from.  While it does help in many ways to align with HDFS blocks, a row
>> group can be considered the smallest indivisible (non-splittable) unit
>> within a Parquet file.  This means you cannot efficiently split a row
>> group, but if you have multiple row groups per file, you can keep files
>> large while retaining flexibility in how you combine or split them to
>> achieve the desired parallelism.
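>>
>> To make that concrete, here is a rough sketch (plain Python, with made-up
>> sizes) of how a planner can pack row groups into splits of a target size,
>> which is only possible when a file contains more than one row group:
>>
>>     # hypothetical: one 1 GB file containing eight 128 MB row groups
>>     row_group_sizes_mb = [128] * 8
>>
>>     def plan_splits(sizes, target_mb=256):
>>         """Greedily pack consecutive row groups into ~target_mb splits."""
>>         splits, current = [], []
>>         for size in sizes:
>>             if current and sum(current) + size > target_mb:
>>                 splits.append(current)
>>                 current = []
>>             current.append(size)
>>         if current:
>>             splits.append(current)
>>         return splits
>>
>>     # four splits of two row groups each; if the whole file were a single
>>     # row group, the only possible "split" would be the entire 1 GB file
>>     print(plan_splits(row_group_sizes_mb))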
>>
>> This has a number of applications.  For example, if you want fewer files
>> overall (e.g., due to the size of a dataset) you could have 128 MB row
>> groups, but 1 GB files.  This would allow you to split that file so that
>> individual tasks would only be required to process a row group's worth of
>> data.  Alternatively, if you made the entire file a single row group, a
>> task would be required to process the full 1 GB.  This has an impact on
>> how you can leverage parallelism to improve overall wall-clock
>> performance.  Iceberg even improves upon traditional split planning
>> because it tracks the row group boundaries in Iceberg metadata for more
>> efficient planning.
>>
>> As for the two advantages you propose, I'm not really sure the first holds
>> true, since the Parquet footer holds all of the necessary offsets to load
>> the correct row groups and column chunks.  I don't feel like the
>> complexity really changes significantly, and to be efficient you want to
>> selectively load only the byte ranges you need.
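>>
>> As a sketch of that (using pyarrow; the file and column names are made
>> up), the footer alone tells a reader where every row group lives and lets
>> it fetch just one:
>>
>>     import pyarrow.parquet as pq
>>
>>     pf = pq.ParquetFile("part-00000.parquet")  # hypothetical file
>>     md = pf.metadata
>>
>>     # the footer records each row group's row count, size, and offsets
>>     for i in range(md.num_row_groups):
>>         rg = md.row_group(i)
>>         print(i, rg.num_rows, rg.total_byte_size, rg.column(0).file_offset)
>>
>>     # read one row group with one column projected -- only those byte
>>     # ranges need to be fetched from storage
>>     batch = pf.read_row_group(0, columns=["event_time"])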
>>
>> The second point about pruning also favors multiple row groups, because
>> the stats are stored in the footer as well and can be evaluated across all
>> the row groups, so you can prune out specific row groups within the file
>> (with one row group per file, the Parquet stats would actually be
>> redundant with the stats that Iceberg keeps).
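>>
>> Roughly (pyarrow again; the file, column position, and predicate are
>> hypothetical), that per-row-group pruning looks like this:
>>
>>     import pyarrow.parquet as pq
>>
>>     pf = pq.ParquetFile("part-00000.parquet")  # hypothetical file
>>     md = pf.metadata
>>
>>     # keep only row groups whose stats for column 0 can satisfy a
>>     # hypothetical predicate: value >= 500
>>     keep = []
>>     for i in range(md.num_row_groups):
>>         stats = md.row_group(i).column(0).statistics
>>         if stats is not None and stats.has_min_max and stats.max >= 500:
>>             keep.append(i)
>>
>>     table = pf.read_row_groups(keep) if keep else None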
>>
>> There are actually a number of real-world cases where we have reduced the
>> row group size down to 16 MB so that we can increase parallelism and take
>> more advantage of pruning at multiple levels to get answers faster.  At
>> the same time, you wouldn't necessarily want to reduce the file size to
>> achieve the same result, because it would produce significantly more files.
>>
>> Just some things to consider before you head down that path,
>> -Dan
>>
>>
>>
>> On Thu, Jul 1, 2021 at 2:38 PM Sreeram Garlapati <gsreeramku...@gmail.com>
>> wrote:
>>
>>> Hello Iceberg devs,
>>>
>>> We are leaning towards having 1 RowGroup per File. We would love to know
>>> if there are any additional considerations that we may have missed.
>>>
>>> *Here's my understanding of how/why Parquet historically needed to hold
>>> multiple RowGroups - more like the major reason:*
>>>
>>>    1. HDFS had a single name node. This created a bottleneck for the
>>>    operations handled by the name node (i.e., maintaining the file address
>>>    table) w.r.t. resolving a file name to a location. So, naturally, the
>>>    HDFS world created very large files - in GBs.
>>>    2. So, in that world, to keep file scans efficient, RowGroups were
>>>    introduced - so that stats can be maintained within a given file, which
>>>    helps push predicates down inside the file to optimize or avoid a full
>>>    file scan, where applicable. RowGroups were also sized to match the
>>>    HDFS block size to keep reads/seeks efficient.
>>>
>>> *Here's why I feel this additional RowGroup concept is redundant:*
>>> In the new world, where the storage layer is housed in cloud blob stores,
>>> this bottleneck on file address tables is no longer present - behind the
>>> scenes it is typically a distributed hash table.
>>> ==> So, modelling a very large file is NOT a requirement anymore.
>>> ==> The concept of a file having multiple RowGroups is not really useful.
>>> ==> We might very well simply create 1 RowGroup per File.
>>> ==> & of course, we will still need to create reasonably big files (for
>>> ex: 256 MB), depending on the overall data in a given table, to let the
>>> columnar/RLE goodness kick in.
>>>
>>> Added advantages of this are:
>>>
>>>    1. Breaking down a very large file into pieces to upload to and
>>>    download from file stores needs state maintenance at the client and
>>>    the service, which makes it complex & error-prone.
>>>    2. Having only file-level stats also puts the Iceberg metadata layer
>>>    to very good use w.r.t. file pruning.
>>>
>>> For the above reasons, we are leaning towards creating 1 RowGroup per
>>> File when we create the Iceberg table.
>>>
>>> Would love to know your thoughts!
>>> Sreeram
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
