Long story short: Iceberg is itself a commit protocol, so you don't have to 
configure any of the Hadoop commit protocols. Iceberg doesn't use those 
mechanisms because its metadata doesn't rely on the locations of data files 
to track their state. It can write files directly into their final 
locations; only when the metadata is updated do those files actually become 
live. Check the docs for intro blogs and videos.
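
For example, a plain Spark write to an Iceberg table needs no committer 
settings at all. A rough sketch (the catalog name, warehouse path, and table 
name here are made up for illustration):

import org.apache.spark.sql.SparkSession

// Assumes the iceberg-spark-runtime jar is on the classpath; "demo",
// the bucket, and the table name are all hypothetical.
val spark = SparkSession.builder()
  .appName("iceberg-write")
  .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.demo.type", "hadoop")
  .config("spark.sql.catalog.demo.warehouse", "s3a://my-bucket/warehouse")
  .getOrCreate()

// No S3A committer settings anywhere: Iceberg writes the data files
// straight to their final locations and makes them live by committing
// new table metadata.
spark.range(100).toDF("id").writeTo("demo.db.events").createOrReplace()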


As for configuration, Iceberg uses its own Parquet and ORC writing libraries, 
so none of those Spark properties will actually take effect. You don't need 
them anyway; the defaults are adequate for most use cases. Check the Iceberg 
documentation for the table properties that configure Iceberg's Parquet and 
ORC writers.
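
Those writer settings are table properties rather than Spark configs, so 
you'd set them on the table itself. Something like this (the values shown 
are just illustrative, not recommendations; see the Iceberg docs for the 
full list and defaults):

// Writer tuning is per-table, via Iceberg table properties, not Spark conf.
spark.sql("""
  ALTER TABLE demo.db.events SET TBLPROPERTIES (
    'write.parquet.compression-codec' = 'zstd',
    'write.parquet.row-group-size-bytes' = '134217728'
  )
""")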



> On Jul 11, 2023, at 11:59 AM, Perfect Stranger <paulpaul1...@gmail.com> wrote:
> 
> Hello. I am currently reading this:
> https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html
> and learning about the s3a committers.
> 
> It's a bit confusing, and it seems like you need to be an expert to use 
> these committers properly: you don't just write to an s3a path with 
> standard Spark configs, you also need to provide configs for the s3a 
> committers...
> 
> I also saw this: https://github.com/rdblue/s3committer and it says that 
> people should just use Iceberg. Does that mean that with Iceberg you just 
> write to an s3a path, don't have to specify which committer (partitioned, 
> directory, magic) to use, and everything works optimally?
> 
> Does Iceberg have its own committers or something? I know that s3a's 
> staging committers, for example, require enough local storage and HDFS, 
> while s3a's magic committer doesn't... which makes me wonder whether 
> Iceberg has any such requirements too...
> 
> Spark also has this guide:
> https://spark.apache.org/docs/latest/cloud-integration.html
> 
> in which it recommends these settings for Parquet:
> spark.hadoop.parquet.enable.summary-metadata false
> spark.sql.parquet.mergeSchema false
> spark.sql.parquet.filterPushdown true
> spark.sql.hive.metastorePartitionPruning true
> And these settings for ORC:
> spark.sql.orc.filterPushdown true
> spark.sql.orc.splits.include.file.footer true
> spark.sql.orc.cache.stripe.details.size 10000
> spark.sql.hive.metastorePartitionPruning true
> Should I specify these settings when using Parquet or ORC with Iceberg in 
> Spark?
> 
> Thank you.
