Hello. I am currently reading this:
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html
and learning about the s3a committers.

It's a bit confusing, and it seems like you need to be an expert to use these
committers properly: you don't just write to an s3a path with the standard
spark configs, you also have to provide extra configs to choose and set up the
s3a committers...
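
For example, as far as I can tell from those two pages, you need something
like this (copied/adapted from the hadoop-aws and spark cloud-integration
docs, so I may have the details wrong):

spark.hadoop.fs.s3a.committer.name directory
spark.hadoop.fs.s3a.committer.staging.conflict-mode append
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter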

I also saw this: https://github.com/rdblue/s3committer and it says that
people should just use iceberg. Does that mean that with iceberg you just
write to an s3a path, without specifying which committer (partitioned,
directory, magic) to use, and everything works optimally?

Does iceberg have its own committers or something? I know that s3a's staging
committers, for example, need enough local storage plus a cluster filesystem
like hdfs, while s3a's magic committer doesn't, which makes me wonder whether
iceberg has any requirements of its own...
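
For context, this is roughly how I would expect to configure iceberg in spark
based on the iceberg docs (the catalog name and bucket below are just
placeholders), i.e. still just pointing the warehouse at an s3a path:

spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.my_catalog org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type hadoop
spark.sql.catalog.my_catalog.warehouse s3a://my-bucket/warehouse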

Spark also has this guide:
https://spark.apache.org/docs/latest/cloud-integration.html

in which it recommends these settings for parquet:

spark.hadoop.parquet.enable.summary-metadata false
spark.sql.parquet.mergeSchema false
spark.sql.parquet.filterPushdown true
spark.sql.hive.metastorePartitionPruning true

And these settings for orc:

spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true

Should I specify these settings when using parquet or orc with iceberg in
spark?

Thank you.
