Hello,
I ran the TPC-DS benchmark on Hive twice, once using Metastore (the
traditional way) and once using Iceberg, and would like to share the results
with those interested in running Hive on Iceberg.
The experiment used a 1TB TPC-DS dataset stored as ORC.
Here are a few findings.
1. Overall, Hive-Iceberg runs slightly faster than Hive-Metastore.
2. Some queries run much faster with Hive-Iceberg. Examples:
query 14-1: Hive-Metastore: 61 seconds, Hive-Iceberg: 28 seconds
query 78: Hive-Metastore: 141 seconds, Hive-Iceberg: 58 seconds
3. Some queries run much slower with Hive-Iceberg. Example:
query 22: Hive-Metastore: 32 seconds, Hive-Iceberg: 356 seconds
(The slow execution is due to InputInitializer generating only 4 tasks for the
first Map vertex.)
4. Out of 99 queries, 98 return correct results, but query 64 returns a
wrong result (0 rows) due to an exception:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
hdfs://blue0:8020/tmp/hive/user/35d3bdd7-4fda-4f3d-818d-048ad6242072/hive_2022-11-14_15-26-21_045_8992557056967167667-1/-mr-10001/.hive-staging_hive_2022-11-14_15-26-21_045_8992557056967167667-1/-ext-10002
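
For readers who want the timings in findings 2 and 3 as ratios, here is a
minimal Python sketch. The timings are copied from the list above; the
`speedup` helper and the `timings` dictionary are names made up for this
illustration and are not part of the benchmark setup.

```python
# Timings in seconds, copied from findings 2 and 3:
# (Hive-Metastore, Hive-Iceberg)
timings = {
    "query 14-1": (61, 28),
    "query 78": (141, 58),
    "query 22": (32, 356),
}


def speedup(metastore_s, iceberg_s):
    """Return Hive-Metastore time divided by Hive-Iceberg time.

    A value above 1 means Hive-Iceberg was faster; below 1 means slower.
    """
    return metastore_s / iceberg_s


for name, (metastore_s, iceberg_s) in timings.items():
    ratio = speedup(metastore_s, iceberg_s)
    if ratio >= 1:
        print(f"{name}: Hive-Iceberg is {ratio:.1f}x faster")
    else:
        print(f"{name}: Hive-Iceberg is {1 / ratio:.1f}x slower")
```

With these numbers, Hive-Iceberg is a little over 2x faster on queries 14-1
and 78, and roughly 11x slower on query 22.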
--- Sungwoo