Hello,

I ran the TPC-DS benchmark on Hive using Metastore (the traditional way) and using Iceberg, and would like to share the results for those interested in running Hive with Iceberg. The experiment used a 1TB TPC-DS dataset stored as ORC.

Here are a few findings.

1. Overall, Hive-Iceberg runs slightly faster than Hive-Metastore.

2. Some queries run much faster with Hive-Iceberg. Examples:
query 14-1: Hive-Metastore: 61 seconds, Hive-Iceberg: 28 seconds
query 78: Hive-Metastore: 141 seconds, Hive-Iceberg: 58 seconds

3. Some queries run much slower with Hive-Iceberg. Example:
query 22: Hive-Metastore: 32 seconds, Hive-Iceberg: 356 seconds
(The slow execution is due to InputInitializer generating only 4 tasks for the first Map vertex.)

4. Out of 99 queries, 98 return correct results, but query 64 returns a wrong result (0 rows) because of the following exception:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://blue0:8020/tmp/hive/user/35d3bdd7-4fda-4f3d-818d-048ad6242072/hive_2022-11-14_15-26-21_045_8992557056967167667-1/-mr-10001/.hive-staging_hive_2022-11-14_15-26-21_045_8992557056967167667-1/-ext-10002
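
To put the timings in findings 2 and 3 on a common scale, here is a minimal Python sketch that turns them into speedup ratios (Hive-Metastore time divided by Hive-Iceberg time). The numbers are copied from the findings above; the names `timings` and `speedup` are only illustrative and not part of any benchmark tooling.

```python
# Timings in seconds, copied from findings 2 and 3:
# query -> (Hive-Metastore, Hive-Iceberg)
timings = {
    "14-1": (61, 28),
    "78": (141, 58),
    "22": (32, 356),
}


def speedup(metastore_s, iceberg_s):
    """Return how many times faster Hive-Iceberg is; values below 1 mean it is slower."""
    return metastore_s / iceberg_s


ratios = {query: speedup(m, i) for query, (m, i) in timings.items()}

for query, ratio in ratios.items():
    print(f"query {query}: {ratio:.2f}x")
```

With these numbers, queries 14-1 and 78 come out a little over 2x faster with Hive-Iceberg, while query 22 is roughly 11x slower.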

--- Sungwoo


