Hi Hive users,

I am evaluating the performance of Hive 4.0.1 and 4.1, and I wonder if any
team has deployed Hive-LLAP in production (using vanilla Hive) and observed
a similar level of performance improvement when switching from Hive-Tez to
Hive-LLAP.

In the attached Excel document, I report the result of running the 10TB
TPC-DS benchmark in a 13-node cluster using the following systems,
including two reference systems:

1. Hive 4.0.1 on Tez, Java 8
2. Hive-LLAP 4.0.1, Java 8
3. Hive 4.1 on Tez, Java 22
4. Hive-LLAP 4.1, Java 22
5. HDP 3.1.0, Java 8 (Hortonworks Data Platform, based on Hive 3.1 with
lots of patches backported)
6. Spark 4.0.0, Java 22

Going from Hive 4.0.1 on Tez to Hive-LLAP 4.0.1, the total running time
decreases from 12706s to 10019s (about 21%), which is not very impressive,
especially considering the overhead of creating worker containers for each
query in Hive on Tez. On the other hand, the geometric mean decreases from
56s to 31s, so the result seems to be reasonable.
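
To illustrate why the total and the geometric mean can move so differently,
here is a minimal Python sketch. The per-query times in it are hypothetical
and are NOT taken from the spreadsheet; the helper names are also just for
illustration. It only shows that a single long-running query can dominate the
total, while the geometric mean reflects the typical per-query speedup.

```python
import math


def geometric_mean(times):
    """Geometric mean of positive running times, computed via logarithms."""
    return math.exp(sum(math.log(t) for t in times) / len(times))


def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Hypothetical per-query times in seconds (illustration only, not benchmark data).
# Three short queries speed up a lot; one long query barely changes.
tez_times = [20.0, 30.0, 40.0, 2000.0]
llap_times = [5.0, 10.0, 15.0, 1900.0]

total_cut = percent_reduction(sum(tez_times), sum(llap_times))
geomean_cut = percent_reduction(geometric_mean(tez_times),
                                geometric_mean(llap_times))

print(f"total running time reduced by {total_cut:.1f}%")
print(f"geometric mean reduced by {geomean_cut:.1f}%")
```

With numbers shaped like these, the total shrinks only slightly while the
geometric mean drops by more than half, which is the same pattern as in the
results above.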

For comparison, HDP 3.1.0, which was released more than 5 years ago and is
based on Hive 3.1, finishes in 9158s. (You can add a few hundred seconds
because it uses a slight variant of the TPC-DS benchmark.)

A similar observation can be made for Hive 4.1. The total running time
decreases from 10912s to 8262s (about 24%), and the geometric mean
decreases from 50s to 26.9s.
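
For anyone who wants to check the arithmetic, the following small Python
sketch recomputes the relative reductions from the totals and geometric means
quoted in this message. The dictionary keys and the function name are just
labels chosen for this example.

```python
def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# (Hive on Tez, Hive-LLAP) pairs in seconds, as quoted in this message.
results = {
    "4.0.1 total": (12706, 10019),
    "4.0.1 geomean": (56, 31),
    "4.1 total": (10912, 8262),
    "4.1 geomean": (50, 26.9),
}

reductions = {
    name: round(percent_reduction(tez, llap), 1)
    for name, (tez, llap) in results.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% lower with LLAP")
```

In both versions the geometric mean falls roughly twice as much, in relative
terms, as the total running time.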

Note that Hive-LLAP 4.1 is faster than HDP 3.1.0, but it uses Java 22
instead of Java 8. (Query 72 fails due to MapJoinMemoryExhaustionError.)

So, do you think that this result aligns with your expectations for
Hive-Tez vs Hive-LLAP?

Thanks,

Sungwoo

Attachment: tpcds.tez.xlsx
Description: MS-Excel 2007 spreadsheet
