For correctness, we can merge a few pull requests and change the default values
of a few configuration parameters, so that we can get the correct results for
the TPC-DS benchmark.
Another issue is a performance regression when compared with Hive 3.1. I ran the
TPC-DS benchmark using a scale factor of 10TB. Our internal testing shows that
the current snapshot of Hive 4 is 1.5 times slower than Hive 3.1. Here is a
summary of our internal testing on a cluster with 13 nodes, each with 256GB
memory and 6 SSDs.
Systems compared:
1. Trino 417 (using Java 11)
2. Hive 3.1 (a fork maintained by us)
3. Hive 4.0.0-SNAPSHOT (as of February 2023)
Results:
1. Trino 417
   total execution time = 9633 seconds, geometric mean = 28.19 seconds
   Query 21 returns wrong results.
   Query 23 returns wrong results.
   Query 72 fails (with query.max-memory = 1440GB).
2. Hive 3.1
   total execution time = 9900 seconds, geometric mean = 31.67 seconds
   All 99 queries return correct results.
3. Hive 4.0.0-SNAPSHOT
   total execution time = 10584 seconds, geometric mean = 43.72 seconds
   All 99 queries return correct results.
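For reference, here is a minimal sketch of how the two summary metrics relate, using only the totals and geometric means reported above. The helper names (`ratio`, `geometric_mean`) are made up for illustration and are not part of any benchmark tooling; per-query times are not included in this message, so `geometric_mean` is shown only to make the definition explicit.

```python
import math

# Figures copied from the results above (all in seconds).
results = {
    "Trino 417": {"total": 9633, "geomean": 28.19},
    "Hive 3.1": {"total": 9900, "geomean": 31.67},
    "Hive 4.0.0-SNAPSHOT": {"total": 10584, "geomean": 43.72},
}


def ratio(system, baseline, metric):
    """Return how many times larger `metric` is for `system` than for `baseline`."""
    return results[system][metric] / results[baseline][metric]


def geometric_mean(times):
    """Geometric mean of positive per-query times: exp(mean(log(t)))."""
    return math.exp(sum(math.log(t) for t in times) / len(times))


if __name__ == "__main__":
    for metric in ("total", "geomean"):
        r = ratio("Hive 4.0.0-SNAPSHOT", "Hive 3.1", metric)
        print(f"Hive 4 / Hive 3.1 ({metric}): {r:.2f}x")
```

The geometric mean weights every query equally, so it moves more than the total when many short queries slow down; the total is dominated by the few longest-running queries.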
Around the summer of 2020, Hive 4.0.0-SNAPSHOT was noticeably faster than
Hive 3.1, although a few queries returned wrong results.
I am not sure how to fix the performance regression. Git bisecting is not a
practical option because 1) until last year, building 4.0.0-SNAPSHOT was not
smooth because of the Tez dependency; 2) loading 10TB of TPC-DS data for each
commit is too much overhead.
I am thinking about comparing DAG plans from Hive 3.1 and 4.0.0-SNAPSHOT for
those queries that exhibit a performance regression. If you have any
suggestions, please let me know.
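As a rough sketch of what such a comparison could look like: assuming the plan text for a query (for example, the output of EXPLAIN) has been saved from each version, a line-by-line unified diff already highlights where the two plans diverge. The function name `plan_diff` and the labels are hypothetical, and real plans would probably need some normalization first (identifiers or statistics that differ between runs), which this sketch does not attempt.

```python
import difflib


def plan_diff(old_plan, new_plan,
              old_label="hive-3.1", new_label="hive-4.0.0-SNAPSHOT"):
    """Return unified-diff lines between two plan texts (empty list if identical)."""
    return list(
        difflib.unified_diff(
            old_plan.splitlines(),
            new_plan.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


if __name__ == "__main__":
    import sys

    # Usage (file paths are placeholders):
    #   python plan_diff.py plan_hive3.txt plan_hive4.txt
    with open(sys.argv[1]) as f_old, open(sys.argv[2]) as f_new:
        for line in plan_diff(f_old.read(), f_new.read()):
            print(line)
```

Running this only for the queries with the largest slowdown keeps the amount of diff output manageable.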
--- Sungwoo
On Tue, 21 Mar 2023, Stamatis Zampetakis wrote:
Many thanks for running tests with 4.0.0 Sungwoo; it is invaluable
help for getting out a stable Hive 4.
I will review https://issues.apache.org/jira/browse/HIVE-26968 in the
coming weeks; I have assigned myself as reviewer in the PR.
Can some other people (committers or not) help in reviewing the
remaining TPC-DS blockers for which we have a PR?
Reminder: Good non-binding reviews are important and much appreciated
by the community. They are also among the important metrics for
becoming a Hive committer/PMC [1].
Best,
Stamatis
[1] https://cwiki.apache.org/confluence/display/Hive/BecomingACommitter
On Tue, Mar 14, 2023 at 12:07 PM Sungwoo Park <c...@pl.postech.ac.kr> wrote:
Hello,
I would like to expand the list of blockers with HIVE-27138 [1], which fixes an
NPE in mapjoin_filter_on_outerjoin.q.
Currently mapjoin_filter_on_outerjoin.q is tested with the MapReduce execution
engine and shows no problem. However, it shows a few problems when tested with
the Tez execution engine. HIVE-27138 is the first fix found after analyzing
mapjoin_filter_on_outerjoin.q, and Seonggon will create a couple more tickets
later.
In the meantime, it would be great if someone could review the pull requests
for the subtasks in HIVE-26654. (I moved to HIVE-26654 the three tickets for
which I previously requested code review.)
Best,
--- Sungwoo
[1] https://issues.apache.org/jira/browse/HIVE-27138
On Fri, 10 Mar 2023, Stamatis Zampetakis wrote:
Hi Kirti,
Thanks for bringing up this topic.
The master branch already has many new features; we don't need to wait for
more to cut a GA.
The main criterion for going GA is stability; thus I would consider
regressions as the only blockers for the release.
If I recall correctly, the only regressions discovered so far are some
problems with TPC-DS queries, so basically HIVE-26654 [1].
I will let others chime in to include more tickets if necessary.
Best,
Stamatis
[1] https://issues.apache.org/jira/browse/HIVE-26654
On Wed, Mar 8, 2023 at 10:02 AM Kirti Ruge <kirtirug...@gmail.com> wrote:
Hello Hive Dev,
It has been about 6 months since Hive-4.0-alpha-2 was released in Nov 2022.
Would it be a good time to discuss the HIVE-4.0 GA release with the
community? Can we have a discussion on the new features and JDK versions
that we want to support as part of 4.0 GA, and on the timeframe of the release?
Thanks,
Kirti