Hello Gopal,
I have been looking further into this issue, and have found that the
non-deterministic behavior of Hive in
generating DAGs is actually due to the logic in
AggregateStatsCache.findBestMatch() called from
AggregateStatsCache.get(), as well as the disproportionate distribution of
Nulls in
> My conclusion is that a query can update some internal states of HiveServer2,
> affecting DAG generation for subsequent queries.
Other than the automatic reoptimization feature, there are two other potential
suspects.
The first one would be to disable the in-memory stats cache's variance param, wh
Hello Zoltan,
I tested further and found no Exception (such as
MapJoinMemoryExhaustionError) during the run. So, the query ran fine. My
conclusion is that a query can update some internal states of HiveServer2,
affecting DAG generation for subsequent queries. Moreover, the same query
may or may n
Hello Sungwoo!
I think it's possible that reoptimization is kicking in, because the first
execution has bumped into an exception.
I think the plans should not be changing permanently, unless
"hive.query.reexecution.stats.persist.scope" is set to a wider scope than query.
To check that indeed
Hello,
I am running the TPC-DS benchmark using Hive 3.0, and I find that Hive
sometimes produces different DAGs from the same query. These are the two
scenarios for the experiment. The execution engine is Tez, and the TPC-DS
scale factor is 3TB.
1. Run query 19 to query 24 sequentially in the sam