Re: Hive Query Performance Tuning

Matthew Dixon Tue, 03 Dec 2019 02:10:47 -0800

Hi Rajbir, some thoughts to consider,

I’m wondering what the row_number() functionality is doing.  Because the window 
frame has no ORDER BY clause the result may not be deterministic, is this the 
expected behaviour?  I ask because analytic functions can be expensive to 
compute so make sure you definitely need it.  Are you specifically trying to 
remove the first event from all entityID’s history (rank > 1 on lines 116 and 
181) or are you trying to remove duplicates perhaps?  What is the cardinality 
of the entityID fields you’re using to calculate it?  If the cardinality is low 
relative to the overall row count this would definitely impact runtime since 
all data for a single entity must go to the same task.


I’m assuming you’ve got hive.exec.parallel set to something appropriate here?

You’re using the IN keyword to pass a list of entityIDs in the second and third 
queries in the UNION (lines 109 and 174).  You may try rewriting the query to 
JOIN to these since I’m not sure if the map-side join optimisation will be 
leveraged using IN-syntax.

As a general piece of advice I find that watching a Hive application unfold on 
Resource Manager and looking at if certain tasks or stages take longer can be a 
good way to understand what aspects of the query are most expensive to compute.

Matt


From: Rajbir singh <rajbirsm...@gmail.com>
Reply to: "user@hive.apache.org" <user@hive.apache.org>
Date: Tuesday, 3 December 2019 at 09:25
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Hive Query Performance Tuning

Hi All,

I have a hive query which does the aggregation of amounts by reading from hive 
tables and loads the results to another hive table.
I am trying to fine tune the attached query. Read online and came up with 
following. Any Ideas I would be really appreciate. Thank you


      1. Indexing:- We can create Index on the tables. (some folks says index 
actually make it worse) 
https://community.cloudera.com/t5/Support-Questions/Creating-Indexes-in-Hive/td-p/149601<https://community.cloudera.com/t5/Support-Questions/Creating-Indexes-in-Hive/td-p/149601>

      2. Execution :- Right now the hive queries run mapreduce engine. We can 
set the execution engine to Tez for the improved performance (looks like 
Cloudera doesn't support Tez ) 
https://community.cloudera.com/t5/Support-Questions/Tez-Engine-not-working-over-CDH-5-8-2/td-p/49477<https://community.cloudera.com/t5/Support-Questions/Tez-Engine-not-working-over-CDH-5-8-2/td-p/49477>

      3. Bucketing: improves the join performance if the bucket key and join 
keys are common. Bucketing in Hive distributes the data in different buckets    
    based on the hash results on the bucket key. It also reduces the I/O scans 
during the join process if the process is happening on the same keys        
(columns).
            SET hive.enforce.bucketing=true;
            SET hive.optimize.bucketmapjoin=true.

       4. Cost-Based Optimization in Hive (CBO)
           before submitting for final execution Hive optimizes each Query’s 
logical and physical execution plan
           However, CBO, performs, further optimizations based on query cost in 
a recent addition to Hive. That results in potentially different decisions: how 
to order joins, which type of join to                         perform, the 
degree of parallelism and others.
           set hive.cbo.enable=true;
           set hive.compute.query.using.stats=true;
           set hive.stats.fetch.column.stats=true;
           set hive.stats.fetch.partition.stats=true;


       5. Vectorization In Hive
To improve the performance of operations we can use Vectorized query execution. 
It happens by performing them in batches of 1024 rows at once instead of single 
row each time.It<http://time.It> significantly improves query execution time, 
and is easily enabled with two parameters settings
           set hive.vectorized.execution = true
           set hive.vectorized.execution.enabled = true


Bucketing parameters already set to true
Cost-Based Optimization in Hive (CBO) parameters set to true
Vectorization parameters set to true


I am not sure what else I can do to make the query work faster

--
Regards,
Rajbir

Disclaimer

The sender does not guarantee that this message, including any attachment, is 
secure or virus free. Also, it is confidential and may be privileged or 
otherwise protected from disclosure. 
If you are not the intended recipient, do not disclose or copy it or its 
contents. Please telephone or email the sender and delete the message entirely 
from your system. 
No binding obligations or payment commitments are to be derived from the 
contents of this email unless and until a clear written agreement containing 
all the necessary terms and conditions is properly executed. 
Jagex Limited is a company registered in England & Wales with company number 
03982706 and a registered office at 220 Science Park, Milton Road, Cambridge, 
CB4 0WA, UK.

Re: Hive Query Performance Tuning

Reply via email to