Re: Executing Hive Queries in Parallel

Manish Malhotra Sun, 27 Apr 2014 15:01:21 -0700

What Sanjay and Swagatika replied is spot on.

Fundamentally, whether you run the Hive query from the CLI or through some
internal API like HiveDriver, the flow is the same:


>> Compile the query.
>> Get the info from the Hive Metastore using Thrift or JDBC, and optimize
the query (if required and possible).
>> Generate the Java MR code.
>> Push the jobs (more than one may need to run in sequence) to the
JobTracker.
In the final step, whether these MR jobs run in parallel depends on the
queue and on the availability of MR slots on the cluster.

So it does not matter whether you run the queries using nohup hive -f, from
multiple machines, through Oozie, or from your own custom code.
It boils down to two things: your system/code must not submit the queries in
sequence or wait for each one to finish, and your cluster must have enough
resources to run the MR jobs in parallel.
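
As a minimal sketch of the "do not wait" part, assuming a POSIX shell: the
function below starts one hive -f process per query file in the background and
only then waits for all of them. The names run_parallel and HIVE_CMD are
hypothetical and only for illustration (HIVE_CMD lets you substitute a stub
when Hive is not installed). Whether the resulting MR jobs actually overlap
still depends on the scheduler and the free slots on the cluster.

```shell
# Command used to run a query file; defaults to the hive CLI.
# HIVE_CMD is a hypothetical override, not a Hive setting.
HIVE_CMD="${HIVE_CMD:-hive}"

# run_parallel FILE...: submit every query file without waiting in between.
# Returns the number of queries that exited with a non-zero status.
run_parallel() {
    pids=""
    for f in "$@"; do
        # The trailing '&' puts the process in the background, so the loop
        # moves on to the next file immediately instead of blocking.
        "$HIVE_CMD" -f "$f" > "$(basename "$f").log" 2>&1 &
        pids="$pids $!"
    done

    failed=0
    for pid in $pids; do
        # 'wait PID' returns the exit status of that background process.
        wait "$pid" || failed=$((failed + 1))
    done
    return "$failed"
}
```

For example, run_parallel path/to/query/file/hive1.hql path/to/query/file/hive2.hql
writes hive1.hql.log and hive2.hql.log in the current directory and returns
once both queries have finished.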

Regards,
Manish



On Sun, Apr 27, 2014 at 1:58 PM, Swagatika Tripathy <swagatikat...@gmail.com
> wrote:

> Hi,
> You can also use the fork feature of Oozie, a workflow scheduler, to run
> jobs in parallel. You just need to define all your HQL scripts inside the
> workflow.xml to make them run in parallel.
> On Apr 22, 2014 3:14 AM, "Subramanian, Sanjay (HQP)" <
> sanjay.subraman...@roberthalf.com> wrote:
>
>>   Hey
>>
>>  Instead of going into HIVE CLI
>>  I would propose 2 ways
>>
>>  *NOHUP*
>>  nohup hive -f path/to/query/file/hive1.hql >> ./hive1.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
>>  nohup hive -f path/to/query/file/hive2.hql >> ./hive2.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
>>  nohup hive -f path/to/query/file/hive3.hql >> ./hive3.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
>>  nohup hive -f path/to/query/file/hive4.hql >> ./hive4.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
>>  nohup hive -f path/to/query/file/hive5.hql >> ./hive5.hql_`date +%Y-%m-%d-%H-%M-%S`.log 2>&1 &
>>
>>  Each statement above will launch MR jobs on your cluster, and depending
>> on the cluster configs the jobs will run in parallel.
>>  Scheduling jobs on the MR cluster is independent of Hive.
>>
>>  *SCREEN sessions*
>>
>>    - Create a screen session
>>       - screen -S hive_query1
>>       - You are now inside the screen session hive_query1
>>          - hive -f path/to/query/file/hive1.hql
>>       - Ctrl-A D
>>          - You detach from the screen session
>>       - Repeat for each Hive query you want to run
>>       - E.g. 5 screen sessions, each running a Hive query
>>    - To display the active screen sessions
>>       - screen -ls
>>    - To attach to a screen session
>>       - screen -x hive_query1
>>
>>
>>  Thanks
>>
>> Warm Regards
>>
>>
>>  Sanjay
>>
>>
>>    From: saurabh <mpp.databa...@gmail.com>
>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>> Date: Monday, April 21, 2014 at 1:53 PM
>> To: "user@hive.apache.org" <user@hive.apache.org>
>> Subject: Executing Hive Queries in Parallel
>>
>>
>>  Hi,
>>  I need some inputs on executing Hive queries in parallel. I tried doing
>> this using the CLI (by opening multiple ssh connections) and executed 4 HQLs;
>> it was observed that the queries were executed sequentially. All
>> FOUR queries got submitted; however, while the first one was executing,
>> the others were in a pending state. I was performing this activity on
>> EMR running in batch mode, hence I wasn't able to dig into the logs.
>>
>>  The Hive CLI uses a native Hive connection, which by default uses the FIFO
>> scheduler.  This might be one of the reasons for the queries getting
>> executed in sequence.
>>
>>  I also observed that when multiple queries are executed using multiple
>> HUE sessions, it provides the parallel execution functionality. Can you
>> please suggest how the functionality of HUE can be replicated using CLI?
>>
>>  I am aware of the Beeswax client; however, I am not sure how it can be
>> used during EMR batch-mode processing.
>>
>>  Thanks in advance for going through this. Kindly let me know your
>> thoughts on the same.
>>
>>
