I'm not aware of any particular reason that this shouldn't "inherently" work, 
but for debugging purposes I'd look at the nested environment variables 
related to the Hadoop job. The bash shell where you are trying to launch the 
subsequent Hive queries already has Hadoop job environment variables declared 
in it, inherited from the parent streaming job, and I can't say for sure that 
there wouldn't be conflicts there. So while I don't know of any reason that 
it definitely won't work, you are venturing into uncharted territory and may 
uncover unexpected edge cases.
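
One way to test the environment-variable theory would be to launch hive from a 
copy of the environment with the inherited job variables stripped out. Below is 
a minimal sketch in Python, not something I have run against your cluster. 
Hadoop streaming exports job configuration keys to the task environment with 
dots turned into underscores (mapreduce.job.id becomes mapreduce_job_id), but 
the prefix list here is my assumption and so are the helper names scrubbed_env 
and run_hive; check the real variable names with `env` inside a mapper first.

```python
import os
import subprocess

# Assumption: these prefixes cover the variables inherited from the parent
# streaming task. The real set depends on the cluster and Hadoop version.
INHERITED_PREFIXES = ("mapreduce_", "mapred_", "yarn_", "stream_")


def scrubbed_env(env, prefixes=INHERITED_PREFIXES):
    """Return a copy of env without keys that start with any given prefix."""
    return {k: v for k, v in env.items() if not k.startswith(prefixes)}


def run_hive(query, env=None):
    """Run `hive -e <query>` with the scrubbed environment; return the exit code."""
    clean = scrubbed_env(os.environ if env is None else env)
    return subprocess.call(["hive", "-e", query], env=clean)
```

If the query only hangs with the full environment and completes with the 
scrubbed one, that points at a conflicting inherited variable, and you can add 
variables back a few at a time to find which one.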


From: Shirish Tatikonda [mailto:shirish.tatiko...@gmail.com]
Sent: Monday, April 18, 2016 3:44 PM
To: user@hive.apache.org
Subject: Re: Mappers spawning Hive queries

I am using Hive 1.2.1 with MR backend.

Ryan, I hear you. I totally agree. This is not the best approach, and I am in 
fact restructuring the approach.

However, I would like to understand what is going on. In my test run, each hive 
query is computing distinct on a toy table of 10 records -- so, we are 
definitely not running into problems like resource contention. Also, I 
increased the (streaming) mappers' task timeout value (to 1hr) to give the 
shell script (i.e., the hive query) ample time to finish. So, architecturally, 
is there something that limits us from spawning such nested MR jobs -- a 
streaming MR job spawning multiple hive queries that in turn spawn MR jobs?

Shirish


On Mon, Apr 18, 2016 at 1:31 PM, Ryan Harris 
<ryan.har...@zionsbancorp.com> wrote:
My $0.02....

If you are running multiple concurrent queries on the data, you are probably 
doing it wrong (or at least inefficiently), although this somewhat depends on 
what type of files are backing your Hive warehouse.

Let's assume that your data is NOT backed by ORC/Parquet files, and that you 
are NOT using Tez/Spark as your execution engine.

Generally with HDFS, data I/O is going to be the slowest piece. So, with your 
workflow, each hive query is going to need to read ALL of the source data to 
resolve the query. It would be much more efficient if you could write a more 
complex query that could read the source data 1 time (instead of once for each 
parallel operation you are running). Additionally, from an efficiency 
perspective, running queries in parallel might only help improve performance if 
each of your queries requires fewer map tasks than the total capacity of your 
cluster. Otherwise it would generally be more efficient to run your queries 
in series.

If you schedule the work in series, and things get backed up, the job will 
still eventually complete.  If you attempt to do TOO much work in parallel, all 
of the jobs will start timing out and then everything will fail.

There may be a valid reason for approaching the problem the way that you are, 
but I'd encourage you to look at restructuring your approach to the problem to 
save you more headaches down the road.

From: Shirish Tatikonda [mailto:shirish.tatiko...@gmail.com]
Sent: Monday, April 18, 2016 2:00 PM
To: user@hive.apache.org
Subject: Re: Mappers spawning Hive queries

Hi John,

2) The shell script is invoked in the mappers of a Hadoop streaming job.

1) The use case is that I have to process multiple entities in parallel. Each 
entity is associated with its own data set. The processing involves a few hive 
queries to do joins and aggregations, which is followed by some code in Python. 
My thought process is to put the hive queries and python invocation in a shell 
script, and invoke the shell script on multiple entities in parallel through a 
streaming mapreduce job.

Shirish


On Sat, Apr 16, 2016 at 12:10 AM, Jörn Franke <jornfra...@gmail.com> wrote:
Just out of curiosity, what is the use case behind this?

How do you call the shell script?

> On 16 Apr 2016, at 00:24, Shirish Tatikonda <shirish.tatiko...@gmail.com> 
> wrote:
>
> Hello,
>
> I am trying to run multiple hive queries in parallel by submitting them 
> through a map-reduce job.
> More specifically, I have a map-only hadoop streaming job where each mapper 
> runs a shell script that does two things -- 1) parses input lines obtained 
> via streaming; and 2) submits a very simple hive query (via hive -e ...) with 
> parameters computed from step-1.
>
> Now, when I run the streaming job, the mappers seem to be stuck and I don't 
> know what is going on. When I looked on resource manager web UI, I don't see 
> any new MR Jobs (triggered from the hive query). I am trying to understand 
> this behavior.
>
> This may be a bad idea to begin with, and there may be better ways to 
> accomplish the same task. However, I would like to understand the behavior of 
> such a MR job.
>
> Any thoughts?
>
> Thank you,
> Shirish
>

________________________________
THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS CONFIDENTIAL 
and may contain information that is privileged and exempt from disclosure under 
applicable law. If you are neither the intended recipient nor responsible for 
delivering the message to the intended recipient, please note that any 
dissemination, distribution, copying or the taking of any action in reliance 
upon the message is strictly prohibited. If you have received this 
communication in error, please notify the sender immediately. Thank you.

