I'm not aware of any particular reason that this shouldn't "inherently" work, but for debugging purposes I'd be wondering about the nested environment variables related to the Hadoop job. The bash shell where you are trying to launch subsequent Hive queries already has Hadoop job environment variables declared in its environment from the parent streaming job, and I can't say for sure that there wouldn't be conflicts there. So while I don't know of any reason that it definitely won't work, I do know that you are venturing into uncharted territory and may uncover unexpected edge cases.
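One way to test the environment-variable theory is to launch the nested hive command with the inherited job variables stripped out and see whether the behavior changes. Below is a minimal Python sketch under stated assumptions. Hadoop streaming exports job configuration into the task environment with dots replaced by underscores (for example mapreduce_job_id). The prefix list is only a guess at which variables might matter, not a verified list of culprits. The helper names scrubbed_env and run_hive are hypothetical. The `hive -e` invocation is the one described in the original post.

```python
import os
import subprocess

# Assumption: name prefixes of variables inherited from the parent streaming task.
# Adjust after dumping the real environment inside a mapper.
SUSPECT_PREFIXES = ("mapreduce_", "mapred_", "yarn_")


def scrubbed_env(env, prefixes=SUSPECT_PREFIXES):
    """Return a copy of env without variables whose names start with a suspect prefix."""
    return {name: value for name, value in env.items()
            if not name.startswith(prefixes)}


def dropped_names(env, prefixes=SUSPECT_PREFIXES):
    """List the variable names that scrubbed_env would remove, for logging."""
    return sorted(name for name in env if name.startswith(prefixes))


def run_hive(query, env=None):
    """Run `hive -e <query>` in a scrubbed environment and return the CompletedProcess."""
    base = dict(os.environ if env is None else env)
    return subprocess.run(
        ["hive", "-e", query],
        env=scrubbed_env(base),
        capture_output=True,
        text=True,
    )
```

Writing the output of dropped_names(os.environ) to stderr from inside a mapper would show which variables the nested hive process would otherwise inherit. If the query still hangs with those removed, the environment is probably not the cause.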
From: Shirish Tatikonda [mailto:shirish.tatiko...@gmail.com]
Sent: Monday, April 18, 2016 3:44 PM
To: user@hive.apache.org
Subject: Re: Mappers spawning Hive queries

I am using Hive 1.2.1 with MR backend.

Ryan, I hear you. I totally agree. This is not the best approach, and I am in fact restructuring the approach. However, I would like to understand what is going on.

In my test run, each hive query is computing distinct on a toy table of 10 records -- so, we are definitely not running into problems like resource contention. Also, I increased the (streaming) mappers' task timeout value (to 1hr) so that I give ample time for the shell script (i.e., the hive query) to finish.

So, architecturally, is there something that limits us from spawning such nested MR jobs -- a streaming MR job spawning multiple hive queries that in turn spawn MR jobs?

Shirish

On Mon, Apr 18, 2016 at 1:31 PM, Ryan Harris <ryan.har...@zionsbancorp.com> wrote:

My $0.02...

If you are running multiple concurrent queries on the data, you are probably doing it wrong (or at least inefficiently), although this somewhat depends on what type of files are backing your hive warehouse.

Let's assume that your data is NOT backed by ORC/parquet files, and that you are NOT using Tez/Spark as your execution engine.

Generally with HDFS, data I/O is going to be the slowest piece, so with your workflow, each hive query is going to need to read ALL of the source data to resolve the query. It would be much more efficient if you could write a more complex query that could read the source data 1 time (instead of however many parallel operations you are running). Additionally, from an efficiency perspective, running queries in parallel might only help improve performance if each of your queries requires fewer map tasks than the total capacity of your cluster; otherwise it would generally be more efficient to run your queries in series.
If you schedule the work in series, and things get backed up, the job will still eventually complete. If you attempt to do TOO much work in parallel, all of the jobs will start timing out and then everything will fail.

There may be a valid reason for approaching the problem the way that you are, but I'd encourage you to look at restructuring your approach to the problem to save you more headaches down the road.

From: Shirish Tatikonda [mailto:shirish.tatiko...@gmail.com]
Sent: Monday, April 18, 2016 2:00 PM
To: user@hive.apache.org
Subject: Re: Mappers spawning Hive queries

Hi John,

2) The shell script is invoked in the mappers of a Hadoop streaming job.

1) The use case is that I have to process multiple entities in parallel. Each entity is associated with its own data set. The processing involves a few hive queries to do joins and aggregations, which is followed by some code in Python. My thought process is to put the hive queries and python invocation in a shell script, and invoke the shell script on multiple entities in parallel through a streaming mapreduce job.

Shirish

On Sat, Apr 16, 2016 at 12:10 AM, Jörn Franke <jornfra...@gmail.com> wrote:

Just out of curiosity, what is the use case behind this? How do you call the shell script?

> On 16 Apr 2016, at 00:24, Shirish Tatikonda <shirish.tatiko...@gmail.com> wrote:
>
> Hello,
>
> I am trying to run multiple hive queries in parallel by submitting them through a map-reduce job. More specifically, I have a map-only hadoop streaming job where each mapper runs a shell script that does two things -- 1) parses input lines obtained via streaming; and 2) submits a very simple hive query (via hive -e ...) with parameters computed from step-1.
>
> Now, when I run the streaming job, the mappers seem to be stuck and I don't know what is going on.
> When I looked on resource manager web UI, I don't see any new MR Jobs (triggered from the hive query). I am trying to understand this behavior.
>
> This may be a bad idea to begin with, and there may be better ways to accomplish the same task. However, I would like to understand the behavior of such a MR job.
>
> Any thoughts?
>
> Thank you,
> Shirish

________________________________
THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS CONFIDENTIAL and may contain information that is privileged and exempt from disclosure under applicable law. If you are neither the intended recipient nor responsible for delivering the message to the intended recipient, please note that any dissemination, distribution, copying or the taking of any action in reliance upon the message is strictly prohibited. If you have received this communication in error, please notify the sender immediately. Thank you.