Re: help on failed MR jobs (big hive files)

2012-12-13 Thread Elaine Gan
Dear Nitin and Mark, Thanks for the prompt response. (1) Optimizing the query: I've tried using a function in my WHERE clause and just specifying the date range. It still doesn't work, so I guess I need to check around. But still, this is good advice. Will keep this in mind. (2) heapsize > > by the h
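A minimal sketch of the date-range idea, assuming a hypothetical table partitioned by a string column dt (table and column names are illustrative, not from the thread):

    -- Comparing the partition column directly lets Hive prune partitions,
    -- so only the requested date range is scanned.
    SELECT COUNT(*)
    FROM web_logs
    WHERE dt >= '2012-12-01'
      AND dt <  '2012-12-08';

    -- Wrapping the partition column in a function can defeat pruning and
    -- force a full scan, e.g. WHERE month(dt) = 12.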

Re: A tool to analyze and tune performance for Hive?

2012-12-13 Thread Jie Li
Thanks for the pointer to HiTune. The dataflow graphs in the paper look nice. The potential issues I can see: 1) the data collection requires a Chukwa cluster being set up; seems too heavy-weight? 2) drill-down analysis: besides those graphs shown in the paper, can users further drill down to the

RE: A tool to analyze and tune performance for Hive?

2012-12-13 Thread Zheng, Kai
You may give HiTune & HiBench a try. Just google for them. -Original Message- From: Jie Li [mailto:ji...@cs.duke.edu] Sent: Friday, December 14, 2012 10:02 AM To: user@hive.apache.org Subject: A tool to analyze and tune performance for Hive? Hi everyone, May I know if there is any t

A tool to analyze and tune performance for Hive?

2012-12-13 Thread Jie Li
Hi everyone, May I know if there is any tool available to analyze and tune the performance of Hive queries? And what is the state of the art? I had some experience tuning Pig, based on manually clicking through JT web pages, collecting pieces of data from here and there, and guessing what might be

Re: HiveQL include/load other hiveql script ?

2012-12-13 Thread Edward Capriolo
From the CLI, the source command should read in another file. They can be nested, I believe. On Thursday, December 13, 2012, Alexandre Fouche < alexandre.fou...@cleverscale.com> wrote: > Hi > Is there a HiveQL statement to load, import or execute another HiveQL script, either local or in HDFS? > (
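A minimal sketch of the source command Edward describes, run from the Hive CLI (the path is illustrative; source reads local files):

    -- common.hql might hold shared settings or table definitions.
    source /home/me/common.hql;

    -- A sourced script can itself contain source statements, so, as
    -- noted above, includes can apparently be nested.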

Re: map side join with group by

2012-12-13 Thread Chen Song
Thanks Nitin. This is all I wanted to clarify :) Chen On Thu, Dec 13, 2012 at 2:30 PM, Nitin Pawar wrote: > to improve the speed of the job they created map-only joins so that all > the records associated with a key fall to a map; reducers slow it down. > If the reducer has to do some more work

Re: map side join with group by

2012-12-13 Thread Nitin Pawar
To improve the speed of the job they created map-only joins, so that all the records associated with a key fall to a map; reducers slow it down. If the reducer has to do some more work, then they launch another job. Bear in mind, when we say map-only join we are absolutely sure that speed will inc

Re: map side join with group by

2012-12-13 Thread Chen Song
Nitin, Yeah. My original question is whether there is a way to force Hive (or rather, whether it is possible) to execute the map-side join in the mapper phase and the group by in the reduce phase, so that instead of launching a map-only job (join) and a map-reduce job (group by), it does it all in a single MR job. This

Re: map side join with group by

2012-12-13 Thread Nitin Pawar
Chen, in a map-side join there are no reducers; it's a MAP ONLY job. On Thu, Dec 13, 2012 at 11:54 PM, Chen Song wrote: > Understood the fact that it is impossible in the same MR job if both the join > and group by are gonna happen in the reduce phase (because the join keys > and group by keys are di
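A sketch of the pattern being discussed, using Hive's MAPJOIN hint (table and column names are illustrative):

    -- The hint asks Hive to load the small table (dims) into memory on
    -- each mapper, so the join runs as a map-only stage with no reducers.
    -- The GROUP BY still needs a shuffle on dims.category, which is why
    -- Hive plans a second MR job for the aggregation.
    SELECT /*+ MAPJOIN(dims) */ dims.category, COUNT(*)
    FROM facts JOIN dims ON (facts.dim_id = dims.id)
    GROUP BY dims.category;

Running EXPLAIN on such a query shows the separate stages Nitin and Chen are describing.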

Re: Map side join

2012-12-13 Thread Souvik Banerjee
Thanks for the help. What I did earlier is that I changed the configuration in HDFS and created the table. I expected the block size of the new table to be 32 MB, but I found that while using Cloudera Manager you need to deploy the configuration change to both HDFS and MapReduce. (I did
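A sketch of a per-session alternative to a cluster-wide change, assuming the classic pre-YARN property name dfs.block.size (the value is in bytes and illustrative):

    -- dfs.block.size is a client-side, write-time setting, so it only
    -- affects files created after it is set.
    SET dfs.block.size=33554432;

    -- Files written by this insert should then use 32 MB blocks.
    INSERT OVERWRITE TABLE my_table_32mb
    SELECT * FROM my_table;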

Re: map side join with group by

2012-12-13 Thread Chen Song
Understood the fact that it is impossible in the same MR job if both the join and group by are gonna happen in the reduce phase (because the join keys and group by keys are different). But for a map-side join, the join would be complete by the end of the map phase, and outputs should be ready to be dis

Re: Map side join

2012-12-13 Thread bejoy_ks
Hi Souvik, To have the new HDFS block size take effect on already existing files, you need to re-copy them into HDFS. To play with the number of mappers you can set a smaller value, like 64 MB, for the min and max split sizes: mapred.min.split.size and mapred.max.split.size. Regards Bejoy KS Sent from
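A minimal sketch of Bejoy's split-size suggestion, issued from the Hive CLI before the query (values in bytes; 64 MB as he mentions, and the table name is illustrative):

    -- Smaller splits mean more map tasks over the same input files.
    SET mapred.min.split.size=67108864;
    SET mapred.max.split.size=67108864;

    -- Queries run afterwards in the same session pick up the new sizes.
    SELECT COUNT(*) FROM my_large_table;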

Re: Map side join

2012-12-13 Thread Souvik Banerjee
Hi Bejoy, The input files are non-compressed text files. There are enough free slots in the cluster. Can you please let me know how I can increase the number of mappers? I tried reducing the HDFS block size to 32 MB from 128 MB. I was expecting to get more mappers, but it's still launching the same no of mapp

Re: Map side join

2012-12-13 Thread bejoy_ks
Hi Souvik, Are your input files compressed using some non-splittable compression codec? Do you have enough free slots while this job is running? Make sure that the job is not running locally. Regards Bejoy KS Sent from remote device, Please excuse typos -Original Message- From: Souvik

Re: map side join with group by

2012-12-13 Thread Nitin Pawar
That's because for the first job the join keys, and for the second job the group by keys, are different; you just can't assume the join keys and group by keys will be the same, so they are two different jobs. On Thu, Dec 13, 2012 at 8:26 PM, Chen Song wrote: > Yeah, my abridged version of the query might be a lit

Re: map side join with group by

2012-12-13 Thread Chen Song
Yeah, my abridged version of the query might be a little broken, but my point is that when a query has a map join and a group by, even in its simplified incarnation, it will launch two jobs. I was just wondering why the map join and group by cannot be accomplished in one MR job. Best, Chen On Thu, Dec 13, 2

Re: Increasing map reduce tasks will increase the time of the cpu, does this seem to be correct

2012-12-13 Thread imen Megdiche
OK, but if I were searching for a solution for warehousing big data, Hive is rather the best solution, actually. I know that Facebook uses Hive. 2012/12/13 Mohammad Tariq > I said that because under the hood each query (Hive or Pig) gets converted > into a MapReduce job first, and gives you the result

Re: Increasing map reduce tasks will increase the time of the cpu, does this seem to be correct

2012-12-13 Thread Mohammad Tariq
I said that because under the hood each query (Hive or Pig) gets converted into a MapReduce job first, which then gives you the result. Regards, Mohammad Tariq On Thu, Dec 13, 2012 at 7:51 PM, imen Megdiche wrote: > I don't understand what you mean by "Same holds good for Hive or Pig"; > do you
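One way to see that conversion from the Hive CLI is EXPLAIN, sketched here with an illustrative query:

    -- EXPLAIN prints the plan Hive compiles the query into, including
    -- the MapReduce stage(s) that actually run on the cluster.
    EXPLAIN
    SELECT category, COUNT(*)
    FROM products
    GROUP BY category;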

Re: Increasing map reduce tasks will increase the time of the cpu, does this seem to be correct

2012-12-13 Thread imen Megdiche
I don't understand what you mean by "Same holds good for Hive or Pig". Do you mean I would rather compare data warehouses with Hive or Pig? Great, you help me so much, Mohammad. 2012/12/13 Mohammad Tariq > If you are going to do some OLTP kinda thing, I would not suggest Hadoop. > Same holds

Re: Increasing map reduce tasks will increase the time of the cpu, does this seem to be correct

2012-12-13 Thread Mohammad Tariq
You are welcome. First things first: we can never compare Hadoop with traditional warehouse systems or DBMSs. Both are meant for different purposes. One small example: if you have 1 GB of data, then there is nothing that could match RDBMSs. You'll get the results instantly, as you have specif

HiveQL include/load other hiveql script ?

2012-12-13 Thread Alexandre Fouche
Hi, Is there a HiveQL statement to load, import or execute another HiveQL script, either local or in HDFS? (Yes, I already searched the docs and book.) AF

Re: Increasing map reduce tasks will increase the time of the cpu, does this seem to be correct

2012-12-13 Thread imen Megdiche
Thank you for your explanations. I work in pseudo-distributed mode and not on a cluster. Do your recommendations also apply in this mode, and what can I do to have the execution time improve as a function of the number of map reduce tasks, if that is possible? I don't understand in general ho

Re: Increasing map reduce tasks will increase the time of the cpu, does this seem to be correct

2012-12-13 Thread Mohammad Tariq
Hello Imen, If you have a huge number of tasks, then the overhead of managing map and reduce task creation begins to dominate the total job execution time. Also, more tasks means you need more free CPU slots. If the slots are not free, then the data block of interest will be moved to some other
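A sketch of keeping task counts in line with the available slots, expressed as Hive CLI settings with the pre-YARN property names (values are illustrative; imen's job may be plain MapReduce, where the same properties go in the job configuration):

    -- Raising the minimum split size yields fewer, larger map tasks,
    -- cutting per-task startup overhead.
    SET mapred.min.split.size=268435456;

    -- Cap the number of reduce tasks at what the cluster can run at once.
    SET mapred.reduce.tasks=4;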

Re: Increasing map reduce tasks will increase the time of the cpu, does this seem to be correct

2012-12-13 Thread Nitin Pawar
If the number of maps or reducers your job launches is more than the job queue/cluster capacity, CPU time will increase. On Dec 13, 2012 4:02 PM, "imen Megdiche" wrote: > Hello, > > I am trying to increase the number of map and reduce tasks for a job, and > even for the same data size, I noticed th

RE: REST API for Hive queries?

2012-12-13 Thread Chenbenhua
I agree with Manish Malhotra. You should keep an eye on access speed if you want to display the data in a UI application. Pushing the result data to an RDBMS may be the best choice. From: Manish Malhotra [mailto:manish.hadoop.w...@gmail.com] Sent: December 13, 2012 16:15 To: user@hive.apache.org Subject: Re: REST A

Re: Hive Thrift upgrade to 0.9.0

2012-12-13 Thread Manish Malhotra
Looks like https://issues.apache.org/jira/browse/HCATALOG-541 is also related, though that issue seems to arise when dealing with a large number of partitions. Regards, Manish On Wed, Dec 12, 2012 at 5:59 PM, Shreepadma Venugopalan < shreepa...@cloudera.com> wrote: > > > > On Tue, Dec 11, 2012 at 12:

Re: REST API for Hive queries?

2012-12-13 Thread Jagat Singh
If your requirement is that queries are not going to be run on the fly, then I would suggest the following: 1) Create a Hive script. 2) Combine it with an Oozie workflow to run at a scheduled time and push results to some DB, say MySQL. 3) Use some application to talk to MySQL and generate those reports. Thanks, J
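A minimal sketch of step 1, a Hive script whose output the scheduled workflow could then load into MySQL (names, paths, and the hiveconf variable are illustrative):

    -- daily_report.hql: aggregate one day's data into a directory that a
    -- downstream workflow step can export to the reporting database.
    INSERT OVERWRITE DIRECTORY '/tmp/reports/daily_summary'
    SELECT category, COUNT(*) AS cnt
    FROM events
    WHERE dt = '${hiveconf:report_date}'
    GROUP BY category;

Invoked as, say, hive -hiveconf report_date=2012-12-13 -f daily_report.hql, which is the sort of call an Oozie Hive action wraps.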

Re: REST API for Hive queries?

2012-12-13 Thread Manish Malhotra
Ideally, push the aggregated data to some RDBMS like MySQL and have a REST API, or some API, to enable the UI to build reports or queries out of it. If the use case is ad-hoc queries, then once a query is submitted and the result is generated in batch mode, a REST API can be provided to get the results from HDF