Dear Nitin and Mark,
Thanks for the prompt response.
(1) optimizing the query
I've tried using a function in my WHERE clause and just specifying the date
range. It still doesn't work, so I guess I need to check around.
But still, this is good advice. I'll keep it in mind.
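For reference, a hypothetical sketch of the two WHERE-clause styles in
question (the table name "logs" and its string partition column "dt" are
invented for illustration):

  -- A function applied to the partition column can defeat partition pruning:
  SELECT COUNT(*) FROM logs WHERE year(dt) = 2012 AND month(dt) = 12;

  -- Spelling out the date range on the partition column directly lets Hive
  -- prune partitions:
  SELECT COUNT(*) FROM logs WHERE dt >= '2012-12-01' AND dt <= '2012-12-31';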
(2) heapsize
Thanks for the pointer to HiTune. The dataflow graphs in the paper look nice.
The potential issues I can see:
1) The data collection requires a Chukwa cluster to be set up. That seems
too heavyweight?
2) Drill-down analysis: besides the graphs shown in the paper, can
users further drill down to the
You may give HiTune & HiBench a try. Just google for them.
-Original Message-
From: Jie Li [mailto:ji...@cs.duke.edu]
Sent: Friday, December 14, 2012 10:02 AM
To: user@hive.apache.org
Subject: A tool to analyze and tune performance for Hive?
Hi everyone,
May I know if there is any tool available to analyze and tune the
performance for Hive queries? And what is the state of the art?
I had some experience on tuning Pig, based on manually clicking JT web
pages and collecting pieces of data from here and there, and guessing
what might be
From the CLI, the source command should read in another file. They can be
nested, I believe.
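For example (the script path is made up):

  hive> source /home/me/scripts/init.hql;

The same script can also be run non-interactively from the shell with
hive -f /home/me/scripts/init.hql.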
On Thursday, December 13, 2012, Alexandre Fouche <
alexandre.fou...@cleverscale.com> wrote:
> Hi
> Is there a HiveQL statement to load, import or execute another HiveQL
> script, either local or in HDFS?
> (yes, I already searched the docs and the book)
Thanks Nitin. This is all I wanted to clarify :)
Chen
On Thu, Dec 13, 2012 at 2:30 PM, Nitin Pawar wrote:
> To improve the speed of the job, they created map-only joins so that all
> the records associated with a key fall to a mapper; reducers slow it down.
> If the reducer has to do some more work, then they launch another job.
To improve the speed of the job, they created map-only joins so that all the
records associated with a key fall to a mapper; reducers slow it down. If
the reducer has to do some more work, then they launch another job.
Bear in mind, when we say map-only join, we are absolutely sure that speed
will increase
Nitin
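For what it's worth, a minimal sketch of forcing a map-only join with the
MAPJOIN hint (table and column names are invented; the table named in the
hint is the small one that gets loaded into memory on each mapper):

  SELECT /*+ MAPJOIN(d) */ f.user_id, d.country
  FROM fact_events f
  JOIN dim_users d ON (f.user_id = d.user_id);

  -- Alternatively, let Hive convert eligible joins to map joins on its own:
  set hive.auto.convert.join=true;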
Yeah. My original question is whether there is a way to force Hive (or
rather, whether it is possible) to execute the map-side join in the mapper
phase and the group by in the reduce phase. So instead of launching a
map-only job (join) and a MapReduce job (group by), it would do it all in a
single MR job. This
Chen, in a map-side join there are no reducers; it's a MAP ONLY job.
On Thu, Dec 13, 2012 at 11:54 PM, Chen Song wrote:
> Understood the fact that it is impossible in the same MR job if both the
> join and the group by are going to happen in the reduce phase (because the
> join keys and group-by keys are different).
Thanks for the help.
What I did earlier was change the configuration in HDFS and create the
table. I expected the block size of the new table to be 32 MB. But I found
that when using Cloudera Manager you need to deploy the configuration change
to both HDFS and MapReduce. (I did
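For context, this is roughly the HDFS-side setting involved (the old
property name from that Hadoop generation; 33554432 bytes = 32 MB):

  <!-- hdfs-site.xml -->
  <property>
    <name>dfs.block.size</name>
    <value>33554432</value>
  </property>

Note it only affects files written after the change; existing files keep
their original block size.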
Understood the fact that it is impossible in the same MR job if both the
join and the group by are going to happen in the reduce phase (because the
join keys and group-by keys are different). But for a map-side join, the
join would be complete by the end of the map phase, and the outputs should
be ready to be dis
Hi Souvik
To have the new HDFS block size take effect on already existing files, you
need to re-copy them into HDFS.
To play with the number of mappers you can set a smaller value, like 64 MB,
for the min and max split size:
mapred.min.split.size and mapred.max.split.size
Regards
Bejoy KS
Sent from remote device, Please excuse typos
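A quick sketch of Bejoy's suggestion at the Hive CLI (67108864 bytes =
64 MB; the exact value is just an example):

  set mapred.min.split.size=67108864;
  set mapred.max.split.size=67108864;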
Hi Bejoy,
The input files are non-compressed text files.
There are enough free slots in the cluster.
Can you please let me know how I can increase the number of mappers?
I tried reducing the HDFS block size to 32 MB from 128 MB. I was expecting
to get more mappers, but it's still launching the same number of mappers.
Hi Souvik
Are your input files compressed using some non-splittable compression codec?
Do you have enough free slots while this job is running?
Make sure that the job is not running locally.
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
From: Souvik
That's because the first job uses the join keys and the second job uses the
group-by keys; you just can't assume the join keys and group-by keys will be
the same, so they are two different jobs.
On Thu, Dec 13, 2012 at 8:26 PM, Chen Song wrote:
> Yeah, my abridged version of the query might be a little broken
Yeah, my abridged version of the query might be a little broken, but my
point is that when a query has a map join and a group by, even in its
simplified incarnation, it will launch two jobs. I was just wondering why a
map join and a group by cannot be accomplished in one MR job.
Best,
Chen
On Thu, Dec 13, 2
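For reference, a hypothetical query of the shape being discussed: a map
join feeding a group by (names invented). Even though the join finishes in
the map phase, Hive plans the group by as a second MR job:

  SELECT /*+ MAPJOIN(d) */ d.country, COUNT(*)
  FROM fact_events f
  JOIN dim_users d ON (f.user_id = d.user_id)
  GROUP BY d.country;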
OK, but if I were looking for a solution for warehousing big data, Hive is
rather the best solution actually. I know that Facebook uses Hive.
2012/12/13 Mohammad Tariq
> I said that because under the hood each query (Hive or Pig) gets converted
> into a MapReduce job first, and gives you the result.
I said that because under the hood each query (Hive or Pig) gets converted
into a MapReduce job first, and gives you the result.
Regards,
Mohammad Tariq
On Thu, Dec 13, 2012 at 7:51 PM, imen Megdiche wrote:
> I don't understand what you mean by "Same holds good for Hive or Pig";
> do you mean I would rather compare data warehouses with Hive or Pig.
I don't understand what you mean by "Same holds good for Hive or Pig";
do you mean I would rather compare data warehouses with Hive or Pig?
Great, you help me so much, Mohammad.
2012/12/13 Mohammad Tariq
> If you are going to do some OLTP kinda thing, I would not suggest Hadoop.
> Same holds good for Hive or Pig.
You are welcome.
First things first: we can never compare Hadoop with traditional warehouse
systems or DBMSs. They are meant for different purposes.
One small example: if you have 1 GB of data, there is nothing that could
match an RDBMS. You'll get the results instantly, as you have specif
Hi
Is there a HiveQL statement to load, import or execute another HiveQL
script, either local or in HDFS?
(yes, I already searched the docs and the book)
AF
Thank you for your explanations. I work in pseudo-distributed mode and not
on a cluster. Do your recommendations also apply in this mode, and how can I
get the execution time to scale with the number of map and reduce tasks, if
that is possible?
I don't understand in general ho
Hello Imen,
If you have a huge number of tasks, then the overhead of managing map and
reduce task creation begins to dominate the total job execution time.
Also, more tasks mean you need more free CPU slots. If the slots are not
free, then the data block of interest will be moved to some other
If the number of mappers or reducers your job launches is more than the job
queue/cluster capacity, CPU time will increase.
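A sketch of pinning task counts for such an experiment (values are
arbitrary; note that mapred.map.tasks is only a hint to the framework, since
the actual mapper count is driven by input splits, while
mapred.reduce.tasks is honored):

  set mapred.map.tasks=8;
  set mapred.reduce.tasks=4;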
On Dec 13, 2012 4:02 PM, "imen Megdiche" wrote:
> Hello,
>
> I am trying to increase the number of map and reduce tasks for a job and
> even for the same data size, I noticed th
I agree with Manish Malhotra.
You should keep an eye on access speed if you want to display the data in a
UI application.
Pushing the result data to an RDBMS may be the best choice.
From: Manish Malhotra [mailto:manish.hadoop.w...@gmail.com]
Sent: December 13, 2012 16:15
To: user@hive.apache.org
Subject: Re: REST A
Looks like https://issues.apache.org/jira/browse/HCATALOG-541 is also
related, though that issue seems to be about dealing with a large number of
partitions.
Regards,
Manish
On Wed, Dec 12, 2012 at 5:59 PM, Shreepadma Venugopalan <
shreepa...@cloudera.com> wrote:
> On Tue, Dec 11, 2012 at 12:
If your requirement is that queries are not going to be run on the fly, then
I would suggest the following:
1) Create a Hive script.
2) Combine it with an Oozie workflow to run at a scheduled time and push the
results to some DB, say MySQL (see the sketch after this list).
3) Use some application to talk to MySQL and generate those reports.
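As a sketch of step 2's hand-off, a Sqoop export could push a Hive table's
files into MySQL (host, database, and table names are invented; \001 is
Hive's default field delimiter):

  sqoop export \
    --connect jdbc:mysql://dbhost/reports \
    --username etl \
    --table daily_results \
    --export-dir /user/hive/warehouse/daily_results \
    --input-fields-terminated-by '\001'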
Thanks,
J
Ideally, push the aggregated data to some RDBMS like MySQL and have a REST
API, or some other API, to enable the UI to build reports or queries out of
it.
If the use case is ad-hoc queries, then once the query is submitted and the
result is generated in batch mode, a REST API can be provided to fetch the
results from HDFS