Dear Nitin and Mark,

Thanks for the prompt response.

(1) Optimizing the query
I tried dropping the function from my WHERE clause and just specifying
the date range directly. It still doesn't work, so I guess I need to dig
around a bit more. Still, this is good advice and I will keep it in mind.
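
Just to confirm my understanding, once the date boundaries are computed
in my client script, the query I send to Hive would look roughly like
this (the dates below are only placeholders for whatever my script fills
in):

  -- boundaries computed outside Hive, e.g. in the bash script issuing the query
  INSERT OVERWRITE TABLE B
  SELECT a, b, c
  FROM A
  WHERE request_date >= '2012-11-12'
    AND request_date <  '2012-12-13';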

(2) Heapsize
> > by the heapsize increase, did you mean increasing the hive heapsize or the
> > hadoop mapred heapsize? You will need to increase the heapsize on mapred by
> > setting these properties:
> > set mapred.job.map.memory.mb=6000;
> > set mapred.job.reduce.memory.mb=4000;

I increased the Hadoop heapsize, the one defined in hadoop-env.sh:
export HADOOP_HEAPSIZE=50000

I see, I will increase those mapred settings as well.
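
If I understand correctly, HADOOP_HEAPSIZE only covers the Hadoop
daemons, not the task JVMs, so I plan to set the mapred properties per
session before the insert, roughly like this (the -Xmx value is only my
guess at a starting point):

  SET mapred.job.map.memory.mb=6000;
  SET mapred.job.reduce.memory.mb=4000;
  -- heap for the map/reduce child JVMs themselves
  SET mapred.child.java.opts=-Xmx4096m;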


(3) Partitioning
My data is partitioned by request_date.
Each partition holds around 6 GB of data, which means one month comes to
around 180 GB.
I have 10 datanodes, which means the total input is 180 GB x 10 = 1,800 GB = 1.8 TB.
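
For reference, this is roughly how the table is laid out. The column
types here are only my illustration, but the partition key and the tsv
load match what I am doing:

  -- sketch only: the real columns/types differ
  CREATE TABLE A (a STRING, b STRING, c STRING)
  PARTITIONED BY (request_date STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

  -- one daily partition (~6 GB) is loaded at a time
  LOAD DATA LOCAL INPATH 'tsvfile'
  OVERWRITE INTO TABLE A PARTITION (request_date='2012-12-12');

  -- a literal range on request_date should let Hive prune to ~30 partitions
  SELECT a, b, c FROM A
  WHERE request_date >= '2012-11-12' AND request_date < '2012-12-13';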

(4) Split size
> > If you have a good-sized hadoop cluster, then lower the split size and launch
> > many maps; that way it will get executed quickly

My split size is 128 MB. Is this considered big? I would have guessed it
is small enough.
I also have mapred.tasktracker.map.tasks.maximum=4.
I guess I will need to optimize this part; a sketch of what I plan to try
is below.
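
If I understand the advice correctly, lowering the split size makes Hive
launch more map tasks over the same input, so each map has less to chew
on. Something like this before the insert (64 MB is only an example
value, I still have to see what works on my cluster; as far as I know
mapred.tasktracker.map.tasks.maximum is a tasktracker-side setting that
needs a restart, so I would leave that alone for now):

  -- 64 MB max split, roughly double the number of maps compared to 128 MB
  SET mapred.max.split.size=67108864;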


I will report back with my results.

Thank you.


> Elaine,
> Nitin raises some good points.
> 
> Continuing on the same lines, let's take a closer look at the query:
> insert overwrite table B select a, b, c from table A where
> datediff(to_date(from_unixtime(unix_timestamp('${logdate}'))),
> request_date) <= 30
> 
> In the above query,
> "datediff(to_date(from_unixtime(unix_timestamp('${logdate}'))),
> request_date)" would cause this set of nested functions to be
> evaluated for every record in your 6 GB dataset on the server. It
> would be best if this computation was done in your client (bash script
> or java code issuing hive queries) so that the query that gets sent to
> the Hive server looks like: "request_date >= '2012-01-01' and
> request_date < '2012-06-01'"
> 
> That would shave off a lot of time. If the performance is still poor,
> consider partitioning your data (based on date?), and also make sure you
> don't suffer from the small file problem:
> http://www.cloudera.com/blog/2009/02/the-small-files-problem/
> 
> Good luck!
> Mark
> 
> 
> On Wed, Dec 12, 2012 at 11:36 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
> > 6 GB is nothing. We have done it with a few TB of data in hive.
> > The error you are seeing is on the hadoop side.
> >
> > You can always optimize your query based on the hadoop compute capacity you
> > have, and you will need to design your schema based on the patterns in your
> > data.
> >
> > The problem here could be that you have a function to execute in the where
> > clause. Can you try hard-coding it to a date range and see if you get any
> > improvement?
> >
> > Alternatively, if you can partition your data by date, you will have a
> > smaller dataset to read.
> >
> > If you have a good-sized hadoop cluster, then lower the split size and launch
> > many maps; that way it will get executed quickly
> >
> > by the heapsize increase, did you mean increasing the hive heapsize or the
> > hadoop mapred heapsize? You will need to increase the heapsize on mapred by
> > setting these properties:
> > set mapred.job.map.memory.mb=6000;
> > set mapred.job.reduce.memory.mb=4000;
> >
> >
> >
> > On Wed, Dec 12, 2012 at 3:13 PM, Elaine Gan <elaine-...@gmo.jp> wrote:
> >>
> >> Hi,
> >>
> >> I'm trying to run a program on Hadoop.
> >>
> >> [Input] tsv file
> >>
> >> My program does the following.
> >> (1) Load tsv into hive
> >>       load data local inpath 'tsvfile' overwrite into table A partitioned
> >> by xx
> >> (2) insert overwrite table B select a, b, c from table A where
> >> datediff(to_date(from_unixtime(unix_timestamp('${logdate}'))), 
> >> request_date)
> >> <= 30
> >> (3) Running Mahout
> >>
> >> In step 2, I am trying to retrieve data from hive for the past month.
> >> My hadoop job always stops here.
> >> When I check through my browser utility, it says:
> >>
> >> Diagnostic Info:
> >> # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
> >> LastFailedTask: task_201211291541_0262_m_001800
> >>
> >> Task attempt_201211291541_0262_m_001800_0 failed to report status for 1802
> >> seconds. Killing!
> >> Error: Java heap space
> >> Task attempt_201211291541_0262_m_001800_2 failed to report status for 1800
> >> seconds. Killing!
> >> Task attempt_201211291541_0262_m_001800_3 failed to report status for 1801
> >> seconds. Killing!
> >>
> >>
> >>
> >> Each hive table is big, around 6 GB.
> >>
> >> (1) Is around 6 GB per hive table too big?
> >> (2) I've increased my HEAPSIZE to 50 GB, which I think is far more than
> >> enough. Is there anywhere else I can do some tuning?
> >>
> >>
> >> Thank you.
> >>
> >>
> >>
> >> rei
> >>
> >>
> >
> >
> >
> > --
> > Nitin Pawar


