Elaine, Nitin raises some good points. Continuing along the same lines, let's take a closer look at the query:

    insert overwrite table B
    select a, b, c from table A
    where datediff(to_date(from_unixtime(unix_timestamp('${logdate}'))), request_date) <= 30

In the query above, "datediff(to_date(from_unixtime(unix_timestamp('${logdate}'))), request_date)" forces that whole set of nested functions to be evaluated for every record in your 6 GB dataset on the server. It would be better to do this computation in your client (the bash script or Java code issuing the Hive queries), so that the query that gets sent to the Hive server looks like:

    request_date >= '2012-01-01' and request_date < '2012-06-01'

That would shave off a lot of time.

If the performance is still poor, consider partitioning your data (based on date?), and also make sure you don't suffer from the small files problem:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
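To make the first suggestion concrete, here is a rough, untested sketch of the kind of client-side computation I mean. It assumes a bash wrapper around the hive CLI, GNU date, and that request_date is stored as yyyy-MM-dd strings; the table and column names are just lifted from your query:

    #!/bin/bash
    # Untested sketch: compute the 30-day window on the client so the server
    # never evaluates date functions per row. Here ${logdate} and
    # ${start_date} are plain bash variables (not Hive variables), so the
    # query that reaches Hive contains only string literals.
    logdate="$1"                                            # e.g. 2012-06-01
    start_date=$(date -d "$logdate 30 days ago" +%Y-%m-%d)  # 30 days earlier

    hive -e "
      insert overwrite table B
      select a, b, c
      from A
      where request_date >= '${start_date}'
        and request_date <= '${logdate}'
    "

The where clause then just compares a column against two literals, and it is also the shape of predicate that partition pruning can use if you later partition the table by date.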
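If you do go the partitioning route, here is a sketch of what I have in mind, again untested: a one-off dynamic-partition insert that rewrites the existing table into one partition per day (A_by_day and the string column types are placeholders, not anything from your actual schema):

    #!/bin/bash
    # Untested sketch: build a date-partitioned copy of A so that a
    # date-range query reads only ~30 partitions instead of the full 6 GB.
    hive -e "
      set hive.exec.dynamic.partition=true;
      set hive.exec.dynamic.partition.mode=nonstrict;

      create table A_by_day (a string, b string, c string)
      partitioned by (request_date string);

      insert overwrite table A_by_day partition (request_date)
      select a, b, c, request_date
      from A;
    "

One caveat: 6 GB spread over daily partitions means each partition is fairly small, which is exactly where the small-files article above becomes relevant, so don't go finer-grained than you need.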
Good luck!
Mark

On Wed, Dec 12, 2012 at 11:36 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
> 6GB is nothing. We have done it with a few TB of data in Hive.
> The error you are seeing is on the Hadoop side.
>
> You can always optimize your query based on the Hadoop compute capacity
> you have got, and you will also need to design your schema based on the
> patterns in the data.
>
> The problem here can be that you have got a function to execute in the
> where clause. Can you try hard-coding it to a date range and see if you
> get any improvement?
>
> Alternatively, if you can partition your data on a date basis, you will
> have a smaller dataset to read.
>
> If you have got a good-sized Hadoop cluster, then lower the split size
> and launch many maps; that way it will get executed quickly.
>
> By the heapsize increase, did you mean increasing the Hive heapsize or
> the Hadoop mapred heapsize? You will need to increase the heapsize on
> mapred by setting the properties:
> set mapred.job.map.memory.mb=6000;
> set mapred.job.reduce.memory.mb=4000;
>
>
>
> On Wed, Dec 12, 2012 at 3:13 PM, Elaine Gan <elaine-...@gmo.jp> wrote:
>>
>> Hi,
>>
>> I'm trying to run a program on Hadoop.
>>
>> [Input] tsv file
>>
>> My program does the following.
>> (1) Load tsv into Hive:
>>     load data local inpath 'tsvfile' overwrite into table A partitioned
>>     by xx
>> (2) insert overwrite table B select a, b, c from table A where
>>     datediff(to_date(from_unixtime(unix_timestamp('${logdate}'))), request_date)
>>     <= 30
>> (3) Running Mahout
>>
>> In step 2, I am trying to retrieve data from Hive for the past month.
>> My Hadoop job always stops here.
>> When I check through my browser utility it says:
>>
>> Diagnostic Info:
>> # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
>> LastFailedTask: task_201211291541_0262_m_001800
>>
>> Task attempt_201211291541_0262_m_001800_0 failed to report status for 1802
>> seconds. Killing!
>> Error: Java heap space
>> Task attempt_201211291541_0262_m_001800_2 failed to report status for 1800
>> seconds. Killing!
>> Task attempt_201211291541_0262_m_001800_3 failed to report status for 1801
>> seconds. Killing!
>>
>> Each Hive table is big, around 6 GB.
>>
>> (1) Is it too big to have around 6 GB for each Hive table?
>> (2) I've increased my HEAPSIZE to 50G, which I think is far more than
>>     enough. Is there anywhere else I can do the tuning?
>>
>> Thank you.
>>
>> rei
>>
>
>
>
> --
> Nitin Pawar