Re: how to load data to partitioned table

hadoopman Sun, 14 Aug 2011 16:22:43 -0700

DISTRIBUTE BY and CLUSTER BY didn't resolve all the issues I've seen with very large data sets. I mean I'm loading a couple of terabytes in a dataset and running into some rather interesting problems. I noticed, however, that loading a month or two at a time (and making sure the data was all from the same time period) seems to resolve the problems I kept hitting over and over again.

I have to keep reminding myself that Hive/Hadoop isn't a database and not to treat it as such. :-)



On 08/14/2011 10:15 AM, bejoy...@yahoo.com wrote:

Yeah, I very much agree with you on those lines. Using just the basic queries can easily run into memory issues with large datasets. I had some of those resolved by using the DISTRIBUTE BY clause and the like. In short, a little workaround in your Hive queries could help you out in some cases.
Regards
Bejoy K S
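
For illustration, a minimal sketch of the DISTRIBUTE BY workaround mentioned above, reusing the sales and sales_temp tables from the original question. It assumes period_key is the last column of sales_temp and that dynamic partitioning is allowed on the cluster; neither is confirmed in this thread. Distributing on the partition column sends all rows for one partition to the same reducer, so each reducer should have fewer partitions to write at once.

```sql
-- Assumption: dynamic partitioning has to be switched on for this session.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Assumption: period_key is the last column of sales_temp, so SELECT *
-- lines up with the dynamic partition column of sales.
INSERT OVERWRITE TABLE sales PARTITION (period_key)
SELECT *
FROM sales_temp
DISTRIBUTE BY period_key;  -- rows of one partition go to one reducer
```

As noted at the top of the thread, this did not cure every problem on multi-terabyte loads, so treat it as one thing to try rather than a guaranteed fix.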

------------------------------------------------------------------------
*From: * hadoopman <hadoop...@gmail.com>
*Date: *Sun, 14 Aug 2011 08:57:12 -0600
*To: *<user@hive.apache.org>
*ReplyTo: * user@hive.apache.org
*Subject: *Re: how to load data to partitioned table
Something else I've noticed when loading LOTS of historical data: if you can load, say, a month of data at a time, try to load just THAT month of data and only that month. I've been able to load several years of data (depending on the data) in a single load; however, there have been times when loading a large dataset that I would run into memory issues during the reduce phase (usually during shuffle/sort), with errors ranging from out of memory to stack overflow messages (I've compiled a list of the more fun ones).
Then I noticed that loading data from only, say, a single month went quickly and without the memory headaches during the reduce.
Something to keep in mind, and it works great!
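
A rough sketch of what a month-at-a-time load could look like, using the sales and sales_temp tables from the original question. The integer YYYYMMDD encoding of period_key and the date range are assumptions made up for this example; adjust the filter to however the key is actually stored.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical: period_key is an INT such as 20110714 and is the last
-- column of sales_temp. Load July 2011 only, then repeat for each month.
INSERT OVERWRITE TABLE sales PARTITION (period_key)
SELECT *
FROM sales_temp
WHERE period_key >= 20110701
  AND period_key <  20110801;
```

With a dynamic partition insert, only the partitions that actually receive rows should be overwritten, so running this once per month builds the table up piece by piece.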



On 08/12/2011 07:58 AM, bejoy...@yahoo.com wrote:
Hi Daniel
Looking at your requirement, the most hassle-free way to load data into a partitioned Hive table from any input file would be:
1. Load the data into a non-partitioned table that has the same structure as the target table.
2. Populate the target table with the data from the non-partitioned one using the Hive dynamic partition approach.
With dynamic partitions you don't need to manually identify the data partitions and distribute the data accordingly.
A similar implementation is described in the blog post
www.kickstarthadoop.blogspot.com/2011/06/how-to-speed-up-your-hive-queries-in.html
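
The two steps above could be sketched roughly as follows. The column names (sale_id, amount), the INT type of period_key, the comma delimiter, and the file path /tmp/sales.csv are all hypothetical and only there to make the example complete; only the table names and period_key come from the question.

```sql
-- Step 1: a non-partitioned staging table with the same columns as the
-- target, with the partition column as an ordinary last column.
CREATE TABLE sales_temp (
  sale_id    BIGINT,   -- hypothetical column
  amount     DOUBLE,   -- hypothetical column
  period_key INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Hypothetical local path to the CSV file.
LOAD DATA LOCAL INPATH '/tmp/sales.csv' INTO TABLE sales_temp;

-- The partitioned target table.
CREATE TABLE sales (
  sale_id BIGINT,
  amount  DOUBLE
)
PARTITIONED BY (period_key INT);

-- Step 2: let Hive work out the partitions from the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE sales PARTITION (period_key)
SELECT sale_id, amount, period_key   -- partition column must come last
FROM sales_temp;
```

Listing the columns explicitly instead of using SELECT * makes it obvious that the dynamic partition column is the last one selected, which is what Hive uses to decide the target partition.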

Hope it helps

Regards
Bejoy K S

------------------------------------------------------------------------
*From: * Vikas Srivastava <vikas.srivast...@one97.net>
*Date: *Fri, 12 Aug 2011 17:31:28 +0530
*To: *<user@hive.apache.org>
*ReplyTo: * user@hive.apache.org
*Subject: *Re: how to load data to partitioned table

Hey,

You simply have to run a query like this:
FROM sales_temp INSERT OVERWRITE TABLE sales PARTITION (period_key) SELECT *
Regards
Vikas Srivastava


2011/8/12 Daniel,Wu <hadoop...@163.com>

    Suppose the table is partitioned by period_key, and the csv
    file also has a column named period_key. The csv file contains
    multiple days of data. How can we load it into the table?

    I thought of a workaround: first load the data into a
    non-partitioned table, and then insert the data from the
    non-partitioned table into the partitioned table.

    hive> INSERT OVERWRITE TABLE sales SELECT * FROM sales_temp;
    FAILED: Error in semantic analysis: need to specify partition
    columns because the destination table is partitioned.


    However, that doesn't work either. Please help.





--
With Regards
Vikas Srivastava

DWH & Analytics Team
Mob:+91 9560885900
One97 | Let's get talking !
