Mark,
thanks for elaborating. I was unaware of the dynamic partitioning option,
it sounds great!
Gabi

On Thu, Mar 22, 2012 at 3:33 PM, Mark Grover <mgro...@oanda.com> wrote:

> Hi Dan,
> What Gabi is right.
>
> To solve your problem, you could have a non-partitioned table on the raw
> data and run a Hive query that reads this raw data and inserts it into a
> partitioned table. Dynamic partitioning could come in handy in that case.
> Look at
> https://cwiki.apache.org/Hive/tutorial.html#Tutorial-DynamicpartitionInsertfor
>  details.
>
> Mark
>
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com www: fxtrade.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>
> ----- Original Message -----
> From: "Gabi D" <gabi...@gmail.com>
> To: user@hive.apache.org
> Sent: Thursday, March 22, 2012 4:16:58 AM
> Subject: Re: Create Partitioned Table w/ Partition= Substring of Raw Data
>
>
> Dan,
> the partition value does not look at your raw data, you assign a value to
> the partition when you put the data in.
> So what you need to do is this:
>
> Create table mytable (Time string, OtherData string)
> Partition by (danDate string); (never a good idea to give fields a name
> that's a reserved word, such as 'date')
>
> and when you put the data in using load for example, you need to specify
> the actual value you want to assign to 'danDate'.Meaning you should know
> what are the date values that are inside your file (and choose only one).
> (e.g.,
> load data local inpath '<file location>' into table mytable partition(
> danDate='01-01-2000' );
> )
>
> Note that danDate is not a field from within your files, but it is
> actually used by hive to create a subdirectory under your table's hdfs
> location which will be named:
> 'danDate=01-01-2000' (if the value you gave it is 01-01-2000)
> hive always shows it as the last field in your 'describe <table>' commands
> though,again, it is not a regular field.
>
> In this respect, it's better to use a date format that will be comparable,
> such as 'yyyy-mm-dd' so you will be able to run selects such as:
> select count(*) form mytable where danDate >='2012-01-01' and danDate
> <'2012-02-01'
>
> and hive will be able to run this using partition pruning (which means
> only read files in the partition directories needed to satisfy your query).
>
> I hope I didn't go over stuff you already know and that this helps...
>
>
>
> On Wed, Mar 21, 2012 at 5:07 PM, Dan Y < dan.m.ye...@gmail.com > wrote:
>
>
> Hi All,
>
>
> My raw data looks like this:
> DateTime,OtherData
> 01-01-2000-01:00:00,blablabla1
> 01-01-2000-04:00:00,blablabla2
> 01-02-2000-02:00:00,blablabla3
>
>
> I would like to partition on the datepart of DateTime. What does not work,
> unfortunately, is this:
>
> Create table mytable (DateTime string, OtherData string)
> Partition by ( substr(DateTime,1,10) string);
>
>
> I wish my raw data instead looked like:
>
> Date ,Time ,OtherData
>
> 01-01-2000 ,01:00:00 ,blablabla1
> 01-01-2000 ,04:00:00 ,blablabla2
> 01-02-2000 ,02:00:00 ,blablabla3
>
>
> ...with Time a distinct field. Then I could use:
> Create table mytable (Time string, OtherData string)
> Partition by (Date string);
>
>
> Any ideas for the best way to load my raw data into a hive table
> partitioned by the datepart of DateTime? The files are gynormous, so
> manipulating the raw data outside of Hive is not feasible for this problem.
> I would like to avoid using Select in the solution as well since my hive
> table will refer to zipped data (and the Select therefore would come with a
> big runtime cost).
>
>
> Thanks!!
> Dan
>

Reply via email to