Mark, thanks for elaborating. I was unaware of the dynamic partitioning option, it sounds great! Gabi
On Thu, Mar 22, 2012 at 3:33 PM, Mark Grover <mgro...@oanda.com> wrote: > Hi Dan, > What Gabi is right. > > To solve your problem, you could have a non-partitioned table on the raw > data and run a Hive query that reads this raw data and inserts it into a > partitioned table. Dynamic partitioning could come in handy in that case. > Look at > https://cwiki.apache.org/Hive/tutorial.html#Tutorial-DynamicpartitionInsertfor > details. > > Mark > > Mark Grover, Business Intelligence Analyst > OANDA Corporation > > www: oanda.com www: fxtrade.com > > "Best Trading Platform" - World Finance's Forex Awards 2009. > "The One to Watch" - Treasury Today's Adam Smith Awards 2009. > > > ----- Original Message ----- > From: "Gabi D" <gabi...@gmail.com> > To: user@hive.apache.org > Sent: Thursday, March 22, 2012 4:16:58 AM > Subject: Re: Create Partitioned Table w/ Partition= Substring of Raw Data > > > Dan, > the partition value does not look at your raw data, you assign a value to > the partition when you put the data in. > So what you need to do is this: > > Create table mytable (Time string, OtherData string) > Partition by (danDate string); (never a good idea to give fields a name > that's a reserved word, such as 'date') > > and when you put the data in using load for example, you need to specify > the actual value you want to assign to 'danDate'.Meaning you should know > what are the date values that are inside your file (and choose only one). > (e.g., > load data local inpath '<file location>' into table mytable partition( > danDate='01-01-2000' ); > ) > > Note that danDate is not a field from within your files, but it is > actually used by hive to create a subdirectory under your table's hdfs > location which will be named: > 'danDate=01-01-2000' (if the value you gave it is 01-01-2000) > hive always shows it as the last field in your 'describe <table>' commands > though,again, it is not a regular field. > > In this respect, it's better to use a date format that will be comparable, > such as 'yyyy-mm-dd' so you will be able to run selects such as: > select count(*) form mytable where danDate >='2012-01-01' and danDate > <'2012-02-01' > > and hive will be able to run this using partition pruning (which means > only read files in the partition directories needed to satisfy your query). > > I hope I didn't go over stuff you already know and that this helps... > > > > On Wed, Mar 21, 2012 at 5:07 PM, Dan Y < dan.m.ye...@gmail.com > wrote: > > > Hi All, > > > My raw data looks like this: > DateTime,OtherData > 01-01-2000-01:00:00,blablabla1 > 01-01-2000-04:00:00,blablabla2 > 01-02-2000-02:00:00,blablabla3 > > > I would like to partition on the datepart of DateTime. What does not work, > unfortunately, is this: > > Create table mytable (DateTime string, OtherData string) > Partition by ( substr(DateTime,1,10) string); > > > I wish my raw data instead looked like: > > Date ,Time ,OtherData > > 01-01-2000 ,01:00:00 ,blablabla1 > 01-01-2000 ,04:00:00 ,blablabla2 > 01-02-2000 ,02:00:00 ,blablabla3 > > > ...with Time a distinct field. Then I could use: > Create table mytable (Time string, OtherData string) > Partition by (Date string); > > > Any ideas for the best way to load my raw data into a hive table > partitioned by the datepart of DateTime? The files are gynormous, so > manipulating the raw data outside of Hive is not feasible for this problem. > I would like to avoid using Select in the solution as well since my hive > table will refer to zipped data (and the Select therefore would come with a > big runtime cost). > > > Thanks!! > Dan >