Re: Length of an array

2012-03-21 Thread James Warren
https://cwiki.apache.org/Hive/languagemanual-udf.html#LanguageManualUDF-CollectionFunctions

Cheers, -James

On Wed, Mar 21, 2012 at 4:30 PM, Saurabh S wrote:
> How do I get the length of an array in Hive?

Length of an array

2012-03-21 Thread Saurabh S
How do I get the length of an array in Hive? Specifically, I'm looking at the following problem: I'm splitting a column using the split() function and a pattern. However, the resulting array can have a variable number of entries, and I want to handle each case separately.
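A minimal sketch of the usual answer (the size() collection function from the UDF page linked in the reply above; the table name, column name, and delimiter here are made-up illustrations):

```sql
-- size() returns the number of elements in an array (or map);
-- mytable, mycol, and the '|' delimiter are hypothetical.
SELECT size(split(mycol, '\\|')) AS num_parts
FROM mytable;

-- Handling each element count separately, e.g. with CASE:
SELECT CASE size(split(mycol, '\\|'))
         WHEN 2 THEN 'pair'
         WHEN 3 THEN 'triple'
         ELSE 'other'
       END AS shape
FROM mytable;
```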

Re: LOAD DATA problem

2012-03-21 Thread Sean McNamara
I have filed a JIRA that describes the desired 'IF NOT EXISTS' functionality: https://issues.apache.org/jira/browse/HIVE-2889

From: Gabi D <gabi...@gmail.com>
Reply-To: <user@hive.apache.org>
Date: Wed, 21 Mar 2012 10:52:25 +0200
To: <user@hive.apache.org>
Subject: Re: LOAD

Create Partitioned Table w/ Partition= Substring of Raw Data

2012-03-21 Thread Dan Y
Hi All, My raw data looks like this:

DateTime,OtherData
01-01-2000-01:00:00,blablabla1
01-01-2000-04:00:00,blablabla2
01-02-2000-02:00:00,blablabla3

I would like to partition on the date part of DateTime. What does *not* work, unfortunately, is this: Create table mytable (DateTime
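One way to sketch this (an assumption on my part, since the message is truncated): Hive cannot partition directly on an expression over a data column, so the common workaround is a staging table plus a dynamic-partition insert that derives the partition value with substr(). The table names and the 10-character offset below are illustrative:

```sql
-- Staging table holding the raw, unpartitioned CSV rows.
CREATE TABLE staging (dt STRING, otherdata STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Target table partitioned on the date part only.
CREATE TABLE mytable (dt STRING, otherdata STRING)
PARTITIONED BY (datepart STRING);

-- Fill partitions from the first 10 characters of dt ('01-01-2000').
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE mytable PARTITION (datepart)
SELECT dt, otherdata, substr(dt, 1, 10) AS datepart
FROM staging;
```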

Re: HIVE mappers eat a lot of RAM

2012-03-21 Thread Alexander Ershov
I figured it out. To help future generations: the problem was in the property hive.groupby.mapaggr.checkinterval, which defaults to 10. Since I was doing a 'group by' query, each row was 4Kb, and each mapper got only 3 rows, no mapper had an opportunity to do whatever checkinterval option w
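The fix described above amounts to lowering the check interval so map-side aggregation memory use is re-evaluated after fewer rows. A sketch under that assumption (the value 100 and the query below are made-up illustrations, not from the message):

```sql
-- Check the map-side aggregation hash map after every 100 rows instead of
-- the default, so mappers that see few but very large rows still get a
-- chance to flush before running out of RAM.
SET hive.groupby.mapaggr.checkinterval=100;

-- A hypothetical query of the kind described in the thread.
SELECT some_key, count(*)
FROM big_rows_table
GROUP BY some_key;
```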

Re: LOAD DATA problem

2012-03-21 Thread Gabi D
We also do the check before loading the file into Hive, but we're not very happy with this solution. A hack on the backend is better: a hack on the front end has to happen for every file, while a hack on the backend would only happen for duplicate files. So, performance-wise, backend is