Re: The dreaded Heap Space Issue on a Transform

2013-01-30 Thread Philip Tromans
That particular OutOfMemoryError is happening on one of your hadoop nodes. It's the heap within the process forked by the hadoop tasktracker, I think. Phil. On 30 January 2013 14:28, John Omernik wrote: > So just a follow-up. I am less looking for specific troubleshooting on how > to fix my pr

Re: Is this a known Bug: Multi Inserts from partitioned source ignore Where Clauses

2013-01-26 Thread Philip Tromans
This is a known (recently fixed) bug: https://issues.apache.org/jira/browse/HIVE-3699 Phil. On 26 January 2013 15:17, John Omernik wrote: > I ran into an interesting bug. Basically, if your FROM() source is > a partitioned table and you use a where clause that prunes, all of the > INSERT HERE

Re: Join not working in HIVE

2012-12-17 Thread Philip Tromans
Hive doesn't support theta joins. Your best bet is to do a full cross join between the tables, and put your range conditions into the WHERE clause. This may or may not work, depending on the respective sizes of your tables. The fundamental problem is that parallelising a theta (or range) join via
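A minimal sketch of the cross-join-plus-WHERE workaround described above, run against an in-memory SQLite database rather than Hive. The table and column names (events, ranges, ts, lo, hi) are made up for illustration and do not come from the thread.

```python
import sqlite3

# Hypothetical tables: point events and labelled ranges (illustrative names only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id INTEGER, ts INTEGER);
    CREATE TABLE ranges (label TEXT, lo INTEGER, hi INTEGER);
    INSERT INTO events VALUES (1, 5), (2, 15), (3, 25);
    INSERT INTO ranges VALUES ('early', 0, 10), ('late', 20, 30);
""")

# Pair every event with every range, then keep only the pairs that satisfy
# the range condition. The intermediate result has
# (rows in events) x (rows in ranges) rows before filtering.
rows = conn.execute("""
    SELECT e.event_id, r.label
    FROM events e CROSS JOIN ranges r
    WHERE e.ts >= r.lo AND e.ts < r.hi
    ORDER BY e.event_id
""").fetchall()

print(rows)
```

The size of the intermediate cross product is why this approach may or may not be practical, depending on the table sizes.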

Re: need help on writing hive query

2012-10-31 Thread Philip Tromans
You could use collect_set() and GROUP BY. That wouldn't preserve order though. Phil. On Oct 31, 2012 9:18 PM, "qiaoresearcher" wrote: > Hi all, > > here is the question. Assume we have a table like: > > -

Re: Hive Query Unable to distribute load evenly in reducers

2012-10-18 Thread Philip Tromans
I'm really not convinced that there's no skew in your data. Look at the counters from the Hadoop TaskTracker pages, and thoroughly check that the numbers of reducer input records / groups and output records are all similar. Phil. On 18 October 2012 09:56, Saurabh Mishra wrote: > any views on the

Re: Hive Query Unable to distribute load evenly in reducers

2012-10-15 Thread Philip Tromans
Is your data heavily skewed towards certain values of a.x etc? On 15 October 2012 15:23, Saurabh Mishra wrote: > The queries are simple joins, something on the lines of > select a, b, c, count(D) from tableA join tableB on a.x=b.y join group > by a, b,c; > > >> From: liy...@gmail.com >> Date:

RE: ERROR: Hive subquery showing

2012-09-27 Thread Philip Tromans
How about: select name from ABC order by grp desc limit 1? Phil. On Sep 27, 2012 9:02 PM, "yogesh dhari" wrote: > Hi Bejoy, > > I tried this one also but here it throws horrible error: > > i.e: > > hive: select name from ABD where grp=MAX(grp); > > FAILED: Hive Internal Error: java.lang.NullPoi
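The suggested query can be tried out roughly as follows. This is a sketch on SQLite, not Hive, and the rows inserted into ABC are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ABC (name TEXT, grp INTEGER)")
# Invented sample rows, for illustration only.
conn.executemany(
    "INSERT INTO ABC VALUES (?, ?)",
    [("alice", 3), ("bob", 7), ("carol", 9)],
)

# Sort by grp descending and keep the first row, instead of comparing
# against MAX(grp) in the WHERE clause.
rows = conn.execute(
    "SELECT name FROM ABC ORDER BY grp DESC LIMIT 1"
).fetchall()

print(rows)
```

Note that LIMIT 1 returns a single row even if several rows share the maximum grp.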

Re: How to set default value for a certain field?

2012-09-05 Thread Philip Tromans
> > select value,COALESCE(value,3) from testtest; > 1 1 > 1 1 > 2 2 > NULL 3 > NULL 3 > > On Wed, Sep 5, 2012 at 7:52 PM, Philip Tromans > wrote: > > You could do something with the coalesce UDF? > > > > Phil. > > > >
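For reference, the COALESCE query quoted above can be reproduced with the five rows given in the thread (1,1 / 2,1 / 3,2 / 4, / 5,). This sketch uses SQLite as a stand-in for Hive. The ORDER BY id is an addition to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testtest (id INTEGER, value INTEGER)")
# The five rows from the thread; the missing values are stored as NULL.
conn.executemany(
    "INSERT INTO testtest VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2), (4, None), (5, None)],
)

# COALESCE returns its first non-NULL argument, so 3 acts as the default.
rows = conn.execute(
    "SELECT value, COALESCE(value, 3) FROM testtest ORDER BY id"
).fetchall()

for value, with_default in rows:
    print(value, with_default)
```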

Re: How to set default value for a certain field?

2012-09-05 Thread Philip Tromans
You could do something with the coalesce UDF? Phil. On Sep 5, 2012 12:24 AM, "MiaoMiao" wrote: > I have a file whose content is: > 1,1 > 2,1 > 3,2 > 4, > 5, > Then I import it into a hive table. > create external table testtest (id int,value int) row format delimited > fields terminated by ',' s

Re: unexplode?

2012-08-23 Thread Philip Tromans
insert into originalTable select uniqueId, collect_set(whatever) from explodedTable group by uniqueId will probably do the trick. Phil. On 23 August 2012 17:45, Mike Fleming wrote: > I see that hive has a way to take a table and produce multiple rows. > > Is there a built in way to do the revers
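What collect_set() with GROUP BY does to the exploded rows can be modelled in plain Python. This is only an illustration of the grouping semantics, not Hive code, and the sample rows and the unexplode helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical exploded rows: (uniqueId, whatever).
exploded_rows = [(1, "a"), (1, "b"), (1, "a"), (2, "c")]


def unexplode(rows):
    """Group values by id into sets, mimicking collect_set() + GROUP BY."""
    grouped = defaultdict(set)
    for unique_id, value in rows:
        grouped[unique_id].add(value)
    return dict(grouped)


result = unexplode(exploded_rows)
print(result)
```

Because the values are gathered into a set, duplicates are dropped and the original order is not kept.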

Re: About Hive Index

2012-08-21 Thread Philip Tromans
There's a case bug in hive. Put all the names into lower case. I've got a JIRA open about it somewhere. Phil. On Aug 22, 2012 4:39 AM, "Lin" wrote: > Hi, > > I build a compact index IX for table A as follows, > > create index IX on table A(a, b) as 'COMPACT' > with deferred rebuild > in table A_

Re: New Issue raised in Jira

2012-08-14 Thread Philip Tromans
What you're trying to do can be achieved with: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions with a "D" in a format string. See: http://docs.oracle.com/javase/1.4.2/docs/api/java/text/SimpleDateFormat.html Phil. On 14 August 2012 07:30, Deep

Re: NOT IN clause in Hive

2012-08-14 Thread Philip Tromans
https://cwiki.apache.org/Hive/languagemanual-joins.html On 14 August 2012 10:29, Prakrati Agrawal wrote: > Dear Phil, > > Can you be a little more specific about using the left outer join? > > Thanks and Regards, > Prakrati > > -Original Message

Re: NOT IN clause in Hive

2012-08-14 Thread Philip Tromans
Hive doesn't support IN. You'll need to rewrite your query as a left outer join, and check whether the RHS is null. Phil. On 14 August 2012 10:20, Bertrand Dechoux wrote: > According to the error message, you are not using the correct syntax : > https://cwiki.apache.org/confluence/display/Hive/
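A minimal sketch of the left-outer-join rewrite of NOT IN, using SQLite in place of Hive. The tables a and b and their id column are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (2);
""")

# Rows of a with no match in b come back with b.id set to NULL, so
# filtering on "b.id IS NULL" keeps exactly the ids that are not in b.
rows = conn.execute("""
    SELECT a.id
    FROM a LEFT OUTER JOIN b ON a.id = b.id
    WHERE b.id IS NULL
    ORDER BY a.id
""").fetchall()

print(rows)
```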

Re: ORDER BY does not work with CONCAT operation

2012-08-10 Thread Philip Tromans
I think you're ordering by a constant. Give your concat column an alias, and then order by that. Phil. On 10 August 2012 12:26, Joshi, Rekha wrote: > Manisha, when you say concat issue, did you verify the stmt without concat > (just any few fields to test) and that gives ordered data correctly?

Re: Something wrong with my query to get TOP 3?

2012-07-19 Thread Philip Tromans
Your rank() is being evaluated map side. Put your distribute by and sort by in an inner query, and then evaluate your rank() in an outer query. Phil. On Jul 19, 2012 9:00 PM, "comptech geeky" wrote: > This is the below data in my Table1 > > > BID PID TIME > --

Re: Starting hive thrift server as daemon process ?

2012-06-14 Thread Philip Tromans
A really quick (but by no means as good) solution is to use screen. http://www.gnu.org/software/screen/ Phil. On 14 June 2012 13:38, dong.yajun wrote: > Hi Praveenesh > > have a look at > http://blog.milford.io/2010/06/daemonizing-the-apache-hive-thrift-server-on-centos/ > :) > > Thanks . > > >

Importing data into Hive

2012-06-06 Thread Philip Tromans
Hi all, I'm interested in knowing how everyone is importing their data into their production Hive clusters. Let me explain a little more. At the moment, I have log files (which are divided into 5 minute chunks, per event type (of which there are around 10), per server (a few 10s) arriving on one

Re: dynamic partition import

2012-05-29 Thread Philip Tromans
Is there anything interesting in the datanode logs? Phil. On 29 May 2012 10:37, Nitin Pawar wrote: > can you check at least one datanode is running and is not part of blacklisted > nodes > > > On Tue, May 29, 2012 at 3:01 PM, Nimra Choudhary > wrote: >> >> >> >> We are using Dynamic partitioning

Re: Map side aggregations

2012-05-23 Thread Philip Tromans
Hi Ranjith, I haven't checked the code (so this might not be true), but I think that the map side aggregation stuff uses its own hash map within the map phase to do the aggregation, instead of using a combiner, so you wouldn't expect to see any combine input records. Have a look for parameters li

Re: Date format - any easier way

2012-05-15 Thread Philip Tromans
I knocked up the following when we were experimenting with Hive. I've been meaning to go and tidy it up for a while, but using it with a separator of "" (empty string) should have the desired effect. (Obviously the UDF throws an exception if the array is empty, been meaning to fix that for a while.

Re: Lifecycle and Configuration of a hive UDF

2012-04-20 Thread Philip Tromans
Have a read of the thread "Lag function in Hive", linked from: http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/thread There's an example of how to force a function to run reduce-side. I've written a UDF which replicates RANK () OVER (...), but it requires the syntactic sugar given

Re: nested UDFs on Partition column

2012-04-19 Thread Philip Tromans
that left hand side should be evaluated at compile time, which means you have two different values of unix_timestamp() floating around, which can only end badly. Cheers, Phil. On 19 April 2012 16:35, Philip Tromans wrote: > I don't know what the state of Hive's partition pruning is

Re: nested UDFs on Partition column

2012-04-19 Thread Philip Tromans
I don't know what the state of Hive's partition pruning is, but I would imagine that the problem is that the two examples you're giving are fundamentally different. 1) WHERE local_date = date_add('2011-12-07',3), the UDF is a function of some constants, so the constant gets evaluated at compile

Re: Does Hive supports EXISTS keyword in select query?

2012-04-11 Thread Philip Tromans
Hi, Hive supports EXISTS via SEMI JOIN. Have a look at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins Cheers, Phil. On 11 April 2012 13:59, Bhavesh Shah wrote: > Hello all, > I want to query like below in Hive: > Select a.* FROM tblA a JOIN tblB b ON a.field1 = b.field

Re: Lag function in Hive

2012-04-10 Thread Philip Tromans
fine. While the stand alone script works fine, when >>> the record is created in hive using std output from perl - I see 2 records >>> for some of the unique identifiers. I explored the possibility of default >>> data type changes but that does not solve the problem. >>>

Re: Lag function in Hive

2012-04-10 Thread Philip Tromans
Hi Karan, To the best of my knowledge, there isn't one. It's also unlikely to happen because it's hard to parallelise in a map-reduce way (it requires knowing where you are in a result set, and who your neighbours are and they in turn need to be present on the same node as you which is difficult t

Re: Error while reading from task log url

2012-03-29 Thread Philip Tromans
You are running into: https://issues.apache.org/jira/browse/HIVE-1579 I've been meaning to submit a patch for this. I emailed the dev list concerning a patch for it but got no reply... Hive is crashing because it can't pull the debug logs for the failed task, because it's trying to pull them from

Re: Hive server concurrency question

2012-03-28 Thread Philip Tromans
I've used Hive in a multiple connections per server instance setup. It works ok, but it is a little flakey. I have some snapshot of trunk > 0.8.0 deployed. When I have some time, I'd like to help increase the test coverage for multithreaded clients. Phil. On 28 March 2012 19:19, Abhishek Pratap S

Re: how to compute histogram on non-numeric data set?

2012-03-12 Thread Philip Tromans
Is that not just a COUNT(1) and a GROUP BY? Phil. 2012/3/12 Richard : > I have noticed histogram_numeric(col, n), but it seems to require numeric > column. > I have a string column, they are numeric like string but are category label, > e.g, > > 11, 200034 > > two different strings are two di
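The COUNT(1) and GROUP BY idea, sketched on SQLite rather than Hive. The table name labels and its contents are invented, reusing the two example strings from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labels (col TEXT)")
# Category labels stored as strings; the sample rows are hypothetical.
conn.executemany(
    "INSERT INTO labels VALUES (?)",
    [("11",), ("200034",), ("11",)],
)

# One output row per distinct label, with the number of occurrences.
rows = conn.execute("""
    SELECT col, COUNT(1)
    FROM labels
    GROUP BY col
    ORDER BY col
""").fetchall()

print(rows)
```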

Re: Accessing elements from array returned by split() function

2012-03-01 Thread Philip Tromans
I guess that split(...)[1] is giving you what's in between the 1st and 2nd '/' character, which is nothing. Try split(...)[2]. Phil. On 1 March 2012 21:19, Saurabh S wrote: > Hello, > > I have a set of URLs which I need to parse. For example, if the url is, > http://www.google.com/anything/goes/h
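The indexing point can be checked with Python's str.split, which behaves the same way as described above for a plain '/' separator. The URL below is a made-up example, not the one from the thread.

```python
# Hypothetical URL, used only to show where the empty element comes from.
url = "http://www.example.com/a/b"

parts = url.split("/")
print(parts)

# Index 1 is the empty string between the two slashes of "//";
# the host name is at index 2.
between_slashes = parts[1]
host = parts[2]
print(repr(between_slashes), host)
```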

RCFile and LazyBinarySerDe

2012-01-23 Thread Philip Tromans
Hi all, I'm having a problem, where I'm trying to insert into a table which has ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe', and is STORED AS RCFILE. The exception: java.lang.UnsupportedOperationException: Currently the writer can only accept BytesRefArrayWritable