Re: Lifecycle and Configuration of a hive UDF

Mark Grover Fri, 20 Apr 2012 10:35:45 -0700

Hi Ranjan and Justin,

As I understand it, a UDF operates on only one row of data at a time. 
Therefore, it can run entirely on the map side without involving a reducer. 
Depending on where you are storing the result of the query, though, your 
query may still have reducers that do some work.


A simple query like the one Ranjan mentioned,
select MyUDF(field1,field2) from table; 

should have the UDF's evaluate() method (the per-row entry point Hive invokes 
on a UDF) called in the map phase.


Now to Justin's question: the rank function 
(http://msdn.microsoft.com/en-us/library/ms176102%28v=sql.110%29.aspx)
seems to have a syntax like:
RANK ( ) OVER ( [ partition_by_clause ] order_by_clause )

The rank function works on a collection of rows (distributed by some column, 
the same one you would use in your partition_by_clause in MS SQL).
You can accomplish that using a UDAF (read more about them at 
https://cwiki.apache.org/Hive/genericudafcasestudy.html) or by writing a custom 
reducer (read about that at 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform).
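
To make the custom reducer route a bit more concrete, here is a rough sketch 
of what such a script might look like. Everything specific in it is my 
assumption, not something from Justin's setup: the script name 
rank_reducer.py, the two-column tab-separated input (partition key, then the 
value to rank on), and the expectation that Hive has already distributed the 
rows by key and sorted them by key and value before they reach the script.

```python
#!/usr/bin/env python
"""Hypothetical streaming reducer for Hive TRANSFORM (rank_reducer.py).

Assumes each input line is "<key>\t<value>" and that the rows arrive
already grouped by key and ordered by value within each key.
"""
import sys


def rank_rows(lines):
    """Yield (key, value, rank) tuples with RANK()-style numbering.

    Rows with equal values inside a key share a rank, and the next
    distinct value skips ahead to its row position (1, 1, 3, ...).
    """
    prev_key = None
    prev_value = None
    row_number = 0
    rank = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        key, value = line.split("\t", 1)
        new_partition = key != prev_key
        if new_partition:
            row_number = 0  # restart counting for each partition key
        row_number += 1
        if new_partition or value != prev_value:
            rank = row_number  # a new distinct value takes its row position
        prev_key, prev_value = key, value
        yield key, value, rank


if __name__ == "__main__":
    # Hive streams rows on stdin and reads the result rows from stdout.
    for key, value, rank in rank_rows(sys.stdin):
        sys.stdout.write("%s\t%s\t%d\n" % (key, value, rank))
```

On the Hive side this would be wired up with something along the lines of 
"add file rank_reducer.py;" followed by a query that selects 
TRANSFORM(key, value) USING 'python rank_reducer.py' AS (key, value, rnk) 
from a subquery that does DISTRIBUTE BY key SORT BY key, value (the table and 
column names here are placeholders). Note that the script compares values as 
strings, so it only detects ties; the ordering itself has to come from Hive.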

I don't think rank can be done using a UDF.

Good luck!

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 


----- Original Message -----
From: "Justin Coffey" <jqcof...@gmail.com>
To: user@hive.apache.org
Sent: Thursday, April 19, 2012 10:29:11 AM
Subject: Re: Lifecycle and Configuration of a hive UDF

Hello All, 
I second this question. I have an MS SQL "rank" function which I would like to 
run; the results it gives appear to suggest it is executed mapper side as 
opposed to reducer side, even when run with "cluster by" constraints. 


-Justin 


On Thu, Apr 19, 2012 at 1:21 AM, Ranjan Bagchi < ran...@powerreviews.com > 
wrote: 


Hi, 

What's the lifecycle of a hive udf? If I call 

select MyUDF(field1,field2) from table; 

Then MyUDF is instantiated once per mapper, and within each mapper 
execute(field1, field2) is called for each reducer? I hope this is the case, 
but I can't find anything about this in the documentation. 

So I'd like to have some run-time configuration of my UDF: I'm curious how 
people do this. Is there a way I can send it a value or have it access a file, 
etc? How about performing a query against the hive store? 

Thanks, 

Ranjan 





-- 
jqcof...@gmail.com 