Hi Mark, Looks great to me! Thanks for adding it. -Justin
On Tue, Apr 24, 2012 at 5:55 AM, Mark Grover <mgro...@oanda.com> wrote:
> Added a tiny blurb here:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-UDFinternals
> Comments/suggestions welcome!
>
> Thanks for bringing it up, Justin.
>
> Mark
>
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com www: fxtrade.com
> e: mgro...@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>
> ----- Original Message -----
> From: "Justin Coffey" <jqcof...@gmail.com>
> To: user@hive.apache.org
> Sent: Monday, April 23, 2012 5:19:15 AM
> Subject: Re: Lifecycle and Configuration of a hive UDF
>
> Hello All,
> Thank you much for the responses. I can confirm that the lag function
> implementation works in my case:
>
> create temporary function lag as 'com.example.hive.udf.Lag';
> select session_id,hit_datetime_gmt,lag(hit_datetime_gmt,session_id)
> from (select session_id,hit_datetime_gmt from omni2 where
>       visit_day='2012-01-12' and session_id is not null
>       distribute by session_id
>       sort by session_id,hit_datetime_gmt ) X
> distribute by session_id limit 1000
>
>
> For the rank it looks like:
>
> create temporary function rank as 'com.example.hadoop.hive.udf.Rank';
> select user_id, time, rank(user_id) as rank
> from (
>   select user_id, time
>   from log
>   where day = '2012-04-01' and hour = 7
>   distribute by user_id
>   sort by user_id, time
> ) X
> distribute by user_id
> limit 2000
>
>
> As mentioned by others this appears to force the UDF to be executed Reduce
> side. At least, I can't figure out how it works otherwise because only one
> MapReduce job is created (with multiple reducers).
>
>
> As a note to the documentation maintainers, it might be nice to have the
> procedural workflow of UDF/UDTF/UDAF's documented in the wiki.
> I know it is logical that an aggregation function happens reducer side,
> but I think there is sufficient complexity in an SQL to MR translator
> that it is worth the effort to explicitly document it and the other
> functions (or please just bludgeon me over the head if I happened to
> miss it).
>
>
> Not to be pedantic, but for example, the UDAF case study doc does not
> even mention the word "reduce":
> https://cwiki.apache.org/Hive/genericudafcasestudy.html
>
>
> Thanks again for all the pointers!
>
>
> -Justin
>
>
> On Fri, Apr 20, 2012 at 8:18 PM, Alex Kozlov <ale...@cloudera.com> wrote:
>
> You might also look at
> http://www.quora.com/Hive-computing/How-are-SQL-type-analytic-and-windowing-functions-accomplished-in-Hadoop-Hive
> for a way to utilize secondary sort for analytic windowing functions.
>
> RANK() OVER(...) will require grouping and sorting. While it can be done
> in the mapper or reducer stage, it is better to utilize Hadoop's shuffle
> properties to accomplish both of them. The disadvantage may be that you
> can compute only one RANK() in a MapReduce job.
>
> --
> Alex K
>
>
> On Fri, Apr 20, 2012 at 10:54 AM, Philip Tromans <philip.j.trom...@gmail.com> wrote:
>
> Have a read of the thread "Lag function in Hive", linked from:
>
> http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/thread
>
> There's an example of how to force a function to run reduce-side. I've
> written a UDF which replicates RANK () OVER (...), but it requires the
> syntactic sugar given in the thread. I'd like to make changes to the
> hive query planner at some point, so that you can annotate a UDF with
> a "run on reducer" hint, and after that I'd happily open source
> everything. If you want more details of how to implement your own
> partitionedRowNumber() UDF then I'd be happy to elaborate.
>
> Cheers,
>
> Phil.
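
A rough sketch of why the "distribute by ... sort by ..." subquery in the
lag and rank queries above lets a stateful UDF produce a per-key rank. This
is plain Python that only imitates the data flow, not Hive code: the Rank
class, run_query() and the crc32-based partitioning are all made up for
illustration, and the real com.example.hadoop.hive.udf.Rank is not shown
anywhere in this thread. The only Hive-specific fact assumed is that a UDF
class exposes its per-row method as evaluate().

```python
import zlib


class Rank:
    """Hypothetical stand-in for a stateful rank UDF.

    It remembers the previous key and counts consecutive rows that share it,
    which only gives a meaningful rank if rows arrive grouped and sorted.
    """

    def __init__(self):
        self.last_key = None
        self.counter = 0

    def evaluate(self, key):
        # Reset the counter whenever the key changes.
        if key != self.last_key:
            self.last_key = key
            self.counter = 0
        self.counter += 1
        return self.counter


def run_query(rows, num_reducers=2):
    """Imitate: distribute by user_id, sort by user_id, time, rank(user_id)."""
    # "distribute by user_id": all rows for one user go to the same reducer.
    partitions = [[] for _ in range(num_reducers)]
    for user_id, time in rows:
        idx = zlib.crc32(user_id.encode("utf-8")) % num_reducers
        partitions[idx].append((user_id, time))

    output = []
    for part in partitions:
        rank = Rank()  # assumption: each reducer task gets its own instance
        # "sort by user_id, time": ordering within each reducer only.
        for user_id, time in sorted(part):
            output.append((user_id, time, rank.evaluate(user_id)))
    return output
```

Whichever reducer a user lands on, that user's rows are contiguous and in
time order when the counter sees them, so the rank restarts at 1 for each
user. Without the distribute/sort step the same counter would see keys
interleaved and reset at arbitrary points.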
>
>
> On 20 April 2012 18:35, Mark Grover <mgro...@oanda.com> wrote:
> > Hi Ranjan and Justin,
> >
> > As per my understanding, the scope of a UDF is only one row of data at
> > a time. Therefore, it can be done all map side without the need for the
> > reducer being involved. Now, depending on where you are storing the
> > result of the query, your query may have reducers that do something.
> >
> > A simple query like Ranjan mentioned
> >
> > select MyUDF(field1,field2) from table;
> >
> > should have the UDF execute() being called in the map phase.
> >
> >
> > Now to Justin's question,
> > rank function
> > ( http://msdn.microsoft.com/en-us/library/ms176102%28v=sql.110%29.aspx )
> > seems to have a syntax like:
> >
> > RANK ( ) OVER ( [ partition_by_clause ] order_by_clause )
> >
> > Rank function works on a collection of rows (distributed by some
> > column - the same one you would use in your partition_by_clause in
> > MS SQL).
> > You can accomplish that using UDAF (read more about them at
> > https://cwiki.apache.org/Hive/genericudafcasestudy.html ) or by writing
> > a custom reducer (read about that at
> > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform).
> >
> > I don't think rank can be done using a UDF.
> >
> > Good luck!
> >
> > Mark
> >
> > Mark Grover, Business Intelligence Analyst
> > OANDA Corporation
> >
> > www: oanda.com www: fxtrade.com
> >
> > "Best Trading Platform" - World Finance's Forex Awards 2009.
> > "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
> >
> >
> > ----- Original Message -----
> > From: "Justin Coffey" <jqcof...@gmail.com>
> > To: user@hive.apache.org
> > Sent: Thursday, April 19, 2012 10:29:11 AM
> > Subject: Re: Lifecycle and Configuration of a hive UDF
> >
> > Hello All,
> > I second this question. I have a MS SQL "rank" function which I would
> > like to run; the results it gives appear to suggest it is executed
> > Mapper side as opposed to reducer side, even when run with "cluster by"
> > constraints.
> >
> >
> > -Justin
> >
> >
> > On Thu, Apr 19, 2012 at 1:21 AM, Ranjan Bagchi <ran...@powerreviews.com> wrote:
> >
> > Hi,
> >
> > What's the lifecycle of a hive udf? If I call
> >
> > select MyUDF(field1,field2) from table;
> >
> > Then MyUDF is instantiated once per mapper, and within each mapper
> > execute(field1, field2) is called for each reducer? I hope this is the
> > case, but I can't find anything about this in the documentation.
> >
> > So I'd like to have some run-time configuration of my UDF: I'm curious
> > how people do this. Is there a way I can send it a value or have it
> > access a file, etc? How about performing a query against the hive store?
> >
> > Thanks,
> >
> > Ranjan
> >
> >
> > --
> > jqcof...@gmail.com
> > -----
>
>
> --
> jqcof...@gmail.com
> -----

--
jqcof...@gmail.com
-----
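
The "custom reducer" route Mark mentions (LanguageManual+Transform) could
look roughly like the script below. It is a sketch under assumptions, not
something posted in this thread: it assumes the script receives
tab-separated user_id and time columns on stdin, already distributed and
sorted by user_id, and writes tab-separated rows back on stdout. The names
rank_rows and main are hypothetical, as is any file name you save it under.

```python
import sys


def rank_rows(lines):
    """Yield 'user_id<TAB>time<TAB>rank' for input lines sorted by user_id."""
    last_key = None
    rank = 0
    for line in lines:
        user_id, time = line.rstrip("\n").split("\t")
        # Restart the numbering whenever a new user_id begins.
        if user_id != last_key:
            last_key = user_id
            rank = 0
        rank += 1
        yield "\t".join([user_id, time, str(rank)])


def main():
    # Stream rows from stdin to stdout, one output row per input row.
    for out in rank_rows(sys.stdin):
        print(out)
```

To use it as a streaming script, add a call to main() at the bottom, make
the file available to the job, and invoke it from a TRANSFORM clause over
the same "distribute by user_id sort by user_id, time" subquery used in the
rank query earlier in the thread. The ordering requirement is the same as
for the stateful UDF: the counter is only correct if each user's rows
arrive together and in order.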