Re: Lifecycle and Configuration of a hive UDF

Justin Coffey Tue, 24 Apr 2012 00:45:06 -0700

Hi Mark,
     Looks great to me!  Thanks for adding it.

-Justin


On Tue, Apr 24, 2012 at 5:55 AM, Mark Grover <mgro...@oanda.com> wrote:

> Added a tiny blurb here:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-UDFinternals
> Comments/suggestions welcome!
>
> Thanks for bringing it up, Justin.
>
> Mark
>
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com www: fxtrade.com
> e: mgro...@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.
>
>
> ----- Original Message -----
> From: "Justin Coffey" <jqcof...@gmail.com>
> To: user@hive.apache.org
> Sent: Monday, April 23, 2012 5:19:15 AM
> Subject: Re: Lifecycle and Configuration of a hive UDF
>
> Hello All,
> Thank you much for the responses. I can confirm that the lag function
> implementation works in my case:
> create temporary function lag as 'com.example.hive.udf.Lag';
> select session_id, hit_datetime_gmt, lag(hit_datetime_gmt, session_id)
> from (
>   select session_id, hit_datetime_gmt
>   from omni2
>   where visit_day = '2012-01-12' and session_id is not null
>   distribute by session_id
>   sort by session_id, hit_datetime_gmt
> ) X
> distribute by session_id
> limit 1000
>
>
> For the rank it looks like:
>
>
>
> create temporary function rank as 'com.example.hadoop.hive.udf.Rank';
> select user_id, time, rank(user_id) as rank
> from (
> select user_id, time
> from log
> where day = '2012-04-01' and hour = 7
> distribute by user_id
> sort by user_id, time
> ) X
> distribute by user_id
> limit 2000
>
>
> As mentioned by others this appears to force the UDF to be executed Reduce
> side. At least, I can't figure out how it works otherwise because only one
> MapReduce job is created (with multiple reducers).
>
>
> As a note to the documentation maintainers, it might be nice to have the
> procedural workflow of UDFs/UDTFs/UDAFs documented in the wiki. I know it is
> logical that an aggregation function happens reducer-side, but I think
> there is sufficient complexity in an SQL-to-MR translator that it is worth
> the effort to explicitly document it and the other functions (or please
> just bludgeon me over the head if I happened to miss it).
>
>
> Not to be pedantic, but for example, the UDAF case study doc does not even
> mention the word "reduce":
> https://cwiki.apache.org/Hive/genericudafcasestudy.html
>
>
> Thanks again for all the pointers!
>
>
> -Justin
>
>
> On Fri, Apr 20, 2012 at 8:18 PM, Alex Kozlov < ale...@cloudera.com >
> wrote:
>
>
> You might also look at
> http://www.quora.com/Hive-computing/How-are-SQL-type-analytic-and-windowing-functions-accomplished-in-Hadoop-Hive
> for a way to utilize secondary sort for analytic windowing functions.
>
> RANK() OVER(...) will require grouping and sorting. While it can be done
> in the mapper or reducer stage, it is better to utilize Hadoop's shuffle
> properties to accomplish both of them. The disadvantage may be that you can
> compute only one RANK() in a MapReduce job.
>
> --
>
> Alex K
>
>
>
>
> On Fri, Apr 20, 2012 at 10:54 AM, Philip Tromans <
> philip.j.trom...@gmail.com > wrote:
>
>
> Have a read of the thread "Lag function in Hive", linked from:
>
> http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/thread
>
> There's an example of how to force a function to run reduce-side. I've
> written a UDF which replicates RANK () OVER (...), but it requires the
> syntactic sugar given in the thread. I'd like to make changes to the
> hive query planner at some point, so that you can annotate a UDF with
> a "run on reducer" hint, and after that I'd happily open source
> everything. If you want more details of how to implement your own
> partitionedRowNumber() UDF then I'd be happy to elaborate.
>
> Cheers,
>
> Phil.
>
>
>
> On 20 April 2012 18:35, Mark Grover < mgro...@oanda.com > wrote:
> > Hi Ranjan and Justin,
> >
> > As per my understanding, the scope of a UDF is only one row of data at a
> > time. Therefore, it can be done all map side without the need for the
> > reducer being involved. Now, depending on where you are storing the result
> > of the query, your query may have reducers that do something.
> >
> > A simple query like the one Ranjan mentioned
> > select MyUDF(field1,field2) from table;
> >
> > should have the UDF's evaluate() method called in the map phase.
> >
> >
> > Now to Justin's question,
> > the rank function
> > ( http://msdn.microsoft.com/en-us/library/ms176102%28v=sql.110%29.aspx )
> > seems to have a syntax like:
> > RANK ( ) OVER ( [ partition_by_clause ] order_by_clause )
> >
> > The rank function works on a collection of rows (distributed by some
> > column, the same one you would use in your partition_by_clause in MS SQL).
> > You can accomplish that using a UDAF (read more about them at
> > https://cwiki.apache.org/Hive/genericudafcasestudy.html ) or by writing a
> > custom reducer (read about that at
> > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform ).
> >
> > I don't think rank can be done using a UDF.
> >
> > Good luck!
> >
> > Mark
> >
> >
> > ----- Original Message -----
> > From: "Justin Coffey" < jqcof...@gmail.com >
> > To: user@hive.apache.org
> > Sent: Thursday, April 19, 2012 10:29:11 AM
> > Subject: Re: Lifecycle and Configuration of a hive UDF
> >
> > Hello All,
> > I second this question. I have an MS SQL "rank" function which I would
> > like to run. The results it gives appear to suggest it is executed mapper
> > side as opposed to reducer side, even when run with "cluster by"
> > constraints.
> >
> >
> > -Justin
> >
> >
> > On Thu, Apr 19, 2012 at 1:21 AM, Ranjan Bagchi < ran...@powerreviews.com> 
> > wrote:
> >
> >
> > Hi,
> >
> > What's the lifecycle of a hive udf? If I call
> >
> > select MyUDF(field1,field2) from table;
> >
> > Then is MyUDF instantiated once per mapper, and within each mapper
> > evaluate(field1, field2) called for each row? I hope this is the
> > case, but I can't find anything about this in the documentation.
> >
> > So I'd like to have some run-time configuration of my UDF: I'm curious
> > how people do this. Is there a way I can send it a value or have it access
> > a file, etc? How about performing a query against the hive store?
> >
> > Thanks,
> >
> > Ranjan



-- 
jqcof...@gmail.com
-----
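
For readers following the rank query quoted above, here is a minimal sketch
of the kind of stateful function it appears to rely on. RankSketch is a
hypothetical stand-in, not the actual com.example.hadoop.hive.udf.Rank, whose
source is not shown in this thread. It assumes rows reach each task already
grouped and ordered, as the "distribute by ... sort by ..." subquery arranges,
and it numbers rows within a key rather than giving tied rows the same rank.
A real Hive UDF would extend org.apache.hadoop.hive.ql.exec.UDF; that
dependency is left out so the sketch compiles with the JDK alone.

```java
import java.util.Objects;

// Hypothetical stand-in for a class like com.example.hadoop.hive.udf.Rank.
// Assumption: a real version would extend org.apache.hadoop.hive.ql.exec.UDF.
class RankSketch {
    private String lastKey;   // partition key seen on the previous row
    private int counter;      // position of the current row within its key
    private boolean started;  // false until the first row has been seen

    // Called once per row, in the order the rows reach the task.
    public int evaluate(String key) {
        if (!started || !Objects.equals(key, lastKey)) {
            counter = 0;      // new partition key: restart the numbering
            lastKey = key;
            started = true;
        }
        return ++counter;
    }

    public static void main(String[] args) {
        RankSketch rank = new RankSketch();
        // Rows as they might arrive after
        // "distribute by user_id sort by user_id, time".
        String[] userIds = {"u1", "u1", "u1", "u2", "u2"};
        for (String userId : userIds) {
            System.out.println(userId + "\t" + rank.evaluate(userId));
        }
    }
}
```

Because the counter lives in the object, the numbering is only meaningful
when all rows for a key pass through the same instance in sorted order, which
is why the thread discusses forcing the function to run reduce-side.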
