Re: UDF Configure method not getting called

Jason Dere Wed, 26 Aug 2015 17:38:38 -0700

I don't think I understand your use case very well. Would you be able to give a 
bit more detail on what you are trying to do here, and how your UDF was 
accomplishing that? Maybe someone might have a suggestion.

________________________________
From: Rahul Sharma <kippy....@gmail.com>
Sent: Wednesday, August 26, 2015 9:39 AM
To: user@hive.apache.org
Subject: Re: UDF Configure method not getting called

Thanks again Jason. I tried hive.fetch.task.conversion=minimal/none and it ran 
a map-reduce task and the UDF ran fine. The problem with this approach is that 
the property needs to be changed in conf.server's hive-site and thus affects 
every query. A workaround would be to add "hive.fetch.task.conversion" to 
confwhitelist.append and then modify it at runtime. That should work for the 
time being I think.

However, I kind of feel that our use of the UDF for doing compute-intensive 
work on map nodes via HiveQL is diverging from how the devs see UDFs or how 
the community uses them. It would be interesting to know your thoughts on 
this.

Is there another way to cherry-pick columns from different Hive tables and 
then perform whatever our UDF does, but without using UDFs at all? I looked at 
transformation scripts a while back, but I don't think those would work for 
our use case either.

On Tue, Aug 25, 2015 at 5:05 PM, Jason Dere <jd...@hortonworks.com> wrote:

To get the configuration without configure(), you can try reading it in your 
UDF's initialize() method, though this may not be the best approach. Note that 
initialize() is called during query compilation, and also by each M/R task 
(followed at some point by configure()).

During initialize() you can call SessionState.get(). If it is not null, then 
this initialize() call is happening during query compilation, and you can 
use SessionState.get().getConf() to get the configuration. 
GenericUDFBaseNumeric has an example of this.

As for trying to force map/reduce jobs, you can try 
hive.fetch.task.conversion=minimal/none and 
hive.optimize.constant.propagation=false and see how it works.
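
As a rough illustration, the two settings above could be applied per session 
from the Hive CLI or Beeline before running the query. This is only a sketch: 
whether SET is permitted for these properties depends on how the server is 
configured, and picking none rather than minimal here is arbitrary.

```sql
-- Ask Hive not to convert the query into a local fetch task.
SET hive.fetch.task.conversion=none;
-- Turn off constant propagation so UDF calls are not folded at compile time.
SET hive.optimize.constant.propagation=false;
```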

________________________________
From: Rahul Sharma <kippy....@gmail.com>
Sent: Tuesday, August 25, 2015 2:48 PM
To: user@hive.apache.org
Subject: Re: UDF Configure method not getting called

Or alternatively, is there a way to pass configuration without using the 
configure method?

The configuration passed to the UDF is essentially a list of parameters that 
tells the UDF what it should morph into this time and what kind of work it 
should perform. If there is an all-encompassing way to do that, then I can 
modify the UDF to run irrespective of whether it is run locally or with a 
MapRed context.

On Tue, Aug 25, 2015 at 2:44 PM, Rahul Sharma <kippy....@gmail.com> wrote:
Oh thanks for the reply, Jason. That was my suspicion too.

The UDF in our case is not a function per se, in the pure mathematical sense 
of the word 'function', because it doesn't take in a value and give out 
another value. It has side effects that form the input for another MapReduce 
job. The point of doing it this way is that we wanted to make use of the 
parallelism afforded by running it as a map-reduce job via Hive, as the 
processing is fairly compute-intensive.

Is there a way to force map-reduce jobs? I think setting 
hive.fetch.task.conversion to minimal might help; is there anything else that 
can be done?

Thanks a ton.

On Tue, Aug 25, 2015 at 2:36 PM, Jason Dere <jd...@hortonworks.com> wrote:

There might be a few cases where a UDF is executed locally and not as part of 
a Map/Reduce job:

 - Hive might choose not to run an M/R task for your query (see 
hive.fetch.task.conversion)

 - If the UDF is deterministic and has deterministic inputs, Hive might decide 
to run the UDF once to get the value and use constant folding to replace calls 
of that UDF with the value from the one UDF call (see 
hive.optimize.constant.propagation)

Taking a look at the explain plan for your query might confirm this. In those 
cases the UDF would not run within an M/R task and configure() would not be 
called.
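
One way to check this is sketched below. The names my_udf, some_col and 
my_table are placeholders for illustration only and do not come from this 
thread.

```sql
-- Print the query plan instead of executing the query.
EXPLAIN
SELECT my_udf(some_col) FROM my_table;
```

If the resulting plan contains only a fetch stage and no map-reduce stage, the 
query is most likely being served without an M/R task.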

________________________________
From: Rahul Sharma <kippy....@gmail.com>
Sent: Tuesday, August 25, 2015 11:32 AM
To: user@hive.apache.org
Subject: UDF Configure method not getting called

Hi Guys,

We have a UDF which extends GenericUDF and does some configuration within the 
public void configure(MapredContext ctx) method.

The MapredContext passed to configure() gives access to the Hive configuration 
via JobConf, which contains custom attributes of the form xy.abc.something. 
Reading these values is required for the semantics of the UDF.

Everything works fine up to Hive 0.13. However, with Hive 0.14 (or 1.0) the 
configure method of the UDF is never called by the runtime, and hence the UDF 
cannot configure itself dynamically.

Is this the intended behavior? If so, what is the new way to read configuration 
of the Map Reduce Job within the UDF?

I would be grateful for any help.
