Thanks Douglas. I agree UDAFs seem a nicer fit. They would solve some of the
performance bottlenecks as well (when not using reflection). However, how
would one pass configuration to said UDAF? In particular, the type of
heuristic, or in some cases the algorithm to run, is specified at runtime,
with essential bits quite literally coming from the program running the
query in some cases.

We had been using GenericUDF's configure() method to pick things up from
MapReduce's JobConf from within the UDF and thus dynamically alter what the
UDF does. This is where we started running into problems: once Hive started
converting MapRed tasks to FetchTasks more aggressively, the config could no
longer be picked up from the JobConf.
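
For reference, this is roughly the pattern we had been relying on (a
minimal sketch; the property name below is made up):

    import org.apache.hadoop.hive.ql.exec.MapredContext;
    import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

    public class ColumnAnalysisUDF extends GenericUDF {
      private String heuristic = "default";

      // configure() is invoked once per M/R task before any evaluate()
      // calls; it is NOT invoked when the query runs as a FetchTask,
      // which is exactly the problem described above.
      @Override
      public void configure(MapredContext ctx) {
        if (ctx != null && ctx.getJobConf() != null) {
          heuristic = ctx.getJobConf().get("xy.abc.heuristic", heuristic);
        }
      }

      @Override
      public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
      }

      @Override
      public Object evaluate(DeferredObject[] args) throws HiveException {
        // run the heuristic-specific analysis on the cell (the side
        // effect), then return something small
        return heuristic;
      }

      @Override
      public String getDisplayString(String[] children) {
        return "column_analysis";
      }
    }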

What would be the best way to pass such config parameters? There may be a
large number of them, and some of the values themselves could be fairly
large.

On Fri, Aug 28, 2015 at 3:26 PM, Moore, Douglas <
douglas.mo...@thinkbiganalytics.com> wrote:

> Writing side files from a map-reduce job was more common a while ago.
> There are severe disadvantages and resulting complexities in doing so. One
> complexity is failure handling and retry; another is speculative execution
> running multiple attempts over the same split.
>
> You say you want to look at several values from a column; this sounds to
> me like what a UDAF (user-defined aggregate function) does.
>
> You can have these functions emit complex data types or blobs.
>
> If you do solve the side-file challenges, you can at least emit a URL or
> file id from your UDAF to make the optimizer happy.
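>
> For concreteness, a minimal sketch of the shape such a UDAF could take (a
> trivial "summarize the column" aggregate; every name in it is
> illustrative):
>
>     import org.apache.hadoop.hive.ql.metadata.HiveException;
>     import org.apache.hadoop.hive.ql.parse.SemanticException;
>     import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
>     import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
>     import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
>     import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
>     import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
>     import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
>     import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
>
>     public class ClassifyColumn extends AbstractGenericUDAFResolver {
>       @Override
>       public GenericUDAFEvaluator getEvaluator(TypeInfo[] info) throws SemanticException {
>         return new Evaluator();
>       }
>
>       public static class Evaluator extends GenericUDAFEvaluator {
>         private PrimitiveObjectInspector inputOI;   // raw column values
>         private PrimitiveObjectInspector partialOI; // partial aggregations
>
>         static class Buf extends AbstractAggregationBuffer {
>           long matches; // values matching some heuristic so far
>         }
>
>         @Override
>         public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
>           super.init(m, parameters);
>           if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
>             inputOI = (PrimitiveObjectInspector) parameters[0];
>           } else {
>             partialOI = (PrimitiveObjectInspector) parameters[0];
>           }
>           // both the partial and the final result are plain strings here
>           return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
>         }
>
>         @Override
>         public AggregationBuffer getNewAggregationBuffer() throws HiveException {
>           return new Buf();
>         }
>
>         @Override
>         public void reset(AggregationBuffer agg) throws HiveException {
>           ((Buf) agg).matches = 0;
>         }
>
>         @Override
>         public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
>           String v = PrimitiveObjectInspectorUtils.getString(parameters[0], inputOI);
>           if (v != null && !v.isEmpty()) {
>             ((Buf) agg).matches++; // stand-in for the real heuristic
>           }
>         }
>
>         @Override
>         public Object terminatePartial(AggregationBuffer agg) throws HiveException {
>           return Long.toString(((Buf) agg).matches);
>         }
>
>         @Override
>         public void merge(AggregationBuffer agg, Object partial) throws HiveException {
>           if (partial != null) {
>             ((Buf) agg).matches += Long.parseLong(
>                 PrimitiveObjectInspectorUtils.getString(partial, partialOI));
>           }
>         }
>
>         @Override
>         public Object terminate(AggregationBuffer agg) throws HiveException {
>           // this is where a URL or file id could be emitted instead
>           return "matches=" + ((Buf) agg).matches;
>         }
>       }
>     }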
>
>
> Your life will be much happier if you can fit your analytics within one of
> the standard extension points in Hive, Pig, or Spark, which are based on
> functional programming.
>
> Best,
> Douglas
> Sent from my iPhone
>
> On Aug 28, 2015, at 5:56 PM, Rahul Sharma <kippy....@x.com> wrote:
>
> So the use case is like this:
> We want to let the user point us at any number of columns in a table and
> then run analysis on the values within each column, irrespective of the
> column's type (simple, complex datatypes, etc.). The analysis can be
> thought of as looking at all the values, or a subset of them, and trying
> to predict what kinds of values are within a column, classifying the
> columns into classes based on certain heuristics. The classification
> process is compute-intensive, so we were trying to take advantage of the
> fact that Hive launches map-reduce jobs to return the data: why not push
> the compute out to different nodes, instead of iterating over the result
> set and running the analysis on a single machine?
>
> Right now, the UDF gets called for each column that the user wants us to
> look at. There are a number of input parameters (in the double digits), so
> it doesn't make sense to pass them in the parameter list of the UDF. These
> are the things that are picked up from the JobConf (which requires a
> MapRedTask), and they modify the behavior of the UDF accordingly.
>
> I hope this makes a little bit of sense. Basically, since the algorithms
> are compute-intensive and we saw that Hive launches MR jobs to return the
> data, we figured: why not use UDFs to push these computations out to
> different nodes? The output of the UDF for each cell is not used; instead,
> the analysis is made available in a different place. So in some sense we
> use the UDF purely for its side effects.
>
> Thanks, again for all your help.
>
> On Wed, Aug 26, 2015 at 5:37 PM, Jason Dere <jd...@hortonworks.com> wrote:
>
>> I don't think I understand your use case very well. Would you be able to
>> give a bit more detail on what you are trying to do here, and how your UDF
>> was accomplishing that? Maybe someone will have a suggestion.
>>
>>
>>
>> ------------------------------
>> *From:* Rahul Sharma <kippy....@gmail.com>
>> *Sent:* Wednesday, August 26, 2015 9:39 AM
>>
>> *To:* user@hive.apache.org
>> *Subject:* Re: UDF Configure method not getting called
>>
>> Thanks again, Jason. I tried hive.fetch.task.conversion=minimal/none; it
>> ran a map-reduce task and the UDF worked fine. The problem with this
>> approach is that the property needs to be changed in conf.server's
>> hive-site.xml and thus affects every query. A workaround would be to add
>> "hive.fetch.task.conversion" to confwhitelist.append and then modify it at
>> runtime. That should work for the time being, I think.
>>
>> However, I get the feeling that our use of UDFs to do compute-intensive
>> work on map nodes via HiveQL is diverging from how the devs see UDFs, and
>> from how the community uses them. It would be interesting to know your
>> thoughts on this.
>>
>> Is there another way to cherry-pick columns of different Hive tables and
>> then perform whatever our UDF does, but without using UDFs at all? I
>> looked at transformation scripts a while back, but I don't think those
>> would work for our use case either.
>>
>> On Tue, Aug 25, 2015 at 5:05 PM, Jason Dere <jd...@hortonworks.com>
>> wrote:
>>
>>> For getting the configuration without configure(), this may not be the
>>> best thing to do, but you can try during your UDF's initialize() method.
>>> Note that initialize() is called during query compilation, and also by each
>>> M/R task (followed at some point by configure()).
>>>
>>> During initialize() you can call SessionState.get(); if it is not null,
>>> then this initialize() call is happening during query compilation, and you
>>> can then use SessionState.get().getConf() to get the configuration.
>>> GenericUDFBaseNumeric has an example of this.
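>>>
>>> For example, roughly (a sketch; the property name is made up):
>>>
>>>     // in a GenericUDF subclass, with a String field 'heuristic';
>>>     // SessionState is org.apache.hadoop.hive.ql.session.SessionState
>>>     @Override
>>>     public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
>>>       SessionState ss = SessionState.get();
>>>       if (ss != null) {
>>>         // non-null here means initialize() is running during query compilation
>>>         heuristic = ss.getConf().get("xy.abc.heuristic", heuristic);
>>>       }
>>>       // ... usual argument checking and ObjectInspector setup ...
>>>       return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
>>>     }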
>>>
>>>
>>>
>>> As for trying to force map/reduce jobs, you can try
>>> hive.fetch.task.conversion=minimal/none and
>>> hive.optimize.constant.propagation=false and see how it works.
>>>
>>>
>>> ------------------------------
>>> *From:* Rahul Sharma <kippy....@gmail.com>
>>> *Sent:* Tuesday, August 25, 2015 2:48 PM
>>> *To:* user@hive.apache.org
>>> *Subject:* Re: UDF Configure method not getting called
>>>
>>> Alternatively, is there a way to pass configuration without using the
>>> configure() method?
>>>
>>> The configuration for the UDF is essentially a list of parameters that
>>> tells the UDF what it should morph into this time and what kind of work it
>>> should perform. If there is an all-encompassing way to do that, then I can
>>> modify the UDF to run irrespective of whether it runs locally or within a
>>> MapRed context.
>>>
>>> On Tue, Aug 25, 2015 at 2:44 PM, Rahul Sharma <kippy....@gmail.com>
>>> wrote:
>>>
>>>> Oh, thanks for the reply, Jason. That was my suspicion too.
>>>>
>>>> The UDF in our case is not a function per se, in the pure mathematical
>>>> sense of the word 'function': it doesn't take in a value and give out
>>>> another value. It has side effects, which form the input for another
>>>> MapReduce job. The point of doing it this way is that we wanted to make
>>>> use of the parallelism afforded by running it as a map-reduce job via
>>>> Hive, as the processing is fairly compute-intensive.
>>>>
>>>> Is there a way to force map-reduce jobs? I think setting
>>>> hive.fetch.task.conversion to minimal might help; is there anything else
>>>> that can be done?
>>>>
>>>> Thanks a ton.
>>>>
>>>> On Tue, Aug 25, 2015 at 2:36 PM, Jason Dere <jd...@hortonworks.com>
>>>> wrote:
>>>>
>>>>> There might be a few cases where a UDF is executed locally and not as
>>>>> part of a Map/Reduce job:
>>>>>
>>>>>  - Hive might choose not to run a M/R task for your query (see
>>>>> hive.fetch.task.conversion)
>>>>>
>>>>>  - If the UDF is deterministic and has deterministic inputs, Hive
>>>>> might decide to run the UDF once to get the value and use constant folding
>>>>> to replace calls of that UDF with the value from the one UDF call (see
>>>>> hive.optimize.constant.propagation)
>>>>>
>>>>>
>>>>> Taking a look at the explain plan for your query should confirm this.
>>>>> In those cases the UDF would not run within a M/R task and configure()
>>>>> would not be called.
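>>>>>
>>>>> Along the same lines: annotating the UDF as non-deterministic keeps
>>>>> constant folding away from it, since only deterministic UDFs are
>>>>> folded. For example (class name is illustrative):
>>>>>
>>>>>     import org.apache.hadoop.hive.ql.udf.UDFType;
>>>>>
>>>>>     // tells the optimizer the UDF may return different values for
>>>>>     // the same input, so its calls are never replaced by a constant
>>>>>     @UDFType(deterministic = false)
>>>>>     public class MyUDF extends GenericUDF {
>>>>>       // ...
>>>>>     }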
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------
>>>>> *From:* Rahul Sharma <kippy....@gmail.com>
>>>>> *Sent:* Tuesday, August 25, 2015 11:32 AM
>>>>> *To:* user@hive.apache.org
>>>>> *Subject:* UDF Configure method not getting called
>>>>>
>>>>> Hi Guys,
>>>>>
>>>>> We have a UDF which extends GenericUDF and does some configuration
>>>>> within the public void configure(MapredContext ctx) method.
>>>>>
>>>>> MapredContext in the configure() method gives access to the Hive
>>>>> configuration via JobConf, which contains custom attributes of the form
>>>>> xy.abc.something. Reading these values is required for the semantics of
>>>>> the UDF.
>>>>>
>>>>> Everything works fine up to Hive 0.13; however, with Hive 0.14 (or
>>>>> 1.0) the configure() method of the UDF is never called by the runtime,
>>>>> and hence the UDF cannot configure itself dynamically.
>>>>>
>>>>> Is this the intended behavior? If so, what is the new way to read the
>>>>> configuration of the MapReduce job within the UDF?
>>>>>
>>>>> I would be grateful for any help.
>>>>>
>>>>
>>>>
>>>
>>
>
