Hadoop tasks use a single thread, so there won't be multiple threads accessing the UDF.
However, there's a flip side to thread safety if your UDF maintains state: is it receiving all the data it should, or is the data being sharded over multiple processes in a way that defeats the UDF?

My favorite example is a moving-average calculator (like you might use in finance); most full-featured SQL dialects have window functions for this purpose. Suppose I'm averaging over the last 50 closing prices for a given financial instrument. To do this, I cache the last 50 prices I've seen in the UDF as each record is passed to me (keeping the data for each instrument properly separated). If some records go to one mapper task and other records go to a different mapper task, then at least some of my averages will be wrong due to missing data.

dean

On Sun, Mar 10, 2013 at 10:12 PM, Shaun Clowes <sclo...@atlassian.com> wrote:

> Hi All,
>
> Could anyone describe what the required thread safety for a UDF is? I
> understand that one is instantiated for each use of the function in an
> expression, but can there be multiple threads executing the methods of a
> single UDF object at once?
>
> Thanks,
> Shaun

--
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330
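P.S. To make the hazard concrete, here is a minimal sketch of the stateful moving-average logic described above. It's plain Java, not a complete Hive or Pig UDF, and the class and method names are illustrative. The point is that the cache lives in one JVM: if records for the same instrument are split across mapper tasks, each task's cache sees only a subset, and the averages come out wrong.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative stateful moving-average helper (not a real UDF class).
public class MovingAverage {
    private final int window;
    // Per-instrument cache of the most recent closing prices,
    // held only in this process's memory.
    private final Map<String, Deque<Double>> cache = new HashMap<>();

    public MovingAverage(int window) {
        this.window = window;
    }

    // Called once per record; returns the average of the last `window`
    // prices seen for this instrument *in this process*. A second mapper
    // task would have its own empty cache and compute a different answer.
    public double update(String instrument, double closingPrice) {
        Deque<Double> prices =
            cache.computeIfAbsent(instrument, k -> new ArrayDeque<>());
        prices.addLast(closingPrice);
        if (prices.size() > window) {
            prices.removeFirst(); // evict the oldest price
        }
        double sum = 0.0;
        for (double p : prices) {
            sum += p;
        }
        return sum / prices.size();
    }
}
```

The fix in practice is to force all records for a given instrument through the same task, e.g. by grouping or partitioning on the instrument key before the UDF runs.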