I'd like to bring this back up for consideration. I'm open to adding types for all the parameters, but it does seem onerous, and in the case of Python we don't do that. Do you feel strongly about adding them?
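To make it concrete, here is roughly the shape I have in mind, with only the return type declared. This is just a sketch: the name define is a placeholder, and it assumes the helper lives somewhere that is allowed to invoke the UserDefinedFunction constructor (e.g. inside the sql package, since the constructor is not public):

    import org.apache.spark.sql.UserDefinedFunction
    import org.apache.spark.sql.api.java.UDF1
    import org.apache.spark.sql.types.DataType

    // Sketch only: adapt a Java UDF1 into a Scala function and keep just
    // the declared Catalyst return type, with no parameter TypeTags,
    // mirroring what we do in Python.
    def define(f: UDF1[_, _], returnType: DataType): UserDefinedFunction =
      UserDefinedFunction(
        (a: Any) => f.asInstanceOf[UDF1[Any, Any]].call(a),
        returnType)

A caller would then write something like define(new UDF1[Integer, Integer] { def call(x: Integer): Integer = x + 5 }, DataTypes.IntegerType) and apply the result to a Column directly, with no registration step.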
On Sat, May 30, 2015 at 8:04 PM Reynold Xin <r...@databricks.com> wrote:

> We added all the TypeTags for arguments but haven't gotten around to
> using them yet. I think it'd make sense to have them and do the auto
> cast, but we can have rules in analysis to forbid certain casts (e.g.
> don't auto cast double to int).
>
> On Sat, May 30, 2015 at 7:12 AM, Justin Uang <justin.u...@gmail.com> wrote:
>
>> The idea of asking for both the argument and return class is
>> interesting. I don't think we do that for the Scala APIs currently,
>> right? In functions.scala, we only use the TypeTag for RT:
>>
>>   def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction = {
>>     UserDefinedFunction(f, ScalaReflection.schemaFor(typeTag[RT]).dataType)
>>   }
>>
>> There would only be a small subset of conversions that would make sense
>> implicitly (e.g. int to double, the typical conversions in programming
>> languages), but something like (double => int) might be dangerous and
>> (timestamp => double) wouldn't really make sense. Perhaps it's better
>> to be explicit about casts?
>>
>> If we don't care about declaring the types of the arguments, perhaps we
>> can have all of the Java UDF interfaces (UDF1, UDF2, etc.) extend a
>> generic interface called UDF, then have
>>
>>   def define(f: UDF, returnType: Class[_])
>>
>> to simplify the APIs.
>>
>> On Sat, May 30, 2015 at 3:43 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> I think you are right that there is no way to call a Java UDF without
>>> registration right now. Adding another 20 methods to functions would
>>> be scary. Maybe the best way is to have a companion object for
>>> UserDefinedFunction, and define the UDFs there?
>>>
>>> e.g.
>>>
>>>   object UserDefinedFunction {
>>>
>>>     def define(f: org.apache.spark.api.java.function.Function0,
>>>         returnType: Class[_]): UserDefinedFunction
>>>
>>>     // ... define a few more - maybe up to 5 arguments?
>>>   }
>>>
>>> Ideally, we should ask for both the argument class and the return
>>> class, so we can do the proper type conversion (e.g. if the UDF
>>> expects a string, but the input expression is an int, Catalyst can
>>> automatically add a cast). However, we haven't implemented that in
>>> UserDefinedFunction yet.
>>>
>>> On Fri, May 29, 2015 at 12:54 PM, Justin Uang <justin.u...@gmail.com> wrote:
>>>
>>>> I would like to define a UDF in Java via a closure and then use it
>>>> without registration. In Scala, I believe there are two ways to do
>>>> this:
>>>>
>>>>   val myUdf = functions.udf((x: Int) => x + 5)
>>>>   myDf.select(myUdf(myDf("age")))
>>>>
>>>> or
>>>>
>>>>   myDf.select(functions.callUDF((x: Int) => x + 5, DataTypes.IntegerType, myDf("age")))
>>>>
>>>> However, neither of these works for a Java UDF. The first one
>>>> requires TypeTags. For the second one, I was able to hack it by
>>>> creating a Scala AbstractFunction1 and using callUDF, which requires
>>>> declaring the Catalyst DataType instead of using TypeTags. However,
>>>> it was still nasty because I had to return a Scala map instead of a
>>>> Java map.
>>>>
>>>> Is there first-class support for creating
>>>> an org.apache.spark.sql.UserDefinedFunction that works with
>>>> the org.apache.spark.sql.api.java.UDF1<T1, R>? I'm fine with having
>>>> to declare the Catalyst type when creating it.
>>>>
>>>> If it doesn't exist, I would be happy to work on it =)
>>>>
>>>> Justin
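P.S. To make the workaround from my first message concrete, it looked roughly like this (a sketch; it leans on the existing callUDF overloads in functions that take a Scala function plus the Catalyst DataType):

    import org.apache.spark.sql.functions
    import org.apache.spark.sql.types.DataTypes

    // Sketch of the hack: build the Function1 by hand and declare the
    // Catalyst return type explicitly instead of deriving it via TypeTags.
    val addFive = new scala.runtime.AbstractFunction1[Int, Int] with Serializable {
      def apply(age: Int): Int = age + 5
    }
    myDf.select(functions.callUDF(addFive, DataTypes.IntegerType, myDf("age")))

It works, but as noted above, complex return values still have to use Scala types (e.g. a Scala Map rather than a java.util.Map), which is part of why first-class UDF1 support would help.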