Having a way to register UDFs that are not using Hive APIs would be great!

On Fri, Dec 14, 2018 at 1:30 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> Hi everyone,
>
> I’ve been looking into improving how users of our Spark platform register
> and use UDFs, and I’d like to discuss a few ideas for making this easier.
>
> The motivation for this is the use case of defining a UDF from SparkSQL or
> PySpark. We want to make it easy to write JVM UDFs and use them from both
> SQL and Python. Python UDFs work great in most cases, but we occasionally
> don’t want to pay the cost of shipping data to Python and processing it
> there, so we want to make it easy to register UDFs that will run in the
> JVM.
>
> There is already SQL syntax to create a function from a JVM class
> (https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html)
> that would work, but this option requires using the Hive UDF API instead
> of Spark’s simpler Scala API. It also requires argument translation and
> doesn’t support codegen. Beyond the problems with the API and performance,
> it is annoying to require registering every function individually with a
> CREATE FUNCTION statement.
>
> The alternative that I’d like to propose is to add a way to register a
> named group of functions using the proposed catalog plugin API.
>
> For anyone unfamiliar with the proposed catalog plugins, the basic idea is
> to load and configure plugins using a simple property-based scheme. Those
> plugins expose functionality through mix-in interfaces, like TableCatalog
> to create/drop/load/alter tables. Another interface could be UDFCatalog,
> which can load UDFs:
>
>   interface UDFCatalog extends CatalogPlugin {
>     UserDefinedFunction loadUDF(String name)
>   }
>
> To use this, I would create a UDFCatalog class that returns my Scala
> functions as UDFs. To look up functions, we would use both the catalog
> name and the function name.
>
> This would allow my users to write Scala UDF instances, package them using
> a UDFCatalog class (provided by me), and easily use them in Spark with a
> few configuration options to add the catalog in their environment.
>
> This would also allow me to expose UDF libraries easily in my
> configuration, like brickhouse
> (https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Leveraging-Brickhouse-in-Spark2-pivot/m-p/59943),
> without users needing to ensure the Jar is loaded and register individual
> functions.
>
> Any thoughts on this high-level approach? I know that this ignores things
> like creating and storing functions in a FunctionCatalog, and we’d have to
> solve challenges with function naming (whether there is a db component).
> Right now I’d like to think through the overall idea and not get too
> focused on those details.
>
> Thanks,
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
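To make the proposal a bit more concrete, here is a rough sketch of what a UDFCatalog
implementation could look like. Everything below is based only on the interface snippet
quoted above: UDFCatalog, CatalogPlugin, and loadUDF are the proposed API (re-declared
here as plain traits so the sketch stands on its own), and the class and function names
are made up for illustration, not anything that exists in Spark today.

    import org.apache.spark.sql.expressions.UserDefinedFunction
    import org.apache.spark.sql.functions.udf

    // The proposed interfaces, re-declared as traits so this sketch is
    // self-contained; they are not part of Spark today.
    trait CatalogPlugin
    trait UDFCatalog extends CatalogPlugin {
      def loadUDF(name: String): UserDefinedFunction
    }

    // Hypothetical catalog that exposes a named group of Scala functions as UDFs.
    class MyUDFCatalog extends UDFCatalog {
      // Plain Scala functions wrapped with Spark's udf() helper.
      private val udfs: Map[String, UserDefinedFunction] = Map(
        "trim_lower" -> udf((s: String) => if (s == null) null else s.trim.toLowerCase),
        "bucket"     -> udf((id: Long, n: Int) => (id % n).toInt)
      )

      // Look up a UDF by name; the catalog name would come from how the plugin
      // is registered in the configuration.
      override def loadUDF(name: String): UserDefinedFunction =
        udfs.getOrElse(name, throw new NoSuchElementException(s"unknown function: $name"))
    }

Users would then add the catalog through a configuration property (something like
spark.sql.catalog.udfs=com.example.MyUDFCatalog, assuming the property-based scheme from
the catalog plugin proposal) and call functions by catalog and name, e.g.
udfs.trim_lower(col) from SQL, with no per-function CREATE FUNCTION statements.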