Having a way to register UDFs that don't go through the Hive APIs would be great!

On Fri, Dec 14, 2018 at 1:30 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> 
> Hi everyone,
> I’ve been looking into improving how users of our Spark platform register
> and use UDFs and I’d like to discuss a few ideas for making this easier.
> 
> The motivation is the use case of defining a UDF from Spark SQL or
> PySpark. We want to make it easy to write JVM UDFs and use them from both
> SQL and Python. Python UDFs work great in most cases, but we occasionally
> don't want to pay the cost of shipping data to Python and processing it
> there, so we want to make it easy to register UDFs that run in the JVM.
> 
> There is already SQL syntax to create a function from a JVM class (
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html
> ) that would work, but it requires using the Hive UDF API instead of
> Spark's simpler Scala API. It also requires argument translation and
> doesn't support codegen. Beyond the API and performance problems, it is
> annoying to have to register every function individually with a CREATE
> FUNCTION statement.
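> 
> For comparison, here is a rough sketch of the two styles (class and
> function names are just illustrative):
> 
> // Hive UDF API: a class with an evaluate method, registered with
> // CREATE FUNCTION and invoked through Hive's argument translation
> import org.apache.hadoop.hive.ql.exec.UDF
> 
> class UpperUDF extends UDF {
>   def evaluate(s: String): String =
>     if (s == null) null else s.toUpperCase
> }
> 
> // Spark's Scala API: wrap a plain Scala function, no Hive dependency
> import org.apache.spark.sql.functions.udf
> 
> val upper = udf((s: String) => if (s == null) null else s.toUpperCase)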
> 
> The alternative that I’d like to propose is to add a way to register a
> named group of functions using the proposed catalog plugin API.
> 
> For anyone unfamiliar with the proposed catalog plugins, the basic idea is
> to load and configure plugins using a simple property-based scheme. Those
> plugins expose functionality through mix-in interfaces, like a TableCatalog
> that can create/drop/load/alter tables. Another interface could be a
> UDFCatalog that loads UDFs:
> 
> interface UDFCatalog extends CatalogPlugin {
>   UserDefinedFunction loadUDF(String name)
> }
> 
> To use this, I would create a UDFCatalog class that returns my Scala
> functions as UDFs. To look up functions, we would use both the catalog
> name and the function name.
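> 
> As a rough sketch of what such an implementation could look like (the
> package, class, and function names are placeholders, and whatever
> configuration methods CatalogPlugin itself requires are omitted):
> 
> import org.apache.spark.sql.expressions.UserDefinedFunction
> import org.apache.spark.sql.functions.udf
> 
> class MyUDFCatalog extends UDFCatalog {
>   // a named group of Scala functions exposed as Spark UDFs
>   private val udfs: Map[String, UserDefinedFunction] = Map(
>     "upper" -> udf((s: String) => if (s == null) null else s.toUpperCase),
>     "add_one" -> udf((i: Int) => i + 1)
>   )
> 
>   override def loadUDF(name: String): UserDefinedFunction =
>     udfs.getOrElse(name,
>       throw new NoSuchElementException(s"No such UDF: $name"))
> }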
> 
> This would allow my users to write Scala UDF instances, package them using
> a UDFCatalog class (provided by me), and easily use them in Spark with a
> few configuration options to add the catalog to their environment.
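> 
> Concretely, adding the catalog could come down to one configuration
> property pointing at the implementation (the property name here just
> follows the proposed plugin scheme, and the class name is a placeholder):
> 
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder()
>   // load MyUDFCatalog under the catalog name "udfs"
>   .config("spark.sql.catalog.udfs", "com.example.MyUDFCatalog")
>   .getOrCreate()
> 
> Functions would then be referenced by catalog name and function name,
> something like udfs.upper.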
> 
> This would also allow me to expose UDF libraries easily in my
> configuration, like Brickhouse (
> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Leveraging-Brickhouse-in-Spark2-pivot/m-p/59943
> ), without users needing to make sure the jar is loaded or to register
> individual functions.
> 
> Any thoughts on this high-level approach? I know that this ignores things
> like creating and storing functions in a FunctionCatalog, and we'd have to
> solve challenges with function naming (whether there is a db component).
> Right now I’d like to think through the overall idea and not get too
> focused on those details.
> 
> Thanks,
> 
> rb
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>
