paleolimbot commented on code in PR #73:
URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2135925472
##########
content/blog/2025-06-09-metadata-handling.md:
##########
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions
+date: 2025-06-09
+author: Tim Saucer
+categories: [core]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}x
+-->
+
+[DataFusion 48.0.0] introduced a change in the interface for writing custom functions
+which enables a variety of interesting improvements. Now users can access additional
+data about the input columns to functions, such as their nullability and metadata. This
+enables processing of extension types as well as a wide variety of other use cases.
+
+TODO: UPDATE LINKS
+
+[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3
+
+# Why metadata handling is important
+
+Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each
+[Field] in this `Schema` contains a name, data type, nullability, and metadata. The
+metadata is specified as a map of key-value pairs of strings. In the new
+implementation, during processing of all user defined functions we pass the input
+field information.

Review Comment:
```suggestion
metadata is specified as a map of key-value pairs of strings. This extra metadata is
used by Arrow implementations to implement [extension types] and can also be used to
add use case-specific context to a column of values where the formality of an
extension type is not required.

In previous versions of DataFusion, field metadata was propagated through certain
operations (e.g., renaming or selecting a column) but was not accessible to others
(e.g., scalar, window, or aggregate function calls). In the new implementation, during
processing of all user defined functions we pass the input field information and allow
user defined function implementations to return field information to the caller.

[extension types]: https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
```
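For readers skimming this thread, a small pyarrow snippet (purely illustrative, not part of the post under review) shows what these key-value pairs look like on a `Field`, and how the same mechanism carries the `ARROW:extension:name` marker that the linked extension-types spec describes:

```python
import pyarrow as pa

# Arbitrary key/value metadata: use case-specific context attached to a column.
mass = pa.field("mass", pa.float64(), metadata={"units": "kg"})
print(mass.metadata)  # keys and values come back as bytes, e.g. b'units' -> b'kg'

# Extension types ride on the same mechanism: the reserved
# "ARROW:extension:name" key names the logical type of the column.
uuid_field = pa.field(
    "id",
    pa.binary(16),
    metadata={"ARROW:extension:name": "arrow.uuid"},
)
print(uuid_field.metadata[b"ARROW:extension:name"])  # b'arrow.uuid'
```

In the new UDF interface described by the suggestion above, that same `Field` (metadata included) is what a function implementation gets to see for each argument and gets to return for its output.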
##########
content/blog/2025-06-09-metadata-handling.md:
##########
+It is often desirable to write a generic function for reuse. With the prior version of
+user defined functions, we only had access to the `DataType` of the input columns. This
+works well for some features that only rely on the types of data. Other use cases may
+need additional information that describes the data.
+
+For example, suppose I write a function that computes the force of gravity on an object
+based on it's mass. The general equation is `F = m * g` where `g = 9.8 m/s`. Suppose
+our documentation for the function specifies the output will be in Newtons. This is only
+valid if the input unit is in kilograms. With our metadata enhancement, we could update
+this function to now evaluate the input units, perform any kind of required
+transformation, and give consistent output every time. We could also have the function
+return an error if an invalid input was given, such as providing an input where the
+metadata says the units are in `meters` instead of a unit of mass.

Review Comment:
I wonder if we could turn this into a code example with DataFusion (Python?) UDFs to make
it more concrete (I can help). Maybe a UDF called `uuid_version` or `uuid_timestamp` that
extracts the embedded version or timestamp off of a UUID type (and a `uuid()` generating
function)? (pyarrow and DuckDB both understand the arrow.uuid extension type out of the
box, which facilitates a nice interchange example where the uuid-ness isn't lost at the
edges).

The arbitrary key/value metadata use case is cool too (and I get that it's the use case
that motivated this whole thing from your end!) but it's harder to find an in-the-wild
example where a user can leverage this out of the box. The places I have run into this in
the wild are basically data sources that write things there (like perhaps rerun) whose
provider didn't know about extension types (e.g., the API Snowflake uses to get data from
the server to its Python connector uses field metadata to communicate the Snowflake type
information, whereas BigQuery's Arrow API uses extension types to communicate its type
information).
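Sketching that idea a little further: the kernel a `uuid_version` UDF would wrap is tiny. The snippet below uses only pyarrow and the standard library; the DataFusion registration step (and whether the Python bindings surface the new field information there) is deliberately left out, and `uuid_version_kernel` is just a hypothetical helper name:

```python
import uuid

import pyarrow as pa


def uuid_version_kernel(storage: pa.Array) -> pa.Array:
    """Pull the RFC 4122 version nibble out of 16-byte UUID storage values."""
    versions = []
    for scalar in storage:
        value = scalar.as_py()  # None for nulls, otherwise 16 raw bytes
        versions.append(None if value is None else uuid.UUID(bytes=value).version)
    return pa.array(versions, type=pa.uint8())


# Build a column the way a uuid()-style generating function might:
# fixed-size binary storage that an arrow.uuid extension type would wrap.
storage = pa.array([uuid.uuid4().bytes for _ in range(3)], type=pa.binary(16))
print(uuid_version_kernel(storage))  # every value is 4 for uuid4-generated ids
```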
##########
content/blog/2025-06-09-metadata-handling.md:
##########
+# How to use metadata in user defined functions
+
+Using input metadata occurs in two different phases of a user defined function. Both during
+the planning phase and execution, we have access to these field information. This allows
+the user to determine the appropriate output fields during planning and to validate the
+input. For other use cases, it may only be necessary to access these fields during execution.
+We leave this open to the user.
+
+For all types of user defined functions we now evaluate the output [Field] as well. You can
+specify this to create your own metadata from your functions or to pass through metadata from
+one or more of your inputs.
+
+In addition to metadata the input field information carries nullability. With these you can
+create more expressive nullability of your output data instead of having a single output.
+For example, you could write a function to convert a string to uppercase. If we know the
+input field is non-nullable, then we can set the output field to non-nullable as well.
+
+# Extension types
+
+TODO
+
+# Working with literals
+
+TODO

Review Comment:
The place where I use this is finding my values in optimizer rules (for example, a
`cast(uuid_val, String)` could be replaced with a function that prettifies the UUID in the
way UUIDs are usually prettified). That's perhaps too complex for this post (perhaps this
section doesn't need an example).
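For context on what such a rewrite would compute per value, the "usual" prettification is just the canonical hyphenated rendering of the 16 storage bytes. A throwaway sketch in plain Python, with `prettify_uuid` as an illustrative name only (no DataFusion optimizer API involved):

```python
import uuid


def prettify_uuid(storage_bytes: bytes) -> str:
    # 16 raw storage bytes -> the familiar 8-4-4-4-12 hexadecimal form.
    return str(uuid.UUID(bytes=storage_bytes))


print(prettify_uuid(uuid.uuid4().bytes))
```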
##########
content/blog/2025-06-09-metadata-handling.md:
##########
@@ -0,0 +1,98 @@
+---
+layout: post
+title: Metadata handling in user defined functions

Review Comment:
Maybe: "Field metadata and extension type support in user defined functions"?

##########
content/blog/2025-06-09-metadata-handling.md:
##########
+In addition to metadata the input field information carries nullability. With these you can
+create more expressive nullability of your output data instead of having a single output.
+For example, you could write a function to convert a string to uppercase. If we know the
+input field is non-nullable, then we can set the output field to non-nullable as well.
+

Review Comment:
Perhaps the first example could be high-level Python (where pyarrow takes care of the field
metadata automagically), and this example could be Rust (where we'd have to check the
content of the fields and/or assign them).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org