alamb commented on code in PR #73: URL: https://github.com/apache/datafusion-site/pull/73#discussion_r2134828346
########## content/blog/2025-06-09-metadata-handling.md: ########## @@ -0,0 +1,98 @@ +--- +layout: post +title: Metadata handling in user defined functions Review Comment: I think we could make the title a bit more specific. Maybe something like ```suggestion title: Custom types in DataFusion using Metadata ``` ########## content/blog/2025-06-09-metadata-handling.md: ########## @@ -0,0 +1,98 @@ +--- +layout: post +title: Metadata handling in user defined functions +date: 2025-06-09 +author: Tim Saucer +categories: [core] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %}x +--> + +[DataFusion 48.0.0] introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access additional +data about the input columns to functions, such as their nullability and metadata. This +enables processing of extension types as well as a wide variety of other use cases. + +TODO: UPDATE LINKS + +[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3 + +# Why metadata handling is important + +Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each +[Field] in this `Schema` contains a name, data type, nullability, and metadata. The +metadata is specified as a map of key-value pairs of strings. In the new +implementation, during processing of all user defined functions we pass the input +field information. + +It is often desirable to write a generic function for reuse. With the prior version of +user defined functions, we only had access to the `DataType` of the input columns. This +works well for some features that only rely on the types of data. Other use cases may +need additional information that describes the data. + +For example, suppose I write a function that computes the force of gravity on an object +based on it's mass. The general equation is `F = m * g` where `g = 9.8 m/s`. Suppose +our documentation for the function specifies the output will be in Newtons. This is only +valid if the input unit is in kilograms. With our metadata enhancement, we could update +this function to now evaluate the input units, perform any kind of required +transformation, and give consistent output every time. We could also have the function +return an error if an invalid input was given, such as providing an input where the +metadata says the units are in `meters` instead of a unit of mass. + +One common application of metadata handling is understanding encoding of a blob of data. +Suppose you have a column that contains image data. You could use metadata to specify +the encoding of the image data so you could use the appropriate decoder. + +[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field + +# How to use metadata in user defined functions + +Using input metadata occurs in two different phases of a user defined function. Both during +the planning phase and execution, we have access to these field information. This allows +the user to determine the appropriate output fields during planning and to validate the +input. For other use cases, it may only be necessary to access these fields during execution. +We leave this open to the user. + +For all types of user defined functions we now evaluate the output [Field] as well. You can +specify this to create your own metadata from your functions or to pass through metadata from +one or more of your inputs. + +In addition to metadata the input field information carries nullability. With these you can +create more expressive nullability of your output data instead of having a single output. +For example, you could write a function to convert a string to uppercase. If we know the +input field is non-nullable, then we can set the output field to non-nullable as well. + +# Extension types + +TODO + +# Working with literals + +TODO + +# Thanks to our sponsor + +We would like to thank [Rerun.io] for sponsoring the development of this work. [Rerun.io] +is building a data visualization system for Physical AI and uses metadata to specify +context about columns in Arrow record batches. + +[Rerun.io]: https://rerun.io + +# Conclusion Review Comment: I recommend ending with a 🎣 expedition as always "This feature is still evolving and we would love you to come test it out, help us implement improvements, and document it. We are a welcoming community, etc" Basically the standard plea for help :) ########## content/blog/2025-06-09-metadata-handling.md: ########## @@ -0,0 +1,98 @@ +--- +layout: post +title: Metadata handling in user defined functions +date: 2025-06-09 +author: Tim Saucer +categories: [core] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %}x +--> + +[DataFusion 48.0.0] introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access additional +data about the input columns to functions, such as their nullability and metadata. This +enables processing of extension types as well as a wide variety of other use cases. + +TODO: UPDATE LINKS + +[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3 + +# Why metadata handling is important + +Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each Review Comment: I recommend making a diagram showing the relationship between schema, DataType, Nullabuility, and metadata I can commission an ASCII art version for you if you like :) ########## content/blog/2025-06-09-metadata-handling.md: ########## @@ -0,0 +1,98 @@ +--- +layout: post +title: Metadata handling in user defined functions +date: 2025-06-09 +author: Tim Saucer +categories: [core] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %}x +--> + +[DataFusion 48.0.0] introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access additional +data about the input columns to functions, such as their nullability and metadata. This +enables processing of extension types as well as a wide variety of other use cases. + +TODO: UPDATE LINKS + +[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3 + +# Why metadata handling is important + +Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each +[Field] in this `Schema` contains a name, data type, nullability, and metadata. The +metadata is specified as a map of key-value pairs of strings. In the new +implementation, during processing of all user defined functions we pass the input +field information. + +It is often desirable to write a generic function for reuse. With the prior version of +user defined functions, we only had access to the `DataType` of the input columns. This +works well for some features that only rely on the types of data. Other use cases may +need additional information that describes the data. + +For example, suppose I write a function that computes the force of gravity on an object +based on it's mass. The general equation is `F = m * g` where `g = 9.8 m/s`. Suppose +our documentation for the function specifies the output will be in Newtons. This is only +valid if the input unit is in kilograms. With our metadata enhancement, we could update +this function to now evaluate the input units, perform any kind of required +transformation, and give consistent output every time. We could also have the function +return an error if an invalid input was given, such as providing an input where the +metadata says the units are in `meters` instead of a unit of mass. + +One common application of metadata handling is understanding encoding of a blob of data. +Suppose you have a column that contains image data. You could use metadata to specify +the encoding of the image data so you could use the appropriate decoder. + +[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field + +# How to use metadata in user defined functions + +Using input metadata occurs in two different phases of a user defined function. Both during +the planning phase and execution, we have access to these field information. This allows +the user to determine the appropriate output fields during planning and to validate the +input. For other use cases, it may only be necessary to access these fields during execution. +We leave this open to the user. + +For all types of user defined functions we now evaluate the output [Field] as well. You can +specify this to create your own metadata from your functions or to pass through metadata from +one or more of your inputs. + +In addition to metadata the input field information carries nullability. With these you can +create more expressive nullability of your output data instead of having a single output. +For example, you could write a function to convert a string to uppercase. If we know the +input field is non-nullable, then we can set the output field to non-nullable as well. + +# Extension types Review Comment: Here would be a great lead in to the actual example ########## content/blog/2025-06-09-metadata-handling.md: ########## @@ -0,0 +1,98 @@ +--- +layout: post +title: Metadata handling in user defined functions +date: 2025-06-09 +author: Tim Saucer +categories: [core] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %}x +--> + +[DataFusion 48.0.0] introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access additional +data about the input columns to functions, such as their nullability and metadata. This +enables processing of extension types as well as a wide variety of other use cases. + +TODO: UPDATE LINKS + +[DataFusion 48.0.0]: https://github.com/apache/datafusion/tree/48.0.0-rc3 + +# Why metadata handling is important + +Data in Arrow record batches carry a `Schema` in addition to the Arrow arrays. Each +[Field] in this `Schema` contains a name, data type, nullability, and metadata. The +metadata is specified as a map of key-value pairs of strings. In the new +implementation, during processing of all user defined functions we pass the input +field information. + +It is often desirable to write a generic function for reuse. With the prior version of +user defined functions, we only had access to the `DataType` of the input columns. This +works well for some features that only rely on the types of data. Other use cases may +need additional information that describes the data. + +For example, suppose I write a function that computes the force of gravity on an object +based on it's mass. The general equation is `F = m * g` where `g = 9.8 m/s`. Suppose +our documentation for the function specifies the output will be in Newtons. This is only +valid if the input unit is in kilograms. With our metadata enhancement, we could update +this function to now evaluate the input units, perform any kind of required +transformation, and give consistent output every time. We could also have the function +return an error if an invalid input was given, such as providing an input where the +metadata says the units are in `meters` instead of a unit of mass. + +One common application of metadata handling is understanding encoding of a blob of data. +Suppose you have a column that contains image data. You could use metadata to specify +the encoding of the image data so you could use the appropriate decoder. + +[field]: https://arrow.apache.org/docs/format/Glossary.html#term-field + +# How to use metadata in user defined functions + +Using input metadata occurs in two different phases of a user defined function. Both during +the planning phase and execution, we have access to these field information. This allows +the user to determine the appropriate output fields during planning and to validate the +input. For other use cases, it may only be necessary to access these fields during execution. +We leave this open to the user. + +For all types of user defined functions we now evaluate the output [Field] as well. You can +specify this to create your own metadata from your functions or to pass through metadata from +one or more of your inputs. + +In addition to metadata the input field information carries nullability. With these you can +create more expressive nullability of your output data instead of having a single output. +For example, you could write a function to convert a string to uppercase. If we know the +input field is non-nullable, then we can set the output field to non-nullable as well. + +# Extension types + +TODO + +# Working with literals + +TODO + +# Thanks to our sponsor Review Comment: 💯 thank you rerun ❤️ ########## content/blog/2025-06-09-metadata-handling.md: ########## @@ -0,0 +1,98 @@ +--- +layout: post +title: Metadata handling in user defined functions +date: 2025-06-09 +author: Tim Saucer +categories: [core] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at +http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %}x +--> + +[DataFusion 48.0.0] introduced a change in the interface for writing custom functions +which enables a variety of interesting improvements. Now users can access additional +data about the input columns to functions, such as their nullability and metadata. This +enables processing of extension types as well as a wide variety of other use cases. Review Comment: I think technically it was possible in DataFusion 47 and earlier to specify nullability, but it was much less unified with the Arrow type system. I suggest we emphasize the core usecase some more here: supporting user defined data types, by annotating built in arrow types with addtional metadata -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org