Hello Arrow devs,

I'm working on some breaking changes to how the C++ implementation treats
the field names within ListType and MapType when comparing types for
equality. [1] I call these "internal field names" because, unlike the
fields in StructType, they often don't provide much information that isn't
already implied by their position. (Though Dewey did note some exceptions
to this in the Jira. [2])

The PR adds an option called "check_internal_field_names" that controls
whether these names are compared when two such types are checked for
equality. Currently, we do this inconsistently: we always compare them for
ListType but never for MapType.

In the C++ implementation, we have two equals methods: a strict one and a
loose one. The strict one, TypeEquals, checks field metadata by default,
while the loose one, DataType::Equals, does not. Given this precedent, I
made the default for "check_internal_field_names" align with the default
of the "check_metadata" flag in each method. Note that both settings are
configurable in either method; the methods just have different defaults.

The motivation for this work is to prepare for turning on compliant nested
types in Parquet. [3] Parquet requires that ListTypes are always written
with the "element" field name and has specific requirements for MapType as
well. This means these fields can lose their original names when
roundtripped through Parquet, so it's helpful to be able to check equality
while ignoring these field names.
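
To make the two defaults concrete, here is a toy sketch in plain C++. It
does not use Arrow at all: ToyField, ToyListType, ToyTypeEquals and
RoundTripExample are hypothetical stand-ins I made up for illustration, and
the parameter order is an assumption rather than the signature in the PR.
It only models the behavior described above: a loose method with both
checks off by default and a strict one with both checks on by default.

```cpp
#include <cassert>
#include <string>

// Toy stand-ins for illustration only; these are NOT Arrow's classes.
struct ToyField {
  std::string name;       // e.g. "item" or "element"
  std::string type_name;  // e.g. "int32"; a real field would hold a DataType
  std::string metadata;   // a real field would hold key/value metadata
};

struct ToyListType {
  ToyField value_field;  // the "internal" field of the list

  // Loose comparison: both checks are off by default (assumed defaults).
  bool Equals(const ToyListType& other, bool check_metadata = false,
              bool check_internal_field_names = false) const {
    const ToyField& a = value_field;
    const ToyField& b = other.value_field;
    if (a.type_name != b.type_name) return false;
    if (check_internal_field_names && a.name != b.name) return false;
    if (check_metadata && a.metadata != b.metadata) return false;
    return true;
  }
};

// Strict comparison: same logic, but both checks are on by default.
inline bool ToyTypeEquals(const ToyListType& left, const ToyListType& right,
                          bool check_metadata = true,
                          bool check_internal_field_names = true) {
  return left.Equals(right, check_metadata, check_internal_field_names);
}

// Models a list type whose value field was renamed by a Parquet roundtrip.
inline void RoundTripExample() {
  ToyListType original{{"item", "int32", ""}};
  ToyListType from_parquet{{"element", "int32", ""}};

  // Loose default ignores the internal field name, so the types match.
  assert(original.Equals(from_parquet));
  // Strict default compares the name, so the types differ.
  assert(!ToyTypeEquals(original, from_parquet));
  // Strict method with the name check switched off matches again.
  assert(ToyTypeEquals(original, from_parquet, /*check_metadata=*/true,
                       /*check_internal_field_names=*/false));
}
```

In this sketch a caller comparing a schema before and after a roundtrip
would either use the loose method or pass false for the name check
explicitly.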

Of course, changes like these can have unintended consequences, so I wanted
to alert other developers. If you have feedback or concerns, please reply
to this thread or comment on the PR.

Best,

Will Jones

[1] https://github.com/apache/arrow/pull/13851
[2] https://issues.apache.org/jira/browse/ARROW-14999?focusedCommentId=17581439&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17581439
[3] https://issues.apache.org/jira/browse/ARROW-14196
