Hi all,

I've spent some cycles on the collation discussion and made something more concrete to react to: a spec-change PR plus reference implementations (Go and Java).

- Spec change (apache/iceberg#16972): a "collation" annotation on string fields, and a data_file.collation_bounds field so collated columns stay prunable.
- Reference implementation in iceberg-go (apache/iceberg-go#1318): the full path end to end - schema annotation, collation-aware comparison (CLDR/UCA), collation bounds in the manifest, and version-gated data-file pruning, with an Avro round-trip and pruning tests.
- A lightweight Java POC (link below): the schema annotation plus a Collator-backed comparator, to match where the discussion is. I deliberately left the manifest/bounds side out of Java for now.

The design follows the original proposal but takes a few different turns, mostly to adopt what we learned in Delta. The ones I'd most like input on:

1 - Bounds store original values, not sort keys, tagged with a per-file collation version. ICU/CLDR sort keys aren't stable across versions, so storing keys ties every reader to one exact version; original values plus a per-file version (readers prune only on an exact match) degrade gracefully instead of breaking. The schema keeps the collation name unversioned so anyone can read.

2 - A provider-qualified identifier (icu.en_US-ci), leaving room for non-ICU collations like Spark's UTF8_LCASE, rather than assuming ICU as the sole provider.

3 - One structural question I don't have a strong opinion on yet: I put collation_bounds on data_file as a standalone v3 field, but field id 146 is already the v4 content_stats struct, and collation bounds might belong inside that typed-stats framework instead. Worth settling before we fix field ids.

The full set of differences and the reader/writer rules are in the PR description and the write-up.

Comments very welcome, both on the calls above and on whether the standalone-field vs content_stats direction is the right one.
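
To make points 1 and 2 easier to react to, here is a minimal Go sketch of the two rules: split a provider-qualified identifier, and prune a file only when its recorded collation version matches the reader's exactly. This is an illustration, not the code in apache/iceberg-go#1318. The names CollationID, ParseCollationID, FileBounds, CanSkipFile and foldCompare, the version strings, and the choice to split at the first dot are all assumptions of mine, and the lower-casing comparator is only a stand-in for a real CLDR/UCA collator.

```go
package main

import (
	"fmt"
	"strings"
)

// CollationID is a hypothetical parsed form of a provider-qualified
// identifier such as "icu.en_US-ci".
type CollationID struct {
	Provider string // e.g. "icu"
	Name     string // e.g. "en_US-ci"
}

// ParseCollationID splits the identifier at the first dot (an assumed rule)
// and rejects identifiers that carry no provider.
func ParseCollationID(s string) (CollationID, error) {
	provider, name, ok := strings.Cut(s, ".")
	if !ok || provider == "" || name == "" {
		return CollationID{}, fmt.Errorf("collation %q is not provider-qualified", s)
	}
	return CollationID{Provider: provider, Name: name}, nil
}

// FileBounds holds the original (not sort-key) lower and upper values for
// one collated column in one data file, plus the collation version the
// writer used. Field names are illustrative.
type FileBounds struct {
	Lower, Upper     string
	CollationVersion string
}

// Compare returns <0, 0 or >0, like strings.Compare, under some collation.
type Compare func(a, b string) int

// foldCompare is a stand-in comparator (simple lower-casing). A real
// implementation would delegate to a CLDR/UCA collator.
func foldCompare(a, b string) int {
	return strings.Compare(strings.ToLower(a), strings.ToLower(b))
}

// CanSkipFile reports whether a file can be skipped for "column = value".
// On any version mismatch the bounds are ignored and the file is read.
func CanSkipFile(b FileBounds, readerVersion string, cmp Compare, value string) bool {
	if b.CollationVersion != readerVersion {
		return false // cannot trust the ordering, so do not prune
	}
	return cmp(value, b.Lower) < 0 || cmp(value, b.Upper) > 0
}

func main() {
	id, err := ParseCollationID("icu.en_US-ci")
	if err != nil {
		panic(err)
	}
	fmt.Println(id.Provider, id.Name) // icu en_US-ci

	b := FileBounds{Lower: "Apple", Upper: "mango", CollationVersion: "76.1"}
	fmt.Println(CanSkipFile(b, "76.1", foldCompare, "BANANA")) // false: inside the bounds
	fmt.Println(CanSkipFile(b, "76.1", foldCompare, "zebra"))  // true: above the upper bound
	fmt.Println(CanSkipFile(b, "75.1", foldCompare, "zebra"))  // false: version mismatch
}
```

The last call is the "degrade gracefully" case: a reader on a different collation version still returns correct results, it just reads the file instead of pruning it.
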

Best,
Andrei

- original proposal: https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0
- spec change: https://github.com/apache/iceberg/pull/16972
- POC in go: https://github.com/apache/iceberg-go/pull/1318
- java POC: https://github.com/laskoviymishka/iceberg/tree/prototype/collation-support

On Mon, Mar 30, 2026 at 10:54 PM Alexander Löser <[email protected]> wrote:

> Hi Andrei,
>
> I'm glad you're interested. Looking forward to collaborate with you!
> Thanks for all the feedback here and in the doc. I only had a quick
> glance, but I think you raised some good points. I'll address/respond to
> your comments as soon as I get the chance, hopefully tomorrow.
> I think you also left some comments in this mail that are not yet in the
> doc - I'll move those to a dedicated section at the end of the doc, so we
> can use the doc as a single source of truth/discussion.
>
> > Happy to share our Delta design doc and implementation learnings in more
> > detail.
>
> Sure, sounds good :)
>
> Best,
> Alex
>
> On 3/29/26 01:25, Andrei Tserakhau via dev wrote:
>
> Hi Alexander,
>
> This looks really interesting. We've been working on collation support in
> Delta and have shipped it in production for some time, so this is an area
> we care about a lot. If this proposal moves forward we'd be happy to
> collaborate on the design and implementation.
>
> The pseudo-field approach for collation metrics is clean and composes well
> with existing Iceberg infrastructure. The specifier coverage is
> comprehensive.
>
> A few areas worth discussing as this evolves:
>
> 1 - Sort key stability and versioning
>
> ICU sort keys are not stable across versions, so a pinned ICU version bump
> in a future Iceberg release would invalidate all existing collation
> metrics. In multi-engine environments, requiring all engines to converge on
> one ICU version is unrealistic.
>
> We store original string values instead of sort keys and allow per-file
> version annotations -- worth discussing whether something similar could
> work here.
>
> 2 - Provider abstraction
>
> The proposal assumes ICU as the sole provider, but Spark ships non-ICU
> collations like UTF8_LCASE that are widely used. A provider or namespace
> layer would prevent name collisions and support engine-specific collations
> without future spec changes.
>
> 3 - Operational surface
>
> A few things that turned out correctness-critical in our implementation:
> partition transforms on collated columns (collation-equal but byte-distinct
> values in different directories), sort order semantics, equality deletes
> under collation, and Parquet filter pushdown (must be disabled since
> Parquet has no collation concept).
>
> These don't all need to be solved in v1 but would help to scope them.
>
> 4 - Smaller items (nit's)
>
> UTF-8 bounds for the original field id should be "must write" not "should"
> -- otherwise backward compat breaks for non-aware engines. Engine fallback
> behavior (case-sensitive vs older ICU vs fail) could use a recommended
> preference order to avoid divergent results across engines. The collation
> specifier syntax would benefit from a formal grammar.
>
> ---
>
> Happy to share our Delta design doc and implementation learnings in more
> detail. Looking forward to the discussion.
>
> Best,
> Andrei
>
> On Sat, Mar 28, 2026 at 11:49 PM Alexander Löser <[email protected]>
> wrote:
>
>> Hi everyone,
>>
>> this is my first interaction with the Iceberg community, so here a few
>> words about myself:
>> - I'm Alex, a Berlin-based software engineer
>> - I've been working at Snowflake for 4 years now
>> - I spend most of my time on data types, particularly binary, strings and
>> collations.
>>
>> I'd like to start a discussion about adding collations to the Iceberg
>> spec.
>>
>> Conceptually, collations are an annotation on the string data type.
>> By default, most engines perform string operations case-sensitively.
>> Collations allow specifying alternative comparison rules. This is useful
>> for achieving, e.g., case- or accent-insensitive string operations, or
>> language-specific string sorting.
>> Collations are supported by many engines: Databricks
>> <https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation>,
>> Spark
>> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.collate.html>,
>> Snowflake <https://docs.snowflake.com/en/sql-reference/collation>, Oracle
>> <https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/COLLATION.html>
>> - to name just a few - this list is not complete.
>>
>> In Snowflake, we see heavy use of the collation feature. Several users
>> have approached us, mentioning they want to migrate to Iceberg tables, but
>> are currently blocked by Iceberg's lack of collation support.
>>
>> Given the widespread support for collations across different engines, I
>> believe introducing collations to Iceberg will increase interoperability
>> and boost its adoption.
>> I'd be curious about your thoughts.
>>
>> *Goal of the proposal*
>> - Support collation specifications for columns
>> - Define how collation bounds should be stored - UTF-8 based bounds are
>> not useful for collated columns
>>
>> *Required Changes*
>> - Extend the schema to let (string) fields be annotated with a collation
>>
>> More details can be found in this doc
>> <https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0#heading=h.y1ant4w2163k>.
>>
>> I'm also hoping to present the idea in the next community sync.
>>
>> Best, Alex
