Re: [Discussion] Collation Support

Andrei Tserakhau via dev Fri, 26 Jun 2026 03:29:15 -0700

Hi all,

I've spent some cycles on the collation discussion and put together something
more concrete to react to: a spec-change PR plus reference implementations (Go
and Java).


- Spec change (apache/iceberg#16972): a "collation" annotation on string
fields, and a data_file.collation_bounds field so collated columns stay
prunable.
- Reference implementation in iceberg-go (apache/iceberg-go#1318): the full
path end to end - schema annotation, collation-aware comparison (CLDR/UCA),
collation bounds in the manifest, and version-gated data-file pruning, with
an Avro round-trip and pruning tests.
- A lightweight Java POC (link below): the schema annotation plus a
Collator-backed comparator, to match where the discussion is. I
deliberately left the manifest/bounds side out of Java for now.

The design follows the original proposal but takes a few different turns,
mostly to adopt what we learned in Delta. The ones I'd most like input on:

1 - Bounds store original values, not sort keys, tagged with a per-file
collation version. ICU/CLDR sort keys aren't stable across versions, so
storing keys ties every reader to one exact version; original values plus a
per-file version (readers prune only on an exact match) degrade gracefully
instead of breaking. The schema keeps the collation name unversioned so
anyone can read.

2 - A provider-qualified identifier (icu.en_US-ci), leaving room for
non-ICU collations like Spark's UTF8_LCASE, rather than assuming ICU as the
sole provider.

3 - One structural question I don't have a strong opinion on yet: I put
collation_bounds on data_file as a standalone v3 field, but field id 146 is
already the v4 content_stats struct, and collation bounds might belong
inside that typed-stats framework instead. Worth settling before we fix
field ids.
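
To make points 1 and 2 concrete, here is a minimal Python sketch of the reader-side rules as described above. It is not taken from the PR or from iceberg-go. The names (CollationId, CollationBounds, parse_collation_id, may_contain) are hypothetical, the "provider is everything before the first dot" grammar is an assumption, and str.casefold is only a stand-in for a real ICU/CLDR case-insensitive comparison.

```python
from dataclasses import dataclass
from typing import Callable, NamedTuple, Optional


class CollationId(NamedTuple):
    provider: str  # e.g. "icu"
    name: str      # e.g. "en_US-ci"


def parse_collation_id(identifier: str) -> CollationId:
    """Split '<provider>.<name>' at the first dot (assumed grammar)."""
    provider, sep, name = identifier.partition(".")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>.<name>', got {identifier!r}")
    return CollationId(provider, name)


@dataclass(frozen=True)
class CollationBounds:
    lower: str    # original string value, not a sort key
    upper: str    # original string value, not a sort key
    version: str  # collation version the writer used for this file


def may_contain(
    bounds: Optional[CollationBounds],
    value: str,
    reader_version: str,
    sort_key: Callable[[str], str],
) -> bool:
    """Return False only when the file can safely be skipped for `col = value`."""
    if bounds is None or bounds.version != reader_version:
        # No bounds, or written under a different collation version:
        # the ordering cannot be trusted, so fall back to scanning the file.
        return True
    # Versions match exactly: recompute keys from the original values.
    return sort_key(bounds.lower) <= sort_key(value) <= sort_key(bounds.upper)
```

Because the bounds hold original values, a reader on a different version loses pruning for that file but still returns correct results, which is the graceful degradation described in point 1.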

The full set of differences and the reader/writer rules are in the PR
description and the write-up. Comments very welcome — both on the calls
above and on whether the standalone-field vs content_stats direction is the
right one.

Best, Andrei

- original proposal:
https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0
- spec change: https://github.com/apache/iceberg/pull/16972
- POC in go: https://github.com/apache/iceberg-go/pull/1318
- java POC:
https://github.com/laskoviymishka/iceberg/tree/prototype/collation-support

On Mon, Mar 30, 2026 at 10:54 PM Alexander Löser <[email protected]>
wrote:

> Hi Andrei,
>
> I'm glad you're interested. Looking forward to collaborating with you!
> Thanks for all the feedback here and in the doc. I only had a quick
> glance, but I think you raised some good points. I'll address/respond to
> your comments as soon as I get the chance, hopefully tomorrow.
> I think you also left some comments in this mail that are not yet in the
> doc - I'll move those to a dedicated section at the end of the doc, so we
> can use the doc as a single source of truth/discussion.
>
> > Happy to share our Delta design doc and implementation learnings in more
> > detail.
>
> Sure, sounds good :)
>
> Best,
> Alex
> On 3/29/26 01:25, Andrei Tserakhau via dev wrote:
>
> Hi Alexander,
>
> This looks really interesting. We've been working on collation support in
> Delta and have shipped it in production for some time, so this is an area
> we care about a lot. If this proposal moves forward we'd be happy to
> collaborate on the design and implementation.
>
> The pseudo-field approach for collation metrics is clean and composes well
> with existing Iceberg infrastructure. The specifier coverage is
> comprehensive.
>
> A few areas worth discussing as this evolves:
>
> 1 - Sort key stability and versioning
>
> ICU sort keys are not stable across versions, so a pinned ICU version bump
> in a future Iceberg release would invalidate all existing collation
> metrics. In multi-engine environments, requiring all engines to converge on
> one ICU version is unrealistic.
>
> We store original string values instead of sort keys and allow per-file
> version annotations -- worth discussing whether something similar could
> work here.
>
> 2 - Provider abstraction
>
> The proposal assumes ICU as the sole provider, but Spark ships non-ICU
> collations like UTF8_LCASE that are widely used. A provider or namespace
> layer would prevent name collisions and support engine-specific collations
> without future spec changes.
>
> 3 - Operational surface
>
> A few things that turned out correctness-critical in our implementation:
> partition transforms on collated columns (collation-equal but byte-distinct
> values in different directories), sort order semantics, equality deletes
> under collation, and Parquet filter pushdown (must be disabled since
> Parquet has no collation concept).
>
> These don't all need to be solved in v1, but it would help to scope them.
>
> 4 - Smaller items (nits)
>
> UTF-8 bounds for the original field id should be "must write" not "should"
> -- otherwise backward compat breaks for non-aware engines. Engine fallback
> behavior (case-sensitive vs older ICU vs fail) could use a recommended
> preference order to avoid divergent results across engines. The collation
> specifier syntax would benefit from a formal grammar.
>
> ---
>
> Happy to share our Delta design doc and implementation learnings in more
> detail. Looking forward to the discussion.
>
> Best,
> Andrei
>
> On Sat, Mar 28, 2026 at 11:49 PM Alexander Löser <[email protected]>
> wrote:
>
>> Hi everyone,
>>
>> this is my first interaction with the Iceberg community, so here are a few
>> words about myself:
>> - I'm Alex, a Berlin-based software engineer
>> - I've been working at Snowflake for 4 years now
>> - I spend most of my time on data types, particularly binary, strings and
>> collations.
>>
>> I'd like to start a discussion about adding collations to the Iceberg
>> spec.
>>
>> Conceptually, collations are an annotation on the string data type. By
>> default, most engines perform string operations case-sensitively.
>> Collations allow specifying alternative comparison rules. This is useful
>> for achieving, e.g., case- or accent-insensitive string operations, or
>> language-specific string sorting.
>> Collations are supported by many engines: Databricks
>> <https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation>,
>> Spark
>> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.collate.html>,
>> Snowflake <https://docs.snowflake.com/en/sql-reference/collation>, Oracle
>> <https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/COLLATION.html>,
>> to name just a few. This list is not complete.
>>
>> In Snowflake, we see heavy use of the collation feature. Several users
>> have approached us, mentioning they want to migrate to Iceberg tables, but
>> are currently blocked by Iceberg's lack of collation support.
>>
>> Given the widespread support for collations across different engines, I
>> believe introducing collations to Iceberg will increase interoperability
>> and boost its adoption.
>> I'd be curious about your thoughts.
>>
>> *Goal of the proposal*
>> - Support collation specifications for columns
>> - Define how collation bounds should be stored - UTF-8 based bounds are
>> not useful for collated columns
>>
>> *Required Changes*
>> - Extend the schema to let (string) fields be annotated with a collation
>>
>> More details can be found in this doc
>> <https://docs.google.com/document/d/1m8b7u97uteHYjXk-4DNglJSpQO8OcZOCzW2tApCNTW4/edit?tab=t.0#heading=h.y1ant4w2163k>
>> .
>>
>> I'm also hoping to present the idea in the next community sync.
>>
>> Best, Alex
>>
>>
>>
