Hi all,

Issue #16238 reports that hidden partition fields get different default
names depending on how the partition is created, and I'd like the
community's view on whether and how to reconcile this before I put up a PR.

The two paths diverge for the parameterized transforms (bucket and
truncate):

- Creating a table (including via Spark createOrReplace, which routes
through Spark3Util.toPartitionSpec -> PartitionSpec.Builder) generates
names without the parameter: col_bucket, col_trunc.
- ALTER TABLE ADD PARTITION FIELD (UpdatePartitionSpec ->
BaseUpdatePartitionSpec.PartitionNameGenerator) generates names with the
parameter: col_bucket_<n>, col_trunc_<width>.
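
To make the difference concrete, here is a minimal standalone Java sketch of
the two conventions. It is not the Iceberg code: creationName and alterName
are hypothetical helpers that only reproduce the name shapes described
above, and the column names are made up for illustration.

```java
public class DefaultPartitionNames {
    // Hypothetical helper: mirrors the creation-path convention,
    // where the transform parameter is left out of the default name.
    static String creationName(String sourceColumn, String transform, int param) {
        return sourceColumn + "_" + transform;
    }

    // Hypothetical helper: mirrors the ALTER-path convention,
    // where the transform parameter is appended to the default name.
    static String alterName(String sourceColumn, String transform, int param) {
        return sourceColumn + "_" + transform + "_" + param;
    }

    public static void main(String[] args) {
        // bucket[16] on column "id"
        System.out.println(creationName("id", "bucket", 16)); // id_bucket
        System.out.println(alterName("id", "bucket", 16));    // id_bucket_16

        // truncate[4] on column "name"
        System.out.println(creationName("name", "trunc", 4)); // name_trunc
        System.out.println(alterName("name", "trunc", 4));    // name_trunc_4
    }
}
```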

Time-based transforms (year/month/day/hour) already agree across both
paths. For reference, the partition-spec example in the spec
(format/spec.md) uses the no-parameter form: "name": "id_bucket" for a
bucket[16] transform.

Each form is currently asserted by tests on its own path, so this looks
like a deliberate but unreconciled difference rather than an accidental
bug. Standardizing it would be a user-visible, cross-engine behavior
change. I see two coherent directions:

1. Align ALTER to the creation/spec form (col_bucket). Smaller footprint
and matches the spec example. The _<n> suffix in BaseUpdatePartitionSpec
mainly disambiguates adding two different bucket widths to the same source
column within one spec (the case pinned by testAddMultipleBuckets);
collisions with previously-dropped (void) fields are already handled
separately by renaming them to name_<fieldId>. So this direction could keep
the bare col_bucket name and append _<n> only on an actual name conflict,
preserving the multi-width case while making the common ALTER match
creation.

2. Align creation to the ALTER form (col_bucket_<n>), which is the
preference noted in the issue and makes the builder strictly more
expressive. The downside is that it changes the default name for the most
common path across every engine, deviates from the spec example, and
touches a large number of existing tests.

A third option is to leave the behavior as-is and document it as intended.
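
For option 1, the conflict-only suffix could look roughly like the sketch
below. This is an illustration under my own assumptions, not a proposed
patch: ConflictOnlySuffix and chooseName are hypothetical names, the set of
existing names stands in for whatever the spec builder tracks, and the real
change would live in BaseUpdatePartitionSpec.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class ConflictOnlySuffix {
    // Hypothetical helper: prefer the bare name (col_bucket) and fall back
    // to the parameterized name (col_bucket_<n>) only when the bare name is
    // already taken by another field in the same spec.
    static String chooseName(Set<String> existingNames, String sourceColumn,
                             String transform, int param) {
        String bare = sourceColumn + "_" + transform;
        if (!existingNames.contains(bare)) {
            return bare;
        }
        String suffixed = bare + "_" + param;
        if (existingNames.contains(suffixed)) {
            // Assumption: a second clash is reported rather than resolved.
            throw new IllegalArgumentException(
                "Partition field name already in use: " + suffixed);
        }
        return suffixed;
    }

    public static void main(String[] args) {
        Set<String> names = new LinkedHashSet<>();

        // First bucket width on "id" gets the bare name.
        names.add(chooseName(names, "id", "bucket", 16));
        // Second width on the same column conflicts, so it gets the suffix.
        names.add(chooseName(names, "id", "bucket", 8));

        System.out.println(names); // [id_bucket, id_bucket_8]
    }
}
```

One consequence of this approach is that the result depends on order:
whichever width is added first gets the bare name, and only later widths
carry the suffix.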

Separately, the "add multiple widths on the same source column" behavior is
currently covered by a positive test only for bucket
(testAddMultipleBuckets), not for truncate, even though both behave
identically.

My inclination is option 1 (with the conflict-only suffix), but I don't
want to pick a direction unilaterally given it touches a long-standing,
cross-engine convention. Could folks weigh in on the preferred direction?

Issue: https://github.com/apache/iceberg/issues/16238

Thanks,
Vova Kolmakov (wombatu-kun)
