I've put up a PR for this here: https://github.com/apache/iceberg/pull/16527

On Thu, 21 May 2026 at 01:24, Yufei Gu <[email protected]> wrote:
> Hi Xander,
>
> Thanks for digging into this and documenting the current behavior so
> clearly.
>
> +1 on putting these formats into the spec. At least from an
> interoperability perspective, the current situation creates a practical gap
> between "spec compliant" and "cross-implementation compatible."
>
> Yufei
>
>
> On Wed, May 20, 2026 at 3:14 PM Alexander Bailey <[email protected]>
> wrote:
>
>> Hi all,
>>
>> While implementing table encryption in iceberg-rust, we found a couple
>> of undocumented formats that are required for interoperability but are
>> described in the spec only as "implementation-specific." We
>> have reverse-engineered these from Java's implementation to achieve
>> byte-compatibility. Any future implementation (PyIceberg, etc.) would need
>> to do the same.
>>
>> I'd like to propose that we specify the following in the spec, likely as
>> a new appendix or an expansion of the encryption section.
>>
>> 1. StandardKeyMetadata: the file-level key metadata format
>>
>> The `key_metadata` binary field (field 131 in data files, field 519 in
>> manifest lists) uses a versioned Avro encoding in Java's
>> `StandardKeyMetadata`:
>>
>> Wire format: `[version: 1 byte (0x01)] [Avro binary datum]`
>>
>> V1 schema:
>> ```
>> required(0, "encryption_key", binary) -- plaintext DEK
>> optional(1, "aad_prefix", binary)     -- per-file AAD prefix for AES-GCM
>> optional(2, "file_length", long)      -- encrypted file size (for streaming decryption)
>> ```
>>
>> 2. The encryption-keys list: KEKs vs wrapped DEKs
>>
>> The table-level `encryption-keys` list stores two kinds of entries,
>> distinguished by what `encrypted-by-id` points to:
>>
>> **KEK entries** (`encrypted-by-id` = table master key ID):
>> - `encrypted-key-metadata`: the KEK wrapped by the KMS (opaque,
>>   KMS-specific format)
>> - `properties`: includes `"key-timestamp"` (epoch millis) for expiration
>>
>> **Wrapped manifest-list DEK entries** (`encrypted-by-id` = a KEK's
>> key-id):
>> - `encrypted-key-metadata`: the `StandardKeyMetadata` Avro bytes (from #1
>>   above) encrypted with AES-GCM using the referenced KEK, with the KEK's
>>   timestamp string as AAD
>> - `properties`: empty
>>
>> The convention for distinguishing these two types of entries, and the
>> wrapping scheme (AES-GCM with the KEK timestamp as AAD to prevent
>> tampering), are not documented anywhere in the spec from what I can see.
>>
>> 3. What can stay "implementation-specific"
>>
>> The KEK's `encrypted-key-metadata` is intentionally opaque, it's whatever
>> the KMS returns from `wrapKey`. That's fine to leave unspecified since it's
>> between the implementation and its KMS provider.
>>
>> ### Why this matters
>>
>> Without specifying #1 and #2, "implementation-specific" becomes a
>> practical interop barrier: tables encrypted by one implementation would be
>> unreadable by another despite both being spec-compliant. These formats are
>> already versioned and frozen in Java - the spec would just be documenting
>> existing reality.
>>
>> Would there be interest in a PR for this? Happy to draft it.
>>
>> Thanks,
>> Xander

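The convention in item 2 of the quoted proposal, telling KEK entries apart from wrapped manifest-list DEK entries, could be sketched as below. This is only an illustration under assumptions: entries are modeled as plain dicts keyed by the JSON field names mentioned in the thread, `classify_encryption_keys` and `table_key_id` are hypothetical names, and encoding the timestamp string as UTF-8 for the AAD is a guess. The AES-GCM unwrap itself is left out because it needs a third-party crypto library.

```python
def classify_encryption_keys(entries, table_key_id):
    """Split `encryption-keys` entries into KEKs and wrapped DEKs.

    Returns (keks, wrapped), where `keks` maps key-id -> KEK entry and
    `wrapped` is a list of (dek_entry, kek_entry, aad_bytes) tuples.
    """
    # KEK entries are the ones encrypted by the table master key.
    keks = {
        e["key-id"]: e
        for e in entries
        if e["encrypted-by-id"] == table_key_id
    }

    wrapped = []
    for e in entries:
        if e["key-id"] in keks:
            continue
        kek = keks.get(e["encrypted-by-id"])
        if kek is None:
            raise ValueError(
                f"entry {e['key-id']!r} references unknown key "
                f"{e['encrypted-by-id']!r}"
            )
        # ASSUMPTION: the AAD is the KEK's "key-timestamp" string as UTF-8 bytes.
        aad = kek["properties"]["key-timestamp"].encode("utf-8")
        wrapped.append((e, kek, aad))
    return keks, wrapped


# Made-up example data; the key IDs and payloads are placeholders.
example_entries = [
    {
        "key-id": "kek-1",
        "encrypted-by-id": "master-key",
        "encrypted-key-metadata": b"<opaque KMS output>",
        "properties": {"key-timestamp": "1779300000000"},
    },
    {
        "key-id": "dek-1",
        "encrypted-by-id": "kek-1",
        "encrypted-key-metadata": b"<AES-GCM ciphertext>",
        "properties": {},
    },
]

keks, wrapped = classify_encryption_keys(example_entries, "master-key")
```

Each tuple in `wrapped` carries what a reader would need for the unwrap step: the ciphertext in the DEK entry, the KEK entry to obtain the key from the KMS, and the AAD bytes.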