rahil-c opened a new issue, #18819:
URL: https://github.com/apache/hudi/issues/18819
**What happened:**
`CREATE TABLE ... TBLPROPERTIES (primaryKey = '<blob_col>')` and a subsequent
`INSERT` both succeed silently when the key column is of BLOB type. The
resulting `_hoodie_record_key` is the JSON-stringified BLOB struct, e.g.
`{"type":"INLINE","data":"hello-0","reference":null}`.
BLOB is raw binary bytes (images, video, embeddings, or EXTERNAL references
to such payloads). It is not a valid record-key type semantically:
- For INLINE BLOBs, the key is the entire byte payload. For real-world
blobs (MB-sized images/video/embeddings) the key grows in proportion to the
payload, inflating shuffle bytes and metadata index storage (record index,
secondary index, bloom filters).
- For EXTERNAL BLOBs, the key is derived from the storage path, so record
identity tracks path rather than content: moving or re-uploading the same blob
yields a different key.
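
To make both problems concrete, here is a minimal Python sketch that mimics the stringified key shape shown above using plain `json`. It does not call Hudi. The helper names are hypothetical, and the layout of the `reference` struct for an EXTERNAL BLOB (a single `path` field) is an assumption made only for illustration, as are the example paths.

```python
import json


def stringified_key(blob: dict) -> str:
    # Compact JSON, mimicking the key shape reported in this issue.
    return json.dumps(blob, separators=(",", ":"))


def inline_blob(data: str) -> dict:
    # Field names follow the example key in this issue.
    return {"type": "INLINE", "data": data, "reference": None}


def external_blob(path: str) -> dict:
    # ASSUMPTION: the real reference struct may carry different fields.
    return {"type": "EXTERNAL", "data": None, "reference": {"path": path}}


# INLINE: the key length grows with the payload length.
small_key = stringified_key(inline_blob("hello-0"))
large_key = stringified_key(inline_blob("x" * 1_000_000))

# EXTERNAL: the same content under a new path produces a different key.
key_before_move = stringified_key(external_blob("s3://bucket/a/cat.png"))
key_after_move = stringified_key(external_blob("s3://bucket/b/cat.png"))

print(len(small_key), len(large_key), key_before_move == key_after_move)
```

Under these assumptions, a 1 MB payload produces a key of more than a million characters, and relocating an external blob changes the record's identity.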
**What you expected:**
Hudi should reject BLOB-typed columns as the record key, the same way other
unsupported key types are rejected. This is a type-level restriction — BLOB is
never an appropriate key, regardless of payload size or storage mode. The
rejection must cover both write paths:
- Spark DDL: `CREATE TABLE ... TBLPROPERTIES (primaryKey = '<blob_col>')`
- Spark DataSource writes:
`.option("hoodie.datasource.write.recordkey.field", "<blob_col>")`
Both should fail fast with a clear error message identifying the BLOB column
and the unsupported-type reason.
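
The expected fail-fast behaviour could look roughly like the following Python sketch. It is not Hudi code: `validate_record_key_fields`, `UnsupportedRecordKeyTypeError`, and the schema-as-dict representation are hypothetical names chosen for illustration, and the error wording is only a suggestion. The sketch treats the configured key as a comma-separated list of field names so that multi-field keys are covered as well.

```python
# Hypothetical set of types that may never be used as a record key.
UNSUPPORTED_KEY_TYPES = frozenset({"BLOB"})


class UnsupportedRecordKeyTypeError(ValueError):
    """Raised when a record key field has an unsupported type."""


def validate_record_key_fields(schema: dict, record_key_fields: str) -> None:
    # schema maps column name -> type name, e.g. {"id": "BLOB"}.
    for field in record_key_fields.split(","):
        field = field.strip()
        # Unknown columns are skipped here; this sketch only checks types.
        col_type = schema.get(field, "").upper()
        if col_type in UNSUPPORTED_KEY_TYPES:
            raise UnsupportedRecordKeyTypeError(
                f"Column '{field}' of type {col_type} cannot be used as a "
                f"record key: {col_type} is not a supported record-key type."
            )


schema = {"id": "BLOB", "label": "STRING"}

validate_record_key_fields(schema, "label")  # passes silently

try:
    validate_record_key_fields(schema, "id")
except UnsupportedRecordKeyTypeError as err:
    print(err)
```

Running the same check from both the DDL path and the DataSource path would give the two entry points identical behaviour and the same error message.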
**Steps to reproduce:**
1. Use the 1.2.0-rc2 Spark bundle.
2. Either:
   a. DDL path: `CREATE TABLE t (id BLOB, label STRING) USING hudi TBLPROPERTIES (primaryKey = 'id')`
   b. DataSource path: `df.write.format("hudi").option("hoodie.datasource.write.recordkey.field", "id").save(...)` with `id` of BLOB type.
3. INSERT / write a row with an INLINE BLOB value.
4. `SELECT _hoodie_record_key FROM t` → the key is the JSON-serialized struct.
**Environment:**
- Hudi version: 1.2.0-rc2
- Query engine: Spark 3.5
- Found during: 1.2.0-rc2 RC voting testing — non-blocker, follow-up after
1.2.0 ships.