rahil-c opened a new issue, #18819:
URL: https://github.com/apache/hudi/issues/18819

   **What happened:**
   `CREATE TABLE ... TBLPROPERTIES (primaryKey = '<blob_col>')` and a 
subsequent INSERT both succeed silently when the key column is of BLOB type. 
The resulting `_hoodie_record_key` is the JSON-stringified BLOB struct, e.g. 
`{"type":"INLINE","data":"hello-0","reference":null}`.
   
   BLOB is raw binary bytes (images, video, embeddings, or EXTERNAL references 
to such payloads). It is not a valid record-key type semantically:
   - For INLINE BLOBs, the key is the entire byte payload. For real-world 
blobs (MB-sized images, video, embeddings) the key grows in proportion to the 
payload, inflating shuffle bytes and metadata index storage (record index, 
secondary index, bloom).
   - For EXTERNAL BLOBs, the key is derived from the storage path, so record 
identity tracks the path rather than the content: moving or re-uploading the 
same blob yields a different key.
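
   The two problems above can be illustrated with a small standalone sketch. 
It is not Hudi code: it only mimics the observed behaviour by JSON-serializing 
a dict shaped like the struct in the example key. The helper name 
`record_key_from_blob` and the `path` field inside `reference` are 
hypothetical; the real layout of an EXTERNAL reference may differ.

```python
import json


def record_key_from_blob(blob: dict) -> str:
    """Mimic the reported behaviour: the key is the JSON-stringified struct.

    Hypothetical helper for illustration only, not a Hudi API.
    """
    # Compact separators reproduce the spacing of the key shown in the report.
    return json.dumps(blob, separators=(",", ":"))


# INLINE blob from the report: the whole payload is embedded in the key.
small = {"type": "INLINE", "data": "hello-0", "reference": None}
small_key = record_key_from_blob(small)
print(small_key)  # {"type":"INLINE","data":"hello-0","reference":null}

# A 1 MiB payload produces a key that is itself larger than 1 MiB.
large = {"type": "INLINE", "data": "x" * (1024 * 1024), "reference": None}
large_key = record_key_from_blob(large)
print(len(large_key) > 1024 * 1024)  # True

# EXTERNAL blobs: the "path" field name is an assumption. The same content
# stored under two different paths yields two different keys.
ext_a = {"type": "EXTERNAL", "data": None,
         "reference": {"path": "s3://bucket/a/img.png"}}
ext_b = {"type": "EXTERNAL", "data": None,
         "reference": {"path": "s3://bucket/b/img.png"}}
print(record_key_from_blob(ext_a) != record_key_from_blob(ext_b))  # True
```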
   
   **What you expected:**
   Hudi should reject BLOB-typed columns as the record key, the same way other 
unsupported key types are rejected. This is a type-level restriction — BLOB is 
never an appropriate key, regardless of payload size or storage mode. The 
rejection must cover both write paths:
   
   - Spark DDL: `CREATE TABLE ... TBLPROPERTIES (primaryKey = '<blob_col>')`
   - Spark DataSource writes: 
`.option("hoodie.datasource.write.recordkey.field", "<blob_col>")`
   
   Both should fail fast with a clear error message identifying the BLOB column 
and the unsupported-type reason.
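
   A minimal sketch of the kind of fail-fast check described above, written 
as plain Python so it runs without Spark or Hudi. All names here 
(`UNSUPPORTED_KEY_TYPES`, `UnsupportedRecordKeyTypeError`, 
`validate_record_key_fields`, `parse_key_fields`) are hypothetical and do not 
correspond to existing Hudi classes. The schema is modelled as a simple 
column-name to type-name mapping, which is an assumption made for illustration.

```python
from typing import List, Mapping, Sequence

# Hypothetical set of column types that may not be used as record keys.
UNSUPPORTED_KEY_TYPES = {"BLOB"}


class UnsupportedRecordKeyTypeError(ValueError):
    """Raised when a configured record key column has an unsupported type."""


def parse_key_fields(option_value: str) -> List[str]:
    """Split a comma-separated record key setting into field names."""
    return [name.strip() for name in option_value.split(",") if name.strip()]


def validate_record_key_fields(schema: Mapping[str, str],
                               key_fields: Sequence[str]) -> None:
    """Reject key fields whose column type is unsupported.

    `schema` maps column name to an upper- or lower-case type name.
    Returns None when every key field is acceptable.
    """
    for field in key_fields:
        if field not in schema:
            raise KeyError(f"Record key field '{field}' is not in the schema")
        col_type = schema[field].upper()
        if col_type in UNSUPPORTED_KEY_TYPES:
            # Name both the column and the reason, as the report requests.
            raise UnsupportedRecordKeyTypeError(
                f"Column '{field}' of type {col_type} cannot be used as a "
                f"record key: {col_type} is not a supported record key type."
            )


schema = {"id": "BLOB", "label": "STRING"}
try:
    validate_record_key_fields(schema, parse_key_fields("id"))
except UnsupportedRecordKeyTypeError as err:
    print(err)

# A non-BLOB key column passes without raising.
validate_record_key_fields(schema, parse_key_fields("label"))
```

   Under this sketch, both write paths would resolve the configured key 
(`primaryKey` or `hoodie.datasource.write.recordkey.field`) to a list of field 
names and run the same check before any data is written.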
   
   **Steps to reproduce:**
   1. Use the Hudi 1.2.0-rc2 Spark bundle.
   2. Either:
      a. DDL path: `CREATE TABLE t (id BLOB, label STRING) USING hudi 
TBLPROPERTIES (primaryKey = 'id')`
      b. DataSource path: 
`df.write.format("hudi").option("hoodie.datasource.write.recordkey.field", 
"id").save(...)` with `id` of BLOB type.
   3. INSERT / write a row with an INLINE BLOB value.
   4. `SELECT _hoodie_record_key FROM t` → key is the JSON-serialized struct.
   
   **Environment:**
   - Hudi version: 1.2.0-rc2
   - Query engine: Spark 3.5
   - Found during: 1.2.0-rc2 release-candidate vote testing. Non-blocker; 
can be handled as a follow-up after 1.2.0 ships.

