rahil-c opened a new issue, #18820:
URL: https://github.com/apache/hudi/issues/18820

   **What happened:**
   Using `read_blob()` inside a WHERE predicate fails with:
   
   ```
   [INTERNAL_ERROR] Cannot generate code for expression: read_blob(...)
   ```
   
   Example query:
   ```sql
   SELECT id FROM t WHERE length(read_blob(image_bytes)) = 11;
   ```
   
   `read_blob()` works correctly in the SELECT list; only filter predicates 
trigger the codegen failure.
   
   **What you expected:**
   Two things:
   1. The codegen restriction should surface as an analyzer-level rejection 
with a clear "read_blob() is not supported in filter predicates" message, not 
an INTERNAL_ERROR with a Spark codegen stack trace.
   2. Docs (AI quick start) should call out the recommended workaround: for 
length-based filtering, filter on the BLOB struct's `.length` subfield from the 
meta columns (e.g. `WHERE image_bytes.length = 11`) rather than wrapping 
`read_blob()` in `length(...)`. Typical usage is vector search or filtering on 
structured columns; pulling raw bytes through codegen in a predicate is not a 
supported path.
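   
   A minimal sketch of the workaround in item 2, reusing the table `t` and 
column `image_bytes` from the example query. It assumes the `.length` subfield 
reports the same byte count that `length(read_blob(...))` would return; the 
alias `raw_bytes` is hypothetical, and the third query is an untested 
combination of the two behaviours described above:
   
   ```sql
   -- Fails on 1.2.0-rc2: read_blob() inside the filter predicate hits codegen.
   SELECT id FROM t WHERE length(read_blob(image_bytes)) = 11;

   -- Workaround: filter on the BLOB struct's length subfield instead.
   SELECT id FROM t WHERE image_bytes.length = 11;

   -- Assumption: read_blob() stays in the SELECT list, where it is reported
   -- to work, while the predicate only touches the length subfield.
   SELECT id, read_blob(image_bytes) AS raw_bytes
   FROM t
   WHERE image_bytes.length = 11;
   ```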
   
   **Steps to reproduce:**
   1. Use the 1.2.0-rc2 Spark bundle.
   2. Create a table with a BLOB column `image_bytes` and insert rows.
   3. Run: `SELECT id FROM t WHERE length(read_blob(image_bytes)) = 11`.
   4. Observe INTERNAL_ERROR.
   
   **Environment:**
   - Hudi version: 1.2.0-rc2
   - Query engine: Spark 3.5
   - Found during: 1.2.0-rc2 RC voting testing
   
   Filed as a follow-up per discussion in the 1.2.0-rc2 voting thread; this is a 
non-blocker for the release. A separate docs PR will cover the length-filter 
workaround.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
