Hi Gustavo,

Thank you for this excellent design document, and thanks for discovering this data loss and driving the investigation. We should definitely fix this shortcoming. Looking at other vendors as well, this behavior is definitely a cause of false assumptions that lead to hard-to-debug inconsistencies.

+1 for this proposal.

Cheers,
Timo


On 19.03.26 15:23, Gustavo de Morais wrote:
Hi everyone,

Currently, CAST(bytes AS STRING) silently replaces any invalid UTF-8 byte
with U+FFFD (the Unicode replacement character). The substitution is
irreversible and produces no warning: the pipeline keeps running while
data is permanently corrupted downstream. It also means that a round trip
BYTES → STRING → BYTES is not lossless, which prevents the engine from
applying certain optimizations, for example preserving upsert keys after
such CASTs.
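
The effect can be reproduced outside of Flink. The following is a minimal
sketch in plain Python, not Flink code: it assumes that lenient decoding
with errors="replace" is a fair stand-in for the current CAST behavior and
that strict decoding stands in for the proposed default. The helper name
strict_decode is hypothetical and only used for illustration.

```python
# Illustration only: plain Python standing in for the CAST semantics.
raw_a = b"key-\xff-1"  # 0xFF can never appear in valid UTF-8
raw_b = b"key-\xfe-1"  # neither can 0xFE

# Lenient decoding: each invalid byte becomes U+FFFD, with no error raised.
lenient_a = raw_a.decode("utf-8", errors="replace")
lenient_b = raw_b.decode("utf-8", errors="replace")

# Encoding back yields the UTF-8 form of U+FFFD (EF BF BD), not the input.
round_trip_a = lenient_a.encode("utf-8")

# Two distinct byte keys now map to the same string key.
keys_collide = raw_a != raw_b and lenient_a == lenient_b


def strict_decode(data: bytes) -> str:
    """Hypothetical strict variant: fail instead of substituting."""
    return data.decode("utf-8")  # raises UnicodeDecodeError on invalid input
```

In this sketch the round trip changes the bytes and two different inputs
collapse into one string, which is the kind of situation in which a key
would no longer be preserved across the CAST.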

I'd like to start a discussion around defining and improving the default
behavior. I've written a short FLIP [1] proposing new utility functions to
handle this explicitly - similar to what other engines like Spark already
do - and changing the default behavior to throw an error instead of
silently corrupting data, while giving users clear options to deal with
invalid bytes.

Looking forward to your feedback and thoughts.

Kind regards,
Gustavo

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-568%3A+Strict+BYTES-to-STRING+CAST+with+UTF-8+Validation+Utilities

