Dylan He created FLINK-36267:
--------------------------------

             Summary: SPLIT doesn't support SMP characters if delimiter is empty
                 Key: FLINK-36267
                 URL: https://issues.apache.org/jira/browse/FLINK-36267
             Project: Flink
          Issue Type: Bug
          Components: Table SQL / API
            Reporter: Dylan He

In Flink:
{code:sql}
> SELECT SPLIT('123😊笑脸', '');
["1", "2", "3", "?", "?", "笑", "脸"]
> SELECT SPLIT('123😊笑脸', '😊');
["123", "笑脸"]
> SELECT SPLIT('123😊笑脸', '3');
["12", "😊笑脸"]
{code}
While in Spark:
{code:sql}
> SELECT SPLIT('123😊笑脸', '');
["1", "2", "3", "😊", "笑", "脸"]
{code}

I think this may be a bug, but I'm not sure of the best way to solve it. Here are two ideas:
# Keep the code that handles an empty delimiter separate from the normal cases that use {{BinaryStringDataUtil#splitByWholeSeparatorPreserveAllTokens()}}, as it used to be.
# Modify {{BinaryStringDataUtil#splitByWholeSeparatorPreserveAllTokens()}} to align with the SPLIT semantics, so that it separates every character when the delimiter is empty. I haven't seen this method used elsewhere, so this should be practical.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
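
For illustration only, here is a minimal standalone Java sketch of the difference between the two results for an empty delimiter. It does not use Flink code. The class {{CodePointSplit}} and both methods are hypothetical names, and the guess that the two "?" entries come from splitting a surrogate pair into separate UTF-16 code units is an assumption, not something confirmed in this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink: contrasts per-char and per-code-point splitting.
final class CodePointSplit {

    /** Splits per UTF-16 code unit, so a surrogate pair is cut into two halves. */
    static List<String> splitByChar(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(String.valueOf(s.charAt(i)));
        }
        return out;
    }

    /** Splits per Unicode code point, so a surrogate pair stays in one element. */
    static List<String> splitByCodePoint(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int codePoint = s.codePointAt(i);
            int width = Character.charCount(codePoint); // 1 for BMP, 2 for supplementary
            out.add(s.substring(i, i + width));
            i += width;
        }
        return out;
    }

    public static void main(String[] args) {
        // "123" + U+1F60A (smiling face) + U+7B11 U+8138, written with escapes
        String input = "123\uD83D\uDE0A\u7B11\u8138";
        System.out.println(splitByChar(input).size());      // prints 7
        System.out.println(splitByCodePoint(input).size()); // prints 6
    }
}
```

Under this assumption, either idea above would need to advance by code point, not by code unit, when the delimiter is empty, so that the result matches the six-element Spark output.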