JeelRajodiya commented on PR #21331: URL: https://github.com/apache/datafusion/pull/21331#issuecomment-4275897689
Hey @andygrove, I realized that I shouldn't be using the `enable_ansi_mode` flag inside the encode function: Spark's definition does not tie `encode` to ANSI mode. Moreover, we should target Spark 3.5, which is more permissive and doesn't return errors when unmappable characters are passed; it simply replaces them with `?`. I've added a TODO in the doc comment pointing at the two real Spark 4.1 configs so a follow-up PR can wire them up properly.

**Below are the references to the Spark definitions**

Spark 3.5's [`Encode.scala`](https://github.com/apache/spark/blob/2a56312aeb1665b72c608e14926f5d69fd3a17bc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L2698-L2741):

```scala
protected override def nullSafeEval(input1: Any, input2: Any): Any = {
  input1.asInstanceOf[UTF8String].toString.getBytes(toCharset)
}
```

This just calls Java's `String.getBytes`, which replaces unmappable chars with the charset's default replacement byte (`?`). No `legacyErrorAction`, no config, no exception.

Spark 4.1's [`Encode.scala`](https://github.com/apache/spark/blob/acfae3372874631728243ba13728f6abbf7ee07b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L3170-L3228) added two new configs for the strict behavior:

```scala
// Abridged from the linked source; the types of str and charset are omitted.
case class Encode(str, charset, legacyCharsets: Boolean, legacyErrorAction: Boolean)

def this(value, charset) =
  this(value, charset, SQLConf.get.legacyJavaCharsets, SQLConf.get.legacyCodingErrorAction)
```

> Setting legacyErrorAction=true restores the Spark 3.5 `?` behavior.

The `spark.sql.legacy.javaCharsets` and `spark.sql.legacy.codingErrorAction` configs are supported in Spark 4.1 and can be left for a future PR. Currently the PR targets Spark 3.5; I've mentioned this in the doc comment as well.

Let me know if we need to iterate on this further.
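For reference, the difference between the permissive and strict behaviors discussed above can be reproduced with plain JDK calls. This is only a minimal sketch: the class and method names (`EncodeDemo`, `lenientEncode`, `strictEncode`) are hypothetical, and the strict path uses a `CharsetEncoder` with `CodingErrorAction.REPORT` as an assumed stand-in for "error on unmappable input", not Spark's actual 4.1 implementation.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodeDemo {
    // Permissive path: String.getBytes substitutes the charset's
    // replacement bytes for characters it cannot map.
    static byte[] lenientEncode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    // Strict path (illustrative assumption): an encoder configured with
    // REPORT throws instead of substituting a replacement byte.
    static byte[] strictEncode(String s, Charset cs) throws CharacterCodingException {
        ByteBuffer buf = cs.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        // The euro sign has no US-ASCII mapping, so it becomes '?' (byte 63).
        byte[] lenient = lenientEncode("h\u20AC", StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(lenient)); // [104, 63]

        try {
            strictEncode("h\u20AC", StandardCharsets.US_ASCII);
            System.out.println("encoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("strict encode failed: " + e);
        }
    }
}
```

With US-ASCII, the lenient call returns the bytes for `h?`, while the strict call raises a `CharacterCodingException`. The first is the Spark 3.5-style result this PR targets.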
