andygrove opened a new issue, #3183:
URL: https://github.com/apache/datafusion-comet/issues/3183

   ## What is the problem the feature request solves?
   
   > **Note:** This issue was generated with AI assistance. The specification 
details have been extracted from Spark documentation and may need verification.
   
   Comet does not currently support the Spark `encode` function, causing 
queries using this function to fall back to Spark's JVM execution instead of 
running natively on DataFusion.
   
   The `Encode` expression converts a string to binary data using a specified 
character encoding. In Spark it is a runtime-replaceable expression that delegates 
to a static method for the actual encoding, with internal flags controlling legacy 
charset and error-handling behavior.
   
   Supporting this expression would allow more Spark workloads to benefit from 
Comet's native acceleration.
   
   ## Describe the potential solution
   
   ### Spark Specification
   
   **Syntax:**
   ```sql
   ENCODE(string_expr, charset_expr)
   ```
   
   **Arguments:**
   | Argument | Type | Description |
   |----------|------|-------------|
   | str | String | The input string expression to be encoded |
   | charset | String | The character set/encoding name to use for conversion |
   | legacyCharsets | Boolean | Internal flag for legacy Java charset handling behavior |
   | legacyErrorAction | Boolean | Internal flag for legacy coding-error action behavior |
   
   Note that only `str` and `charset` are user-facing SQL arguments; the two legacy 
flags are internal constructor parameters populated from SQLConf legacy settings.
   
   **Return Type:** BinaryType - Returns binary data representing the encoded string.
   
   **Supported Data Types:**
   - Input string: StringTypeWithCollation (supports trim collation)
   - Input charset: StringTypeWithCollation (supports trim collation)
   - Both string inputs support collation-aware string types
   
   **Edge Cases:**
   - Null input string or charset results in null output
   - Invalid charset names may throw runtime exceptions
   - Empty string input produces empty binary output
   - The SQLConf legacy settings for charsets and coding-error action change which 
charsets are accepted and how malformed input is handled
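
   The null and empty-string cases are easy to confirm; a minimal sketch, assuming 
a `spark-shell` session with implicits available:
   
   ```scala
   // Sketch illustrating the null and empty-string edge cases listed above.
   import org.apache.spark.sql.functions._
   import spark.implicits._

   val df = Seq(Some("hello"), Some(""), None).toDF("s")
   df.select(col("s"), encode(col("s"), "UTF-8").as("encoded")).show()
   // Expected: "hello" -> its UTF-8 bytes, "" -> empty binary, null -> null
   ```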
   
   **Examples:**
   ```sql
   -- Encode string using UTF-8
   SELECT ENCODE('hello world', 'UTF-8');
   
   -- Encode with different charset
   SELECT ENCODE('café', 'ISO-8859-1');
   ```
   
   ```scala
   // DataFrame API usage
   import org.apache.spark.sql.functions._

   df.select(expr("ENCODE(name, 'UTF-8')").as("encoded_name"))

   // Using the typed column function (the charset argument is a plain String)
   df.select(encode(col("text_column"), "UTF-16"))
   ```
   
   ### Implementation Approach
   
   See the [Comet guide on adding new expressions](https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html) for detailed instructions.
   
   1. **Scala Serde**: Add an expression handler in `spark/src/main/scala/org/apache/comet/serde/` (see the sketch after this list)
   2. **Register**: Add to the appropriate map in `QueryPlanSerde.scala`
   3. **Protobuf**: Add a message type in `native/proto/src/proto/expr.proto` if needed
   4. **Rust**: Implement in `native/spark-expr/src/` (check if DataFusion has built-in support first)
   
   
   ## Additional context
   
   **Difficulty:** Medium
   **Spark Expression Class:** `org.apache.spark.sql.catalyst.expressions.Encode`
   
   **Related:**
   - Decode - inverse operation to convert binary back to string
   - Cast expressions for type conversions
   - String manipulation functions in string_funcs group
   
   ---
   *This issue was auto-generated from Spark reference documentation.*
   

