rishvin commented on issue #1820:
URL: 
https://github.com/apache/datafusion-comet/issues/1820#issuecomment-2953616585

   Hi @andygrove, I tried the following approach and it looks like there is a discrepancy between DataFusion's `SparkSha2` output and Spark's.
   
   **This is what I attempted**
   
   Removed the existing sha(224|256|384|512) functions and other related code from [comet_scalar_funcs.rs](https://github.com/rishvin/datafusion-comet/blob/1b1d6185ede9175887de1e9ec7f48422c2a64a10/native/spark-expr/src/comet_scalar_funcs.rs#L120).
   
   Registered the `SparkSha2` UDF 
[here](https://github.com/rishvin/datafusion-comet/blob/1b1d6185ede9175887de1e9ec7f48422c2a64a10/native/core/src/execution/planner.rs#L160).
   
   Modified the serialization logic on the Spark side to roughly the following:
   ```
   case Sha2(left, numBits) =>
     if (!numBits.foldable) {
       withInfo(expr, "non-literal numBits is not supported")
       return None
     }
     val childExpr = exprToProtoInternal(left, inputs, binding)
     val bitsExpr = exprToProtoInternal(numBits, inputs, binding)
     scalarFunctionExprToProtoWithReturnType("sha2", StringType, childExpr, bitsExpr)
   ```
   
   
   With these changes we hit the following exception: `Unsupported argument types for sha2 function`.
   This is because `SparkSha2`'s pattern matching expects the `numBits` argument to be of type UInt32, but we are sending Int32 from Spark (Ref: [here](https://github.com/apache/datafusion/blob/1daa5ed5cc51546904d45e23cc148601d973942a/datafusion/spark/src/function/hash/sha2.rs#L141)). I checked the supported types we define for Spark [here](https://github.com/rishvin/datafusion-comet/blob/1b1d6185ede9175887de1e9ec7f48422c2a64a10/spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala#L90), but it looks like Spark doesn't support UInt32.
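   One way to bridge the type mismatch is to normalize whatever integer width arrives into a single validated bit width before dispatching. This is only a sketch under my assumptions, not a patch to `SparkSha2` itself, and the helper name `normalize_num_bits` is hypothetical; it also encodes Spark's rule that `sha2(expr, 0)` means SHA-256:

   ```rust
   // Hypothetical helper: take the numBits literal as a plain i64
   // (wide enough to cover values arriving as either Int32 or UInt32)
   // and normalize it to a concrete SHA-2 bit width.
   fn normalize_num_bits(bits: i64) -> Option<u32> {
       match bits {
           // Spark treats sha2(expr, 0) as SHA-256
           0 => Some(256),
           224 | 256 | 384 | 512 => Some(bits as u32),
           // Unsupported widths produce NULL in Spark
           _ => None,
       }
   }

   fn main() {
       assert_eq!(normalize_num_bits(0), Some(256));
       assert_eq!(normalize_num_bits(224), Some(224));
       assert_eq!(normalize_num_bits(123), None);
       println!("ok");
   }
   ```

   With something like this, the match arm only has to extract the literal value once, regardless of which Arrow integer type carried it.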
   
   To test things further, I duplicated `SparkSha2` into Comet and added Int32 to the pattern matching. This time a result was returned successfully, but it did not match the expected Spark result: the hash returned by `SparkSha2` is uppercase, whereas Spark returns SHA-2 digests in lowercase. Here is an example of the two:
   Comet returned - `2C83E9E8A39D60F7FCD3CFEC29C154260AA069F91CD40C972756F9354C64594E`
   Spark returned - `2c83e9e8a39d60f7fcd3cfec29c154260aa069f91cd40c972756f9354c64594e`
   
   This happens because the [hex_encode](https://github.com/apache/datafusion/blob/1daa5ed5cc51546904d45e23cc148601d973942a/datafusion/spark/src/function/math/hex.rs#L163) method used by `SparkSha2` sets `lowercase=false` (the second parameter), whereas the existing Comet implementation sets `lowercase=true`, [here](https://github.com/rishvin/datafusion-comet/blob/1b1d6185ede9175887de1e9ec7f48422c2a64a10/native/spark-expr/src/math_funcs/hex.rs#L56). Hence the existing Comet implementation produces results that match Spark, while `SparkSha2` does not.
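   The case difference is easy to reproduce in isolation. Below is an illustrative stand-in for a `hex_encode(data, lowercase)`-style helper (the signature mirrors the two-parameter shape described above but is not DataFusion's actual code), showing how the boolean flips the output between Spark-compatible lowercase and the uppercase form `SparkSha2` currently emits:

   ```rust
   // Illustrative stand-in for a hex_encode(data, lowercase) helper:
   // the second parameter decides whether digest bytes render as
   // lowercase (Spark-compatible) or uppercase hex.
   fn hex_encode(bytes: &[u8], lowercase: bool) -> String {
       bytes
           .iter()
           .map(|b| {
               if lowercase {
                   format!("{:02x}", b)
               } else {
                   format!("{:02X}", b)
               }
           })
           .collect()
   }

   fn main() {
       // First four bytes of the example digest above
       let digest = [0x2c, 0x83, 0xe9, 0xe8];
       assert_eq!(hex_encode(&digest, true), "2c83e9e8");  // Spark's output
       assert_eq!(hex_encode(&digest, false), "2C83E9E8"); // SparkSha2's output
       println!("ok");
   }
   ```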
   
   To make it work we might have to make the following changes to DataFusion's `SparkSha2`:
   - Add Int32 to the supported `numBits` types in the pattern matching.
   - Return hashes in lowercase.
   
   Alternatively, if we are not supposed to make these changes in DataFusion, we might have to create a wrapper over `SparkSha2` in Comet.
   
   Let me know your thoughts on this, or whether I'm missing anything.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

