mbutrovich opened a new pull request, #1142:
URL: https://github.com/apache/datafusion-comet/pull/1142

   The current logic takes the data schema and the required schema from the Java side (in the scan node) and:
   1. Converts them back to a Parquet schema
   2. Serializes that to the native side
   3. Parses it into a schema descriptor
   4. Converts the descriptor to an Arrow schema
   
   This process introduces conversion errors that are difficult to recover from (e.g., Timestamp(milli) -> INT96 -> Timestamp(nano)). This PR simplifies the schema serialization and conversion to the native side, building on what @viirya did with the partition schema (thank you for the inspiration!).
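   
   To illustrate the failure mode, here is a minimal sketch (not Comet's actual code path) using the `parquet` crate's schema conversion. Once a millisecond timestamp has been written down as INT96, the physical type carries no unit information, so parsing it back can only guess:
   
   ```rust
   use std::sync::Arc;
   
   use parquet::arrow::parquet_to_arrow_schema;
   use parquet::schema::parser::parse_message_type;
   use parquet::schema::types::SchemaDescriptor;
   
   fn main() {
       // A Spark Timestamp(milli) column that was converted to INT96 keeps no
       // record of its original time unit in the Parquet physical type.
       let message = "message spark_schema { required int96 ts; }";
       let descriptor =
           SchemaDescriptor::new(Arc::new(parse_message_type(message).unwrap()));
   
       // Steps 3 and 4 of the old pipeline: parse to a schema descriptor, then
       // convert to an Arrow schema. INT96 comes back as Timestamp(Nanosecond),
       // regardless of the unit the writer started with.
       let arrow_schema = parquet_to_arrow_schema(&descriptor, None).unwrap();
       println!("{:?}", arrow_schema.field(0).data_type()); // Timestamp(Nanosecond, None)
   }
   ```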
   
   In this PR, the data schema and required schema are now serialized as Spark types. On the native side they are converted to Arrow types. We also now serialize more schema info (column names, nullability) than we did for just the partition schema.
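   
   Conceptually, the native side now does a direct per-field Spark-type -> Arrow-type mapping. The sketch below is illustrative only: `SparkField` and `SparkType` are hypothetical stand-ins for the serialized messages this PR actually sends, but it shows how column names, nullability, and the exact time unit survive the trip with no Parquet physical type in the middle:
   
   ```rust
   use arrow::datatypes::{DataType, Field, TimeUnit};
   
   // Hypothetical stand-ins for the serialized Spark schema messages; the real
   // messages cover many more primitive and nested types.
   enum SparkType {
       Integer,
       Long,
       TimestampMillis,
       String,
   }
   
   struct SparkField {
       name: String,
       data_type: SparkType,
       nullable: bool,
   }
   
   // Direct Spark-type -> Arrow-type mapping: Timestamp(milli) stays
   // millisecond-precision end to end, and name/nullability come along.
   fn to_arrow_field(f: &SparkField) -> Field {
       let dt = match f.data_type {
           SparkType::Integer => DataType::Int32,
           SparkType::Long => DataType::Int64,
           SparkType::TimestampMillis => DataType::Timestamp(TimeUnit::Millisecond, None),
           SparkType::String => DataType::Utf8,
       };
       Field::new(&f.name, dt, f.nullable)
   }
   ```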

