Re: [PR] feat: Optimze CreateNamedStruct preserve dictionaries [datafusion-comet]

via GitHub Fri, 09 Aug 2024 12:07:14 -0700


andygrove commented on PR #789:
URL: https://github.com/apache/datafusion-comet/pull/789#issuecomment-2278573348


   I've been testing from the Spark shell with the following, after adding some 
debug logging to `ScanExec` and `CreateNamedStruct`:
   
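   ```scala
   // Register the TPC-DS item table (local path) as a temp view
   spark.read.parquet("/mnt/bigdata/tpcds/sf100/item.parquet").createTempView("item")
   // Wrap a single string column in a struct so that CreateNamedStruct is exercised
   val df = spark.sql("SELECT struct(i_product_name) as item_detail from item")
   // The noop sink executes the full query without writing any output
   spark.time(df.write.format("noop").mode("overwrite").save())
   ```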
   
   When I run this on the main branch, I see:
   
   ```
   creating ScanStream with schema Schema { fields: [Field { name: "col_0", 
data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, 
dict_is_ordered: false, metadata: {} }], metadata: {} }
   ScanStream::build_record_batch() with 8192 rows
   CreateNamedStruct unpacking dictionary
   ScanStream::build_record_batch() with 8192 rows
   CreateNamedStruct unpacking dictionary
   ...
   ScanStream::build_record_batch() with 8192 rows
   ScanExec casting from Utf8 to Dictionary(Int32, Utf8)
   CreateNamedStruct unpacking dictionary
   ScanStream::build_record_batch() with 8192 rows
   ScanExec casting from Utf8 to Dictionary(Int32, Utf8)
   ```
   
   It was interesting to learn that `ScanExec` processes some batches where the 
data is already dictionary-encoded and some batches where it is not. `ScanExec` 
then converts the latter to dictionary-encoded for consistency. In either case, 
`CreateNamedStruct` was unpacking those arrays immediately afterwards.
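
   To make the two log messages concrete, here is a toy model in plain Rust 
(standard library only). It is not the arrow-rs API that Comet actually uses: 
`DictColumn`, `encode`, and `unpack` are hypothetical names for illustration. 
`encode` mimics the `Utf8` to `Dictionary(Int32, Utf8)` cast and `unpack` mimics 
what `CreateNamedStruct` was doing right afterwards.

```rust
use std::collections::HashMap;

/// Toy stand-in for a Dictionary(Int32, Utf8) array (hypothetical type,
/// not the real arrow-rs DictionaryArray).
#[derive(Debug, PartialEq)]
struct DictColumn {
    keys: Vec<i32>,      // one index per row
    values: Vec<String>, // each distinct string stored once
}

/// Mimics the "casting from Utf8 to Dictionary(Int32, Utf8)" step.
fn encode(plain: &[&str]) -> DictColumn {
    let mut index: HashMap<&str, i32> = HashMap::new();
    let mut values = Vec::new();
    let mut keys = Vec::with_capacity(plain.len());
    for &s in plain {
        // Key that a new distinct value would receive.
        let next = values.len() as i32;
        let key = *index.entry(s).or_insert_with(|| {
            values.push(s.to_string());
            next
        });
        keys.push(key);
    }
    DictColumn { keys, values }
}

/// Mimics the "unpacking dictionary" step: one full string per row again.
fn unpack(col: &DictColumn) -> Vec<String> {
    col.keys.iter().map(|&k| col.values[k as usize].clone()).collect()
}

fn main() {
    // Made-up sample values, not taken from the TPC-DS data.
    let plain = ["widget", "gadget", "widget", "widget"];
    let dict = encode(&plain);
    println!("{:?}", dict);
    println!("{:?}", unpack(&dict));
}
```

   In this model, encoding a batch and then unpacking it immediately just 
rebuilds the plain column, which is the round trip the log output suggests.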
   
   In the current PR, we no longer unpack them when creating the struct, which 
avoids some compute cost and some extra memory use.
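
   As a rough illustration of the memory side, the sketch below compares the 
payload size of a dictionary-encoded string column with its unpacked form. It is 
a back-of-the-envelope model in plain Rust, not a measurement of Comet: the 
function names and sample values are made up, and offset and validity buffers 
are ignored.

```rust
/// Rough payload size of a dictionary-encoded string column:
/// 4 bytes per Int32 key plus the bytes of each distinct value.
fn dict_bytes(keys: &[i32], values: &[&str]) -> usize {
    let key_bytes = keys.len() * std::mem::size_of::<i32>();
    let value_bytes: usize = values.iter().map(|v| v.len()).sum();
    key_bytes + value_bytes
}

/// Rough payload size after unpacking: every row carries its full string.
fn unpacked_bytes(keys: &[i32], values: &[&str]) -> usize {
    keys.iter().map(|&k| values[k as usize].len()).sum()
}

fn main() {
    // Hypothetical batch: 8192 rows (the batch size seen in the logs)
    // cycling over three made-up product names.
    let values = ["ablepriantiought", "oughtationcally", "eseantibar"];
    let keys: Vec<i32> = (0..8192).map(|i| i % 3).collect();
    println!("dictionary: {} bytes", dict_bytes(&keys, &values));
    println!("unpacked:   {} bytes", unpacked_bytes(&keys, &values));
}
```

   The gap depends on the data: with few distinct, long strings the dictionary 
form is much smaller, while with short or mostly unique strings the two can be 
close, so this is only indicative.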
   
   @kazuyukitanimura I didn't run a microbenchmark to show the time saved, but 
I think this demonstrates that we are avoiding an unnecessary operation that 
would otherwise add some overhead (time and memory)?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

