eejbyfeldt commented on issue #841:
URL: 
https://github.com/apache/datafusion-comet/issues/841#issuecomment-2298133485

   > Letting DataFusion infer the return type instead of specifying it results 
in
   > 
   > ```
   > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 
(TID 11) (10.10.0.29 executor driver): java.lang.NullPointerException: Cannot 
invoke 
"org.apache.comet.shaded.arrow.vector.dictionary.DictionaryProvider.lookup(long)"
 because "dictionaryProvider" is null
   > ```
   
   I believe this would be resolved by forwarding the `dictionaryProvider` into 
the `CometListVector` similarly to what was done with the `CometMapVector` and 
`CometStructVector` in this PR: 
https://github.com/apache/datafusion-comet/pull/789/files
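
To illustrate why the missing provider surfaces as a failure only when the list's child is dictionary encoded, here is a toy model in Python. The class names and constructor arguments are hypothetical stand-ins, not Comet's or Arrow's actual Java classes; the only point is that the wrapper has to pass the provider down to its child.

```python
# Toy stand-ins only; these are NOT Comet's or Arrow's real classes.

class DictionaryProvider:
    def __init__(self, dictionaries):
        self._dictionaries = dictionaries  # dictionary id -> list of values

    def lookup(self, dict_id):
        return self._dictionaries[dict_id]


class DictVector:
    """Dictionary-encoded column: stores indices, resolves values on read."""

    def __init__(self, indices, dict_id, provider):
        self.indices = indices
        self.dict_id = dict_id
        self.provider = provider

    def get(self, i):
        if self.provider is None:
            # Mirrors the NullPointerException in the report above.
            raise RuntimeError("dictionaryProvider is null")
        return self.provider.lookup(self.dict_id)[self.indices[i]]


class ListVector:
    """List column whose child happens to be dictionary encoded."""

    def __init__(self, offsets, child_indices, dict_id, provider=None):
        self.offsets = offsets
        # Forwarding `provider` to the child is the whole fix in this sketch.
        self.child = DictVector(child_indices, dict_id, provider)

    def get(self, row):
        start, end = self.offsets[row], self.offsets[row + 1]
        return [self.child.get(i) for i in range(start, end)]
```

Constructing `ListVector` without a provider reproduces the failure on the first read; passing one through lets the child resolve its indices.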
   
   
   > Not sure what the expected behavior or fix is. Either implement this 
function from scratch with better dictionary handling, or add some wrapper 
around invoking the UDF to flatten dictionary encoded arrays
   
   To me this seems like a bug in the upstream DataFusion implementation. Would 
it not be better to address the bug there, e.g. by making that implementation 
behave correctly for dictionary-encoded types?
   
   
   > Does it make sense to make an Expression equivalent of `CopyExec`? Then 
for things like DataFusion scalar UDFs used directly you could just wrap it in a 
`CopyArrays` expression that just unpacks dictionaries if they exist? Then you 
wouldn't need custom Rust code for handling it, can just be done in the Scala 
query serde. Downside is if you have multiple expressions in a single project 
you might end up unpacking the same thing multiple times.
   
   Seems like that would cause an unnecessary copy in the case of 
`make_array`. To me it seems like we should just be unpacking the dictionary 
as part of writing the data into the new array data structure. Other 
expressions might also be able to do other optimizations by keeping data in the 
dictionary format, for example 
https://github.com/apache/datafusion-comet/issues/504
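
A minimal sketch of what "unpacking as part of the write" could look like, in plain Python rather than the real Arrow or DataFusion APIs. `DictColumn`, `value_at`, and this `make_array` are hypothetical names for illustration; the assumption is that a list array is built as offsets plus a flat child buffer, and that dictionary keys are resolved one value at a time while filling that buffer, so no fully decoded copy of the input column is created first.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class DictColumn:
    """Hypothetical dictionary-encoded column: distinct values plus keys."""
    values: List[str]
    keys: List[Optional[int]]  # None marks a null row


Column = Union[DictColumn, List[Optional[str]]]


def value_at(col: Column, row: int) -> Optional[str]:
    """Read one logical value, resolving the dictionary key if needed."""
    if isinstance(col, DictColumn):
        key = col.keys[row]
        return None if key is None else col.values[key]
    return col[row]


def make_array(columns: List[Column], num_rows: int) -> Tuple[List[int], List[Optional[str]]]:
    """Build a list column as (offsets, flat child values).

    Dictionary keys are resolved while appending to `child`, so the
    input column is never decoded into a separate intermediate array.
    """
    offsets = [0]
    child: List[Optional[str]] = []
    for row in range(num_rows):
        for col in columns:
            child.append(value_at(col, row))
        offsets.append(len(child))
    return offsets, child
```

For example, with `a = DictColumn(["x", "y"], [1, None, 0])` and `b = ["p", "q", "r"]`, calling `make_array([a, b], 3)` produces one two-element list per row with the values of `a` already decoded.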


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

