eejbyfeldt commented on issue #841: URL: https://github.com/apache/datafusion-comet/issues/841#issuecomment-2298133485
> Letting DataFusion infer the return type instead of specifying it results in
>
> ```
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 11) (10.10.0.29 executor driver): java.lang.NullPointerException: Cannot invoke "org.apache.comet.shaded.arrow.vector.dictionary.DictionaryProvider.lookup(long)" because "dictionaryProvider" is null
> ```

I believe this would be resolved by forwarding the `dictionaryProvider` into the `CometListVector`, similarly to what was done with the `CometMapVector` and `CometStructVector` in this PR: https://github.com/apache/datafusion-comet/pull/789/files

> Not sure what the expected behavior or fix is. Either implement this function from scratch with better dictionary handling, or add some wrapper around invoking the UDF to flatten dictionary encoded arrays

To me this seems like a bug in the upstream DataFusion implementation. Would it not be better to address the bug there, e.g. make that implementation behave correctly for dictionary-encoded types?

> Does it make sense to make an Expression equivalent of `CopyExec`? Then for things like DataFusion scalar UDFs used directly you could just wrap it a `CopyArrays` expression that just unpacks dictionaries if they exist? Then you wouldn't need custom Rust code for handling it, can just be done in the Scala query serde. Downside is if you have multiple expressions in a single project you might end up unpacking the same thing multiple times.

Seems like that would cause an unnecessary copy in the case of `make_array`. To me it seems like we should just be unpacking the dictionary as part of the data being written into the new array data structure. Other expressions might also be able to do other optimizations by keeping data in the dictionary format, for example https://github.com/apache/datafusion-comet/issues/504
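
To make the last point concrete, here is a rough sketch of resolving dictionary keys while the output list is being written, rather than materializing an unpacked copy of each input first. It uses plain Rust with made-up types: `Column`, `value`, and this `make_array` are hypothetical stand-ins for illustration only, not the arrow-rs or DataFusion APIs, and the real implementation would work on Arrow buffers instead of `Vec<String>`.

```rust
/// Hypothetical stand-in for an Arrow array; not the arrow-rs API.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    /// Plain values, one per row.
    Plain(Vec<String>),
    /// Dictionary encoding: `keys[i]` is an index into `values`.
    Dictionary { keys: Vec<usize>, values: Vec<String> },
}

impl Column {
    /// Number of rows in the column.
    fn len(&self) -> usize {
        match self {
            Column::Plain(v) => v.len(),
            Column::Dictionary { keys, .. } => keys.len(),
        }
    }

    /// Resolve the value at `row`, looking through the dictionary if present.
    fn value(&self, row: usize) -> &str {
        match self {
            Column::Plain(v) => &v[row],
            Column::Dictionary { keys, values } => &values[keys[row]],
        }
    }
}

/// A `make_array`-like function: row i of the result is the list
/// [cols[0][i], cols[1][i], ...]. Dictionary keys are resolved while the
/// output is written, so no intermediate unpacked copy of an input is built.
fn make_array(cols: &[Column]) -> Vec<Vec<String>> {
    let rows = cols.first().map_or(0, |c| c.len());
    assert!(
        cols.iter().all(|c| c.len() == rows),
        "columns must have equal length"
    );
    (0..rows)
        .map(|row| cols.iter().map(|c| c.value(row).to_string()).collect())
        .collect()
}

fn main() {
    let a = Column::Dictionary {
        keys: vec![0, 1, 0],
        values: vec!["x".into(), "y".into()],
    };
    let b = Column::Plain(vec!["p".into(), "q".into(), "r".into()]);
    println!("{:?}", make_array(&[a, b]));
}
```

Each value is still copied once into the result, which is unavoidable when building a new list, but the separate "unpack everything first" pass that a `CopyArrays`-style wrapper would add is gone.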
