comphead opened a new issue, #15162:
URL: https://github.com/apache/datafusion/issues/15162

   ### Is your feature request related to a problem or challenge?
   
   While implementing support for ARRAY types from Apache Spark in Apache 
DataFusion Comet, it was found that the hardcoded name of the inner list field 
differs between arrow-rs and Apache Spark.
   
   The name of the inner ListType field is hardcoded to `item` in 
https://github.com/apache/arrow-rs/blob/f4fde769ab6e1a9b75f890b7f8b47bc22800830b/arrow-schema/src/field.rs#L130
   
   However, it is `element` in Apache Spark:
   
   ```
   scala> spark.sql("select array(1, 2, 3)").printSchema
   root
    |-- array(1, 2, 3): array (nullable = false)
    |    |-- element: integer (containsNull = false)
   ```
   
   Because of this discrepancy, schema validation fails when the record batch 
is created:
   
   ```
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 309.0 failed 1 times, most
    recent failure: Lost task 0.0 in stage 309.0 (TID 797) 
(Mac-1741305812954.local executor driver): 
   org.apache.comet.CometNativeException: Invalid argument error: column types 
must match schema types, 
   expected List(Field { name: "element", data_type: Int8, nullable: true, 
dict_id: 0, dict_is_ordered: false, metadata: 
   {} }) but found List(Field { name: "item", data_type: Int8, nullable: true, 
dict_id: 0, dict_is_ordered: false, 
   metadata: {} }) at column index 0
   ```
   
   
   In DataFusion, the list creation method `Field::new_list_field`, which uses 
the hardcoded field name, is heavily used. The idea of this ticket is to find a 
way to parametrize that name. Possible approaches:
   - Replace `Field::new_list_field` with `Field::new`, which accepts a custom 
name. However, these methods are often called from contexts where no 
`SessionContext` exists, so there is no way to access a config variable that 
could carry the new name
   - Make the name configurable in arrow-rs. Unfortunately, there is no external 
config in arrow-rs. It is possible to use environment variables, but that is 
usually not a good way to go
   - Change `RecordBatch::try_new` so that for list types it does not compare 
the inner field name and checks only the inner data type
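   
   To make the third option concrete, here is a minimal sketch of a comparison 
that ignores the inner list field name. The `DataType` and `Field` types below 
are simplified stand-ins written for this example, not the arrow-rs 
definitions, and `types_match_ignoring_list_names` is a hypothetical helper, 
not an existing arrow-rs or DataFusion function.
   
   ```rust
   // Simplified stand-ins for Arrow schema types (assumption: NOT arrow-rs).
   #[derive(Debug, Clone, PartialEq)]
   enum DataType {
       Int8,
       Int32,
       List(Box<Field>),
   }

   #[derive(Debug, Clone, PartialEq)]
   struct Field {
       name: String,
       data_type: DataType,
       nullable: bool,
   }

   impl Field {
       fn new(name: &str, data_type: DataType, nullable: bool) -> Self {
           Field { name: name.to_string(), data_type, nullable }
       }
   }

   // Hypothetical relaxed check: for list types, ignore the inner field name
   // and compare only the inner data type (recursively) and nullability.
   // All other types fall back to strict equality.
   fn types_match_ignoring_list_names(a: &DataType, b: &DataType) -> bool {
       match (a, b) {
           (DataType::List(x), DataType::List(y)) => {
               x.nullable == y.nullable
                   && types_match_ignoring_list_names(&x.data_type, &y.data_type)
           }
           _ => a == b,
       }
   }
   ```
   
   With these stand-ins, a list whose inner field is named `element` and one 
whose inner field is named `item` are unequal under strict `==`, but match 
under the relaxed check as long as the inner data type and nullability agree.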
   
   
   Related https://github.com/apache/datafusion-comet/pull/1456
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

