goldmedal commented on issue #12788: URL: https://github.com/apache/datafusion/issues/12788#issuecomment-2402885802
> BTW thinking more about this, I do think we need to support the cast, but in this PR we should effectively change the _file_ schema (not just the table schema) when we setup the parquet reader (specifically with [`ArrowReaderOptions::with_schema`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_schema))

This idea sounds great. If we can apply the new schema when reading the file, we save one cast and simply read the column as a string.

I tried to follow the StringView implementation and apply the new schema using `with_schema`, but I got a casting error:

```
Parquet error: Arrow: incompatible arrow schema, the following fields could not be cast:
```

I can reproduce this error on the arrow-rs side by adding a test case in `parquet/src/arrow/arrow_reader/mod.rs`:

```rust
#[test]
fn test_cast_binary_utf8() {
    // Schema the file is written with: a single non-nullable Binary column.
    let original_fields = Fields::from(vec![
        Field::new("binary_to_utf8", ArrowDataType::Binary, false),
    ]);
    let file = write_parquet_from_iter(vec![
        (
            "binary_to_utf8",
            Arc::new(BinaryArray::from(vec![b"one".as_ref(), b"two".as_ref()])) as ArrayRef,
        ),
    ]);

    // Schema supplied at read time: the same column, but as Utf8.
    let supplied_fields = Fields::from(vec![
        Field::new("binary_to_utf8", ArrowDataType::Utf8, false),
    ]);

    let options = ArrowReaderOptions::new().with_schema(Arc::new(Schema::new(supplied_fields)));
    let mut arrow_reader = ParquetRecordBatchReaderBuilder::try_new_with_options(
        file.try_clone().unwrap(),
        options,
    )
    .expect("reader builder with schema")
    .build()
    .expect("reader with schema");

    let batch = arrow_reader.next().unwrap().unwrap();
    assert_eq!(batch.num_columns(), 1);
    assert_eq!(batch.num_rows(), 2);
    // The column should come back as a StringArray, not a BinaryArray.
    assert_eq!(
        batch
            .column(0)
            .as_any()
            .downcast_ref::<StringArray>()
            .expect("downcast to string")
            .iter()
            .collect::<Vec<_>>(),
        vec![Some("one"), Some("two")]
    );
}
```

The output is:

```
reader builder with schema: ArrowError("incompatible arrow schema, the following fields could not be cast: [binary_to_utf8]")
```

I tried to fix it by adding one more match arm at https://github.com/apache/arrow-rs/blob/5508978a3c5c4eb65ef6410e097887a8adaba38a/parquet/src/arrow/schema/primitive.rs#L40:

```rust
(DataType::Binary, DataType::Utf8) => hint,
```

It works, but I'm not sure whether this approach makes sense 🤔
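To make the effect of that extra match arm concrete, here is a minimal, self-contained sketch of how a hint-based type resolution might behave. It is a toy model, not the arrow-rs code: `ToyType`, this `apply_hint`, and `check_field` are hypothetical stand-ins, and the rule "a field is incompatible when the resolved type differs from the supplied type" is an assumption about how the reader validates a supplied schema.

```rust
// Toy stand-in for Arrow's DataType. Hypothetical; NOT the real arrow-rs enum.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Binary,
    LargeBinary,
    Utf8,
    LargeUtf8,
    Int32,
}

/// Resolve the type a reader would produce for one column.
/// The hint (the type from the supplied schema) wins only for
/// explicitly listed pairs; otherwise the file's own type is kept.
fn apply_hint(parquet: ToyType, hint: ToyType) -> ToyType {
    match (&parquet, &hint) {
        // Offset-size changes, listed here only as illustrative pairs.
        (ToyType::Utf8, ToyType::LargeUtf8) => hint,
        (ToyType::Binary, ToyType::LargeBinary) => hint,
        // Analogue of the arm proposed in the comment above.
        (ToyType::Binary, ToyType::Utf8) => hint,
        // Any other pair: ignore the hint.
        _ => parquet,
    }
}

/// Assumed validation rule: the supplied type is accepted only if
/// hint resolution actually yields it.
fn check_field(name: &str, parquet: ToyType, supplied: ToyType) -> Result<ToyType, String> {
    let resolved = apply_hint(parquet, supplied.clone());
    if resolved == supplied {
        Ok(resolved)
    } else {
        Err(format!(
            "incompatible schema, field could not be cast: [{}]",
            name
        ))
    }
}
```

In this model, `check_field("binary_to_utf8", ToyType::Binary, ToyType::Utf8)` succeeds only because of the added `(Binary, Utf8)` arm; without it the pair falls through to the default and is rejected. The sketch says nothing about whether the stored bytes are valid UTF-8, which a real reader would still have to handle.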
