Re: [PR] Fix parquet test using wrong schema [datafusion]

via GitHub Thu, 22 May 2025 02:00:32 -0700


xudong963 commented on code in PR #16133:
URL: https://github.com/apache/datafusion/pull/16133#discussion_r2102026730



##########
datafusion/core/src/datasource/physical_plan/parquet.rs:
##########
@@ -200,26 +210,43 @@ mod tests {
 
         /// run the test, returning the `RoundTripResult`
         async fn round_trip(&self, batches: Vec<RecordBatch>) -> RoundTripResult {
-            let file_schema = match &self.schema {
+            self.round_trip_with_file_batches(batches, None).await
+        }
+
+        /// run the test, returning the `RoundTripResult`
+        /// If your table schema is different from file schema, you may need to specify the `file_batches` with the file schema
+        /// Or the file schema in the parquet source will be table schema, see `store_parquet` for detail
+        async fn round_trip_with_file_batches(
+            &self,
+            batches: Vec<RecordBatch>,
+            file_batches: Option<Vec<RecordBatch>>,
+        ) -> RoundTripResult {
+            let batches_schema =
+                Schema::try_merge(batches.iter().map(|b| b.schema().as_ref().clone()));
+            let file_schema = match &self.physical_file_schema {
                 Some(schema) => schema,
-                None => &Arc::new(
-                    Schema::try_merge(
-                        batches.iter().map(|b| b.schema().as_ref().clone()),
-                    )
-                    .unwrap(),
-                ),
+                None => &Arc::new(batches_schema.as_ref().unwrap().clone()),
             };
             let file_schema = Arc::clone(file_schema);
+            let table_schema = match &self.logical_file_schema {
+                Some(schema) => schema,
+                None => &Arc::new(batches_schema.as_ref().unwrap().clone()),
+            };
+
             // If testing with page_index_predicate, write parquet
             // files with multiple pages
             let multi_page = self.page_index_predicate;
-            let (meta, _files) = store_parquet(batches, multi_page).await.unwrap();

Review Comment:
   The original code writes the parquet files from `batches`, so the physical
   file schema used in the parquet source will be the table schema (the
   logical file schema). The tests in
   https://github.com/apache/datafusion/pull/16086 may therefore be meaningless.
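
   To illustrate the concern, here is a minimal sketch using toy types. It
   assumes the fallback described in the new doc comment: the written file
   takes its schema from `file_batches` when provided, otherwise from
   `batches`. `ToySchema`, `ToyBatch`, `toy_schema` and `written_file_schema`
   are hypothetical names for this example only and are not part of the
   arrow or datafusion APIs.

```rust
// Toy stand-ins for Arrow's `Schema` and `RecordBatch`; hypothetical types,
// not the real arrow/datafusion API.
#[derive(Debug, Clone, PartialEq)]
struct ToySchema(Vec<(String, String)>); // (column name, type name)

#[derive(Debug, Clone)]
struct ToyBatch {
    schema: ToySchema,
}

/// Hypothetical helper: build a schema from (name, type) pairs.
fn toy_schema(cols: &[(&str, &str)]) -> ToySchema {
    ToySchema(
        cols.iter()
            .map(|(name, ty)| (name.to_string(), ty.to_string()))
            .collect(),
    )
}

/// Schema of the file that would be written: taken from `file_batches` when
/// supplied, otherwise from the table-shaped `batches`.
/// Returns `None` if the chosen slice is empty.
fn written_file_schema(
    batches: &[ToyBatch],
    file_batches: Option<&[ToyBatch]>,
) -> Option<ToySchema> {
    let source = file_batches.unwrap_or(batches);
    source.first().map(|b| b.schema.clone())
}
```

   With a table schema of `c1: Utf8` and a file schema of `c1: Int32`,
   calling `written_file_schema(&batches, None)` yields the table schema, so
   a test written that way would not see a file whose schema differs from the
   table. Passing `Some(...)` with batches in the file schema yields the
   `Int32` schema instead.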



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

