kosiew commented on PR #15295: URL: https://github.com/apache/datafusion/pull/15295#issuecomment-2871799892
hi @TheBuilderJR , I pushed a simpler test_datafusion_schema_evolution- https://github.com/apache/datafusion/pull/15295/files#diff-f73022e8850396e8f5595b58e1b24a7fd08499dc6ac85a1070a329d167c2ad65:~:text=%7D-,async%20fn%20test_datafusion_schema_evolution()%20%2D%3E%20Result%3C()%2C,%7D,-fn%20create_batch( ### `test_datafusion_schema_evolution` Function Summary #### **Key Steps in the Function** 1. **Initialize Session Context:** - Creates a new DataFusion `SessionContext` for executing SQL queries. 2. **Schema Creation:** - Defines four schemas (`schema1` to `schema4`) representing different data structures to simulate schema evolution. 3. **Parquet File Generation:** - Creates and writes four separate Parquet files (`test_data1.parquet` to `test_data4.parquet`), each conforming to one of the defined schemas. 4. **Schema Adapter Setup:** - Creates a `SchemaAdapterFactory` to handle schema differences, specifically using a `NestedStructSchemaAdapterFactory`. 5. **Listing Table Configuration:** - Sets up a `ListingTableConfig` to handle multiple paths and apply schema adaptation, aligning the different schemas to the final schema (`schema4`). 6. **Inference and Table Registration:** - Infers the schema configuration, then creates and registers a `ListingTable` for querying the combined data. 7. **Query Execution:** - Executes a SQL query to retrieve and sort all the records, asserting that the combined table contains 4 rows. 8. **Data Compaction:** - Compacts the four Parquet files into a single output file (`test_data_compacted.parquet`) for more efficient storage and retrieval. 9. **Compacted File Validation:** - Reloads the compacted file in a fresh session to confirm that the data is consistent with the original uncompressed set. 10. **Cleanup:** - Removes all the generated test files to keep the test environment clean. --- #### **Purpose and Coverage** This function validates that DataFusion can correctly manage schema evolution across multiple Parquet files, including adapting nested structures and compacting them into a single, unified schema. Please let me know whether this works for you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
