kosiew commented on PR #15295: URL: https://github.com/apache/datafusion/pull/15295#issuecomment-2871799892
hi @TheBuilderJR , I pushed a simpler test_datafusion_schema_evolution- https://github.com/apache/datafusion/pull/15295/files#diff-f73022e8850396e8f5595b58e1b24a7fd08499dc6ac85a1070a329d167c2ad65:~:text=%7D-,async%20fn%20test_datafusion_schema_evolution()%20%2D%3E%20Result%3C()%2C,%7D,-fn%20create_batch( ### `test_datafusion_schema_evolution` Function Summary #### **Key Steps in the Function** 1. **Initialize Session Context:** - Creates a new DataFusion `SessionContext` for executing SQL queries. 2. **Schema Creation:** - Defines four schemas (`schema1` to `schema4`) representing different data structures to simulate schema evolution. 3. **Parquet File Generation:** - Creates and writes four separate Parquet files (`test_data1.parquet` to `test_data4.parquet`), each conforming to one of the defined schemas. 4. **Schema Adapter Setup:** - Creates a `SchemaAdapterFactory` to handle schema differences, specifically using a `NestedStructSchemaAdapterFactory`. 5. **Listing Table Configuration:** - Sets up a `ListingTableConfig` to handle multiple paths and apply schema adaptation, aligning the different schemas to the final schema (`schema4`). 6. **Inference and Table Registration:** - Infers the schema configuration, then creates and registers a `ListingTable` for querying the combined data. 7. **Query Execution:** - Executes a SQL query to retrieve and sort all the records, asserting that the combined table contains 4 rows. 8. **Data Compaction:** - Compacts the four Parquet files into a single output file (`test_data_compacted.parquet`) for more efficient storage and retrieval. 9. **Compacted File Validation:** - Reloads the compacted file in a fresh session to confirm that the data is consistent with the original uncompressed set. 10. **Cleanup:** - Removes all the generated test files to keep the test environment clean. --- #### **Purpose and Coverage** This function validates that DataFusion can correctly manage schema evolution across multiple Parquet files, including adapting nested structures and compacting them into a single, unified schema. Please let me know whether this works for you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
