sergiimk commented on issue #22935:
URL: https://github.com/apache/datafusion/issues/22935#issuecomment-4696737345

   I narrowed down the problem a bit:
   
   Greatly simplifying, my test performs CDC between tables like these:
   ```sql
   create or replace table old (
       city string not null,
       population bigint not null
   ) as values
   ('A', 1000),
   ('B', 2000),
   ('C', 3000);
   
   create or replace table new (
       city string not null,
       population bigint not null,
       census_url string
   ) as values
   ('A', 1000, null),
   ('B', 2000, 'https://b.ca/census'),
   ('C', 3000, null),
   ('D', 4000, 'https://d.ca/census');
   ```
   
   It checks the special case where the `new` table has evolved by adding a new 
optional column `census_url`.
   
   Under the hood it loads `old` using `SessionContext::read_parquet` and 
passes `new.schema()` as an explicit schema to avoid schema inference from 
Parquet, expecting that the missing `census_url` column will be filled with 
`null`s.
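   
   The expected result of that load can be sketched as a toy model in plain 
Rust. This is not DataFusion or Parquet code: `NewRow`, `widen`, and 
`load_old_with_new_schema` are hypothetical names introduced only for 
illustration, and the rows are the sample values from the `old` table above.
   ```rust
   // Toy model only: plain Rust, no DataFusion or Parquet involved.
   // It shows what reading `old` rows under the `new` schema should yield.

   /// A row shaped like the `new` table; `census_url` is the nullable column.
   #[derive(Debug, PartialEq)]
   struct NewRow {
       city: String,
       population: i64,
       census_url: Option<String>,
   }

   /// Hypothetical helper: widen an `old` row (city, population) to the `new`
   /// schema, filling the column that `old` lacks with null (`None`).
   fn widen(city: &str, population: i64) -> NewRow {
       NewRow {
           city: city.to_string(),
           population,
           census_url: None,
       }
   }

   /// Hypothetical stand-in for loading `old` with the `new` schema.
   fn load_old_with_new_schema() -> Vec<NewRow> {
       [("A", 1000), ("B", 2000), ("C", 3000)]
           .iter()
           .map(|&(city, population)| widen(city, population))
           .collect()
   }
   ```
   
   Calling `load_old_with_new_schema()` returns three rows whose `census_url` 
is `None`, which is the null-filled shape the test expects to get back from 
the reader.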
   
   When I remove the explicit schema from `read_parquet`, letting DF infer 
the old schema, and then manually add the missing column via 
`df.with_column("census_url", lit(ScalarValue::Utf8(None)))`, the panic 
disappears.
   
   I believe this edge case, loading Parquet with an explicit schema that 
contains a column missing from the file, is what's breaking statistics.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

