sergiimk commented on issue #22935:
URL: https://github.com/apache/datafusion/issues/22935#issuecomment-4696737345
I narrowed down the problem a bit:
Greatly simplifying, my test performs CDC between tables like these:
```sql
create or replace table old (
city string not null,
population bigint not null
) as values
('A', 1000),
('B', 2000),
('C', 3000);
create or replace table new (
city string not null,
population bigint not null,
census_url string
) as values
('A', 1000, null),
('B', 2000, 'https://b.ca/census'),
('C', 3000, null),
('D', 4000, 'https://d.ca/census');
```
It checks the special case where the `new` table has evolved by adding a new
optional column `census_url`.
Under the hood it loads `old` using `SessionContext::read_parquet` and
passes `new.schema()` as the explicit schema to avoid schema inference from
Parquet, expecting that the `census_url` column, which is missing from the
file, will be filled with `null`s.
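To make that expectation concrete, here is a toy model in plain Rust. It is only a sketch of the behavior I expect, not DataFusion's actual schema adaptation code; `Column` and `adapt_to_schema` are hypothetical names, and all values are modeled as optional strings for brevity.

```rust
use std::collections::HashMap;

/// Toy column: one optional string per row (hypothetical model, not an Arrow array).
type Column = Vec<Option<String>>;

/// Project the columns found in a file onto the requested table schema.
/// Columns that the file does not contain are filled with nulls (`None`).
fn adapt_to_schema(
    file_cols: &HashMap<String, Column>,
    table_schema: &[&str],
    num_rows: usize,
) -> Vec<(String, Column)> {
    table_schema
        .iter()
        .map(|name| {
            let col = file_cols
                .get(*name)
                .cloned()
                // Column absent from the file: produce an all-null column.
                .unwrap_or_else(|| vec![None; num_rows]);
            (name.to_string(), col)
        })
        .collect()
}

fn main() {
    // Contents of `old`, which predates the `census_url` column.
    let mut old: HashMap<String, Column> = HashMap::new();
    old.insert(
        "city".to_string(),
        vec![Some("A".into()), Some("B".into()), Some("C".into())],
    );
    old.insert(
        "population".to_string(),
        vec![Some("1000".into()), Some("2000".into()), Some("3000".into())],
    );

    // Read `old` through the schema of `new`.
    let adapted = adapt_to_schema(&old, &["city", "population", "census_url"], 3);
    for (name, col) in &adapted {
        println!("{name}: {col:?}");
    }
}
```

In this model `census_url` comes back as three nulls while `city` and `population` are passed through unchanged.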
When I remove the explicit schema from `read_parquet`, letting DF infer the
old schema, and then manually add the missing column via
`df.with_column("census_url", lit(ScalarValue::Utf8(None)))`, the panic
disappears.
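Roughly, the two load paths look like the sketch below. This is simplified from my test, not copied from it: the function names and the `old.parquet` path are hypothetical, and I am assuming `string` maps to `Utf8` and `bigint` to `Int64`. It needs the `datafusion` and `tokio` crates.

```rust
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::error::Result;
use datafusion::prelude::*;
use datafusion::scalar::ScalarValue;

/// Path 1: read the old file with the evolved schema passed explicitly.
/// This is the variant that leads to the panic for me.
async fn load_with_explicit_schema(ctx: &SessionContext, path: &str) -> Result<DataFrame> {
    // Assumed Arrow equivalent of the `new` table definition above.
    let new_schema = Schema::new(vec![
        Field::new("city", DataType::Utf8, false),
        Field::new("population", DataType::Int64, false),
        Field::new("census_url", DataType::Utf8, true),
    ]);
    ctx.read_parquet(path, ParquetReadOptions::default().schema(&new_schema))
        .await
}

/// Path 2: let DataFusion infer the old schema from the file, then add the
/// missing column as a typed null literal. The panic does not occur here.
async fn load_then_add_column(ctx: &SessionContext, path: &str) -> Result<DataFrame> {
    ctx.read_parquet(path, ParquetReadOptions::default())
        .await?
        .with_column("census_url", lit(ScalarValue::Utf8(None)))
}

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Hypothetical path to a Parquet file holding the `old` table.
    let path = "old.parquet";

    let df = load_then_add_column(&ctx, path).await?;
    df.show().await?;

    // Swap in the explicit-schema variant to compare behavior.
    let df = load_with_explicit_schema(&ctx, path).await?;
    df.show().await?;
    Ok(())
}
```

Note that reading and showing the data alone may not trigger the panic; in my test it shows up in the queries that run on top of the frame loaded via path 1.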
I believe this edge case of loading Parquet with an explicit schema that
contains a column absent from the file is what's breaking statistics.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]