AHeise commented on pull request #15725:
URL: https://github.com/apache/flink/pull/15725#issuecomment-848745186


   Same as in the other PR: please rebase onto 1.13 and update the target branch 
accordingly.
   
   A few high-level questions to get me up to speed faster:
   - In Avro, we have a reader and a writer schema. If the schema evolves, the 
writer schema of each record updates, and through schema compatibility I still get 
the equivalent record in the reader schema automatically. So for Avro, I'd usually 
specify an additional schema to make sure that my application is forward and 
backward compatible. 
   - Now it seems that Parquet (I haven't checked the details yet) has a similar 
concept. Having a particular reader schema is even more important there, as it 
allows us to skip reading large chunks of the file when a specific column is not 
needed, thanks to the columnar layout of the file.
   - Is your change now effectively disabling the reader schema? Or can it just 
be omitted and assumed to be the writer schema?
   - How would it work when I read two Parquet files with different schemas that 
can both be mapped to the same reader schema? For example, consider a 
schema-evolution case where file 1 is written by pipeline v1 and file 2 is written 
by pipeline v2 with an additional column that is ignored in the consuming Flink 
application.
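   To make the last question concrete, here is a minimal sketch of the Avro-side 
behavior I have in mind, using Avro's `GenericDatumReader(writerSchema, readerSchema)` 
resolution (the `Event` record and its fields are made up for illustration): a record 
written with a v2 schema carrying an extra column is resolved into the shared v1 
reader schema, and the extra column is dropped automatically.

   ```java
   import java.io.ByteArrayOutputStream;

   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericDatumReader;
   import org.apache.avro.generic.GenericDatumWriter;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.io.Decoder;
   import org.apache.avro.io.DecoderFactory;
   import org.apache.avro.io.Encoder;
   import org.apache.avro.io.EncoderFactory;

   public class SchemaEvolutionSketch {
       public static void main(String[] args) throws Exception {
           // v1 writer schema: a single column (hypothetical example schema).
           Schema writerV1 = new Schema.Parser().parse(
               "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                   + "{\"name\":\"id\",\"type\":\"long\"}]}");
           // v2 writer schema: same record with an additional column.
           Schema writerV2 = new Schema.Parser().parse(
               "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                   + "{\"name\":\"id\",\"type\":\"long\"},"
                   + "{\"name\":\"extra\",\"type\":\"string\"}]}");
           // The reader schema shared by the consuming application ignores "extra".
           Schema reader = writerV1;

           // A record produced by pipeline v2.
           GenericRecord recV2 = new GenericData.Record(writerV2);
           recV2.put("id", 42L);
           recV2.put("extra", "ignored by the consumer");

           // Serialize with the v2 writer schema.
           ByteArrayOutputStream out = new ByteArrayOutputStream();
           Encoder enc = EncoderFactory.get().binaryEncoder(out, null);
           new GenericDatumWriter<GenericRecord>(writerV2).write(recV2, enc);
           enc.flush();

           // Deserialize with (writerV2, reader): Avro schema resolution projects
           // the record onto the reader schema, dropping the extra column.
           Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
           GenericRecord resolved =
               new GenericDatumReader<GenericRecord>(writerV2, reader).read(null, dec);
           System.out.println(resolved);
       }
   }
   ```

   My question is essentially whether the Parquet source after this change still 
offers an equivalent projection point, or whether each file is always read with its 
own writer schema.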
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

