saLeox commented on PR #21149:
URL: https://github.com/apache/flink/pull/21149#issuecomment-1367123546

   Hi @Tartarus0zm, thanks for your comments!
   Regarding the rule constraints: when reading each Parquet split, Flink already has a rule that checks whether a column from the requested schema is missing from the file schema and whether the field types match; for details, please refer to `ParquetReader::createReader` and `ParquetReader::checkColumn` in Flink. Our change will not break this rule.
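   To make the constraint concrete, here is a minimal sketch of that kind of compatibility check. The class and method names are illustrative only (not the actual Flink code); it assumes the standard Parquet `MessageType`/`Type` schema classes:

   ```java
   import org.apache.parquet.schema.MessageType;
   import org.apache.parquet.schema.Type;

   final class ColumnCheckSketch {
       // Returns false when the requested column is absent from this
       // split's file schema; throws when the column exists but its
       // type does not match the requested type.
       static boolean isColumnReadable(
               MessageType fileSchema, String name, Type requestedType) {
           if (!fileSchema.containsField(name)) {
               return false; // column missing from the file schema
           }
           Type fileType = fileSchema.getType(name);
           if (!fileType.equals(requestedType)) {
               throw new IllegalArgumentException(
                       "Type mismatch for column " + name + ": file has "
                               + fileType + ", requested " + requestedType);
           }
           return true;
       }
   }
   ```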
   
   In addition, keeping behavior consistent across engines is a good point, so I checked the implementation in Spark. Spark offers a `mergeSchema` option at read time, which merges the schemas of all specified Parquet files to support schema evolution. Inside each Parquet reader, Spark also keeps a collection called `missingColumns` to track columns absent from the file. For a missing column, the whole vector is set to null, instead of an error being thrown and the column being skipped. For reference, please check the implementations of `VectorizedParquetRecordReader::checkColumn` and `ParquetColumnVector::ParquetColumnVector` in Spark. In that sense, this change lets users achieve schema evolution in Flink, the same as they can in Spark.
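   As a rough illustration of that Spark behavior (a sketch against Spark's `OnHeapColumnVector`, not the actual reader code; the helper name and the `LongType` column are assumptions for the example), the missing-column handling boils down to filling the whole vector with nulls:

   ```java
   import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
   import org.apache.spark.sql.types.DataTypes;

   final class MissingColumnSketch {
       // For a column that is in the requested schema but absent from
       // the file, the vector is allocated and every row is marked as
       // null instead of the read failing.
       static OnHeapColumnVector allNullVector(int batchSize) {
           OnHeapColumnVector vector =
                   new OnHeapColumnVector(batchSize, DataTypes.LongType);
           vector.putNulls(0, batchSize);
           vector.setIsConstant(); // no batch ever writes to this vector
           return vector;
       }
   }
   ```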
   
   Please let me know if there are any concerns or if further clarification is needed.

