saLeox commented on PR #21149:
URL: https://github.com/apache/flink/pull/21149#issuecomment-1367123546

Hi @Tartarus0zm, thanks for your comments!

Regarding the rule constraints: when reading each Parquet split, Flink already has a rule that checks whether a column from the requested schema is missing from the file schema, and whether the field types match; for details, please refer to `ParquetReader::createReader` and `ParquetReader::checkColumn` in Flink. Our change does not break that rule.

It is also a good point to keep behavior consistent across engines, so I checked the implementation in Spark. Spark offers a `mergeSchema` option at read time that merges the schemas of all specified Parquet files to achieve schema evolution (see the usage sketch below). In addition, inside each Parquet reader, Spark keeps a collection called `missingColumns`; for a missing column, the whole vector is filled with nulls rather than an error being thrown or the column being dropped from the schema (see the simplified reader sketch below). For reference, please check the implementations of `VectorizedParquetRecordReader::checkColumn` and `ParquetColumnVector::ParquetColumnVector` in Spark.

In that sense, this change lets users achieve schema evolution in Flink, just as they can in Spark. Please let me know if there is any concern or if further clarification is needed.
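For illustration, here is a minimal sketch of the `mergeSchema` read option in Spark's Java API. The path and file layout are hypothetical and assume two Parquet files written with different schemas under the same directory (e.g. an older file with column `id` only, a newer one with `id` and `name`):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MergeSchemaExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("merge-schema-demo")
                .master("local[*]")
                .getOrCreate();

        // mergeSchema unions the schemas of all Parquet files under the path.
        // Rows coming from files that lack a column get null for that column,
        // instead of the read failing.
        Dataset<Row> df = spark.read()
                .option("mergeSchema", "true")
                .parquet("/tmp/parquet-table"); // hypothetical path

        df.printSchema(); // merged schema, e.g. (id, name)
        df.show();        // name is null for rows from the older file

        spark.stop();
    }
}
```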
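And to make the missing-column handling concrete, here is a simplified, self-contained sketch of the idea; the `NullableIntVector` type and the helper names are hypothetical stand-ins for illustration, not Flink's or Spark's actual classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified sketch: a reader that back-fills an all-null vector for any
// requested column that is absent from the file schema.
public class MissingColumnSketch {

    /** Minimal stand-in for a writable column vector (hypothetical). */
    static final class NullableIntVector {
        final int[] values;
        final boolean[] isNull;

        NullableIntVector(int capacity) {
            values = new int[capacity];
            isNull = new boolean[capacity];
        }

        /** Marks every position as null, mirroring how a vectorized reader
         *  can handle a column that is missing from the file schema. */
        void fillWithNulls() {
            Arrays.fill(isNull, true);
        }
    }

    /**
     * For each requested column: if the file schema contains it, a real
     * reader would decode its pages into the vector; otherwise the whole
     * vector is set to null instead of the read failing.
     */
    static List<NullableIntVector> readBatch(List<String> requestedSchema,
                                             List<String> fileSchema,
                                             int batchSize) {
        List<NullableIntVector> batch = new ArrayList<>();
        for (String column : requestedSchema) {
            NullableIntVector vector = new NullableIntVector(batchSize);
            if (fileSchema.contains(column)) {
                // A real reader decodes the column's Parquet pages here.
            } else {
                vector.fillWithNulls(); // missing column -> all-null vector
            }
            batch.add(vector);
        }
        return batch;
    }

    public static void main(String[] args) {
        // Requested schema has "b", but this (older) file only contains "a".
        List<NullableIntVector> batch =
                readBatch(List.of("a", "b"), List.of("a"), 4);
        System.out.println("column b all null? " + allNull(batch.get(1)));
    }

    static boolean allNull(NullableIntVector v) {
        for (boolean n : v.isNull) {
            if (!n) return false;
        }
        return true;
    }
}
```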
Hi @Tartarus0zm , thanks for your comments! For the rule constraints, when reading each parquet split, there is an existing rule in Flink, to check whether a column from requested schema is missing from the file schema and if the field type is matched, for detail please refer to `ParquetReader::createReader` and `ParquetReader::checkColumn` in Flink. Our change will not break the rule. In addition, it's a good point to keep the engine behavior consistent. I check the implementation in Spark. It's known that there is `mergeSchema` option when reading, and it will merge the schema from all specified parquet files, to achieve the schema evolution. After that, inside each parquet reader, they also have a collection called `missingColumns` to collect missing columns. For the missing column, the whole vector will be set as null, instead of throwing error and skipping it in schema. For your reference, Please check the implementation of the method `VectorizedParquetRecordReader::checkColumn` and `ParquetColumnVector::ParquetColumnVector` in the Spark. In that sense, this change will help user achieve schema evolution in Flink, as same as what they can do with Spark. Please let me know if there is any concern or further clarification needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org