JingGe opened a new pull request #17501: URL: https://github.com/apache/flink/pull/17501
## What is the purpose of the change The goal of this PR to provide AvroParquetRecordFormat implementation to read avro GenericRecords from parquet via the new Flink FileSource. This is the draft PR, there is one failed unit test case, which is documented out for now with TODO comment. I am still working on it. ## Brief change log - Create RecordFormat interface which focuses on Path with default methods implementation and delegate createReader() calls to the overloaded methods from StreamFormat that focuses on FDDataInputStream. - Create AvroParquetRecordFormat implementation. Only reading avro GenericRecord from parquet file or stream is supported in this version. Support for other avro record types will be implemented later. - Splitting is not supported in this version. ## Open Questions To give you some background, the original idea was to let AvroParquetRecordFormat implement FileRecordFormat. After considering that FileRecordFormat and StreamFormat have too many commons and StreamFormat has more built-in features like compression support(via StreamFormatAdapter), current design is based on StreamFormat. In order to keep SIP clear, 2-levels interfaces have been defined. Let StreamFormat focuses on the abstract input stream and let RecordFormat pay attention to the concrete FileSystem, i.e. the Path. RecordFormat provides default implementation for the overloaded createReader(...) methods. Subclasses are therefore not forced to implement it. Following are some questions open for discussion: 1. Compare to the 2-levels interfaces design, as another option, all default methods implemented in the RecordFormat could be merged into the StreamFormat. Such design keeps all createReader(...) methods in one interface(StreamFormat only, no more RecordFormat) that is in some ways easier to use. The downside is the SIP violation. Based on these consideration, I chose the current 2-levels API design. 2. After considering priorities, Splitting is currently not supported, since we didn't get strong requirement from the business side. It will be implemented later, when the time comes. 3. If this design works well, as next step, we should consider replacing the FileRecordFormat with the RecordFormat. In this way, the duplicated code and javacode could be avoided. This question is a little bit out of the scope of this PR, but does relate to the topic, could be therefore discussed too. ## Verifying this change This change added tests and can be verified as follows: New unit test validates that : - Reader can be created and restored correctly via Path. - Violation constraint checks for no splitting, null path, and no restoreOffset are working well. - GenericRecords can be read correctly from parquet file. ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): (yes / **no**) - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (**yes** / no) - The serializers: (yes / **no** / don't know) - The runtime per-record code paths (performance sensitive): (**yes** / no / don't know) - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / **no** / don't know) - The S3 file system connector: (yes / **no** / don't know) ## Documentation - Does this pull request introduce a new feature? (**yes** / no) - If yes, how is the feature documented? (not applicable / docs / **JavaDocs** / not documented) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org