ASF GitHub Bot updated FLINK-19161:
-----------------------------------
    Labels: pull-request-available  (was: )

> Port File Sources to FLIP-27 API
> --------------------------------
>
>                 Key: FLINK-19161
>                 URL: https://issues.apache.org/jira/browse/FLINK-19161
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Connectors / FileSystem
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>
> Porting the File sources to the FLIP-27 API means combining
> - the FileInputFormat from the DataSet (batch) API
> - the monitoring File Source from the DataStream API.
> The two already share the same reader code and part of the enumeration code.
>
> *Structure*
> The new File Source will have three components:
> - File enumerators that discover the files.
> - File split assigners that decide which reader gets which split.
> - File reader formats, which deal with the decoding.
> The main difference between the bounded (batch) version and the unbounded
> (streaming) version is that the streaming version repeatedly invokes the file
> enumerator to search for new files.
>
> *Checkpointing Enumerators*
> The enumerators need to checkpoint the not-yet-assigned splits and, if they
> are in continuous discovery mode (streaming), the paths / timestamps already
> processed.
>
> *Checkpointing Readers*
> The new File Source needs to ensure that every reader can be checkpointed.
> Some readers may be able to expose the position in the input file that
> corresponds to the latest emitted record, but many will not be able to do
> that, because they
> - store compressed record batches
> - use buffered decoders where exact position information is not accessible.
> We therefore suggest exposing a mechanism that combines seekable file
> offsets with a number of records to read and skip after that offset. In the
> extreme cases, files can work only with seekable positions or only with
> records-to-skip.
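A minimal Java sketch of such a combined reader position, assuming a hypothetical class name and fields (illustrative only, not necessarily Flink's actual API):

```java
/**
 * Sketch of a reader position that combines a seekable byte offset with a
 * number of records to read and skip after seeking to that offset.
 * Hypothetical names; for illustration of the mechanism described above.
 */
final class CheckpointedPosition implements java.io.Serializable {

    /** Marker for readers that cannot seek by offset and only count records. */
    static final long NO_OFFSET = -1L;

    private final long offset;          // byte offset to seek to, or NO_OFFSET
    private final long recordsToSkip;   // records to read and discard after seeking

    CheckpointedPosition(long offset, long recordsToSkip) {
        if (offset < NO_OFFSET || recordsToSkip < 0) {
            throw new IllegalArgumentException("invalid checkpointed position");
        }
        this.offset = offset;
        this.recordsToSkip = recordsToSkip;
    }

    long getOffset() { return offset; }

    long getRecordsToSkip() { return recordsToSkip; }

    @Override
    public String toString() {
        return "CheckpointedPosition{offset=" + offset
                + ", recordsToSkip=" + recordsToSkip + '}';
    }
}
```

A purely record-counting source would checkpoint `(NO_OFFSET, n)`, a fully seekable one `(offset, 0)`, and a format like Avro would combine a sync-marker offset with a small records-to-skip count.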
> Some sources, like Avro, can have periodic seek points (sync markers) and
> count records-to-skip after these markers.
>
> *Efficient and Convenient Readers*
> To balance efficiency (batched vectorized reading of ORC / Parquet for
> vectorized query processing) and convenience (plugging a third-party CSV
> decoder over a stream), we offer three abstractions for record readers:
> - Bulk formats that run over a file path and return an iterable batch at a
> time _(most efficient)_.
> - File record formats, which read files record by record. The source
> framework hands over a pre-defined-size batch from the split reader to the
> record emitter.
> - Stream formats that decode an input stream and rely on the source
> framework to decide how to batch the record handover _(most convenient)_.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
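The "most convenient" stream-format abstraction described in the issue could be sketched as a simplified Java interface, with a toy line-based implementation. Interface and class names here are hypothetical illustrations, not the final Flink API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/**
 * Sketch of the stream-format abstraction: decodes records one at a time
 * from an input stream; the source framework decides how to batch the
 * handover from split reader to record emitter.
 */
interface StreamFormat<T> {
    Reader<T> createReader(InputStream stream) throws IOException;

    interface Reader<T> {
        /** Returns the next record, or null at end of input. */
        T read() throws IOException;
    }
}

/** Toy stream format that treats each text line as one record. */
final class TextLineFormat implements StreamFormat<String> {
    @Override
    public Reader<String> createReader(InputStream stream) {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        // BufferedReader.readLine() already returns null at end of stream,
        // so it matches the Reader<String> contract directly.
        return reader::readLine;
    }
}
```

A bulk format would instead return a whole iterable batch per call (suitable for vectorized ORC / Parquet reading), and a file record format would sit in between, decoding record by record while the framework pre-assembles fixed-size batches.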