[ https://issues.apache.org/jira/browse/FLINK-20174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233755#comment-17233755 ]
Steven Zhen Wu edited comment on FLINK-20174 at 11/17/20, 5:23 PM:
-------------------------------------------------------------------

I tried the change of setting up the InputFiles map for DeleteFilter per FileScanTask. I was wrong earlier: `TestFlinkInputFormatReaderDeletes` works fine after [the change|https://github.com/stevenzwu/iceberg/pull/4].

So I guess the only reason left is the throughput benefit from combining small files into a decent-sized CombinedScanTask, which is still an important reason.


was (Author: stevenz3wu):
I tried the change of setting up InputFiles map for DeleteFilter per FileScanTask. I was wrong earlier. It actually works fine. You can see the change here.
https://github.com/stevenzwu/iceberg/pull/4

So I guess the only reason left is the throughput benefit from combining small files into a decent sized CombinedScanTask, which is still an important reason.

> Make BulkFormat more extensible
> -------------------------------
>
>                 Key: FLINK-20174
>                 URL: https://issues.apache.org/jira/browse/FLINK-20174
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>    Affects Versions: 1.12.0
>            Reporter: Steven Zhen Wu
>            Priority: Major
>
> Right now, BulkFormat has the generic `SplitT` type extending from
> `FileSourceSplit`. We can make BulkFormat take a generic `SplitT` type
> extending from `SourceSplit` instead. This way, IcebergSourceSplit doesn't
> have to extend from `FileSourceSplit`, and the Iceberg source can reuse the
> BulkFormat interface as [~lzljs3620320] suggested. This allows the Iceberg
> source to take advantage of the high-performance
> `ParquetVectorizedInputFormat` provided by Flink.
> [~sewen] [~lzljs3620320] if you are on board with the change, I would be
> happy to submit a PR. Since it is a breaking change, maybe we should add it
> to the master branch only after the 1.12 release branch is cut?
> The other related question concerns the two APIs, `createReader` and
> `restoreReader`. I understand the motivation. I am just wondering if the
> separation is necessary. If the SplitT has the CheckpointedPosition, the
> seek operation can be handled internally in `createReader`. We can also
> define an abstract `FileSourceSplitBase` that adds a
> `getCheckpointedPosition` API to the `SourceSplit`.
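
To make the two proposals in the description concrete, here is a minimal, self-contained sketch. It assumes nothing about Flink's real classes: every type below (`SourceSplit`, `FileSourceSplitBase`, `BulkFormat`, `Reader`, `ListSplit`, `ListFormat`) is a simplified, hypothetical stand-in that only mirrors the names used in the issue, and the checkpointed position is reduced to a plain record index. It shows (1) the `SplitT` bound loosened to `SourceSplit`, and (2) a single `createReader` that performs the seek itself when the split carries a checkpointed position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Illustration only: all nested types are hypothetical stand-ins, not Flink classes. */
public class BulkFormatSketch {

    /** Stand-in for a generic source split: just an id. */
    public interface SourceSplit {
        String splitId();
    }

    /** Proposed base type: a split that may carry a checkpointed position. */
    public interface FileSourceSplitBase extends SourceSplit {
        // Assumption: position simplified to "index of the next record to read".
        Optional<Long> getCheckpointedPosition();
    }

    /** Minimal reader; next() returns null when the split is exhausted. */
    public interface Reader<T> {
        T next();
    }

    /** The bound is SourceSplit (not a file-specific split), with one factory method. */
    public interface BulkFormat<T, SplitT extends SourceSplit> {
        Reader<T> createReader(SplitT split);
    }

    /** Toy split in the role of a non-file split such as an Iceberg split. */
    public static final class ListSplit implements FileSourceSplitBase {
        private final String id;
        private final List<String> records;
        private final Long position; // null means a fresh, never-checkpointed split

        public ListSplit(String id, List<String> records, Long position) {
            this.id = id;
            this.records = records;
            this.position = position;
        }

        @Override
        public String splitId() {
            return id;
        }

        @Override
        public Optional<Long> getCheckpointedPosition() {
            return Optional.ofNullable(position);
        }
    }

    /** Format whose createReader covers both the "create" and "restore" cases. */
    public static final class ListFormat implements BulkFormat<String, ListSplit> {
        @Override
        public Reader<String> createReader(final ListSplit split) {
            // The seek happens here: start at the checkpointed index, else at 0.
            final int start = split.getCheckpointedPosition().map(Long::intValue).orElse(0);
            return new Reader<String>() {
                private int index = start;

                @Override
                public String next() {
                    return index < split.records.size() ? split.records.get(index++) : null;
                }
            };
        }
    }

    /** Drains a split through the single createReader entry point. */
    public static List<String> readAll(ListSplit split) {
        Reader<String> reader = new ListFormat().createReader(split);
        List<String> out = new ArrayList<>();
        for (String r = reader.next(); r != null; r = reader.next()) {
            out.add(r);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c", "d");
        System.out.println(readAll(new ListSplit("s1", data, null))); // fresh split
        System.out.println(readAll(new ListSplit("s1", data, 2L)));   // restored split
    }
}
```

In this sketch a fresh split is read from the first record and a restored split resumes from its recorded index, both through the same `createReader` call. Whether a real format can always fold the seek into `createReader` this cheaply is exactly the open question raised in the description.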