[ https://issues.apache.org/jira/browse/FLINK-20174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233260#comment-17233260 ]
Steven Zhen Wu commented on FLINK-20174:
----------------------------------------

[~lzljs3620320] thanks a lot for sharing your thoughts.

Regarding the hostname: it would be information derived from the path of the Iceberg DataFile. How to extract the hostname depends on the file system. I imagine LocalityAwareSplitAssigner would probably need to take a HostnameExtractor function to extract the hostname from an IcebergSourceSplit. I am wondering whether hostname should be a constructor arg for IcebergSourceSplit.

Regarding the fine-grained split, here are my concerns about splitting a CombinedScanTask into fine-grained FileScanTasks.

* In the first attempt at the PoC, I tried to flatMap a CombinedScanTask into CombinedScanTasks (each with a single FileScanTask). If I remember correctly, DeleteFilter doesn't work in this case. DataIterator creates the InputFile map for the whole CombinedScanTask. Maybe there is a valid reason for that. I can double check that with a unit test.
* One of the main reasons for having CombinedScanTask is to combine small files/splits into a decent size. Because readers pull one split at a time, avoiding small splits is good for throughput.

> Make BulkFormat more extensible
> -------------------------------
>
>                 Key: FLINK-20174
>                 URL: https://issues.apache.org/jira/browse/FLINK-20174
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>    Affects Versions: 1.12.0
>            Reporter: Steven Zhen Wu
>            Priority: Major
>
> Right now, BulkFormat has the generic `SplitT` type extending from
> `FileSourceSplit`. We can make BulkFormat take a generic `SplitT` type
> extending from `SourceSplit`. This way, IcebergSourceSplit doesn't have to
> extend from `FileSourceSplit` and the Iceberg source can reuse this BulkFormat
> interface as [~lzljs3620320] suggested. This allows the Iceberg source to take
> advantage of the high-performance `ParquetVectorizedInputFormat` provided by Flink.
> [~sewen] [~lzljs3620320] if you are on board with the change, I would be happy
> to submit a PR. Since it is a breaking change, maybe we can only add it to
> the master branch after the 1.12 release branch is cut?
>
> The other related question is about the two `createReader` and `restoreReader`
> APIs. I understand the motivation. I am just wondering if the separation is
> necessary. If the SplitT has the CheckpointedLocation, the seek operation can
> be handled internally by `createReader`. We can also define an abstract
> `FileSourceSplitBase` that adds a `getCheckpointedPosition` API to the
> `SourceSplit`.
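
To illustrate the last point in the description, here is a minimal sketch of a single `createReader` that handles the seek itself when the split carries a checkpointed position. It does not use Flink's actual interfaces: `SketchSourceSplit`, `SketchFileSourceSplitBase`, `SketchBulkFormat`, `SketchReader`, `ListSplit` and `ListFormat` are all hypothetical stand-ins, and the "position" is assumed to be a simple record index over an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for SourceSplit; only the id is modeled. */
interface SketchSourceSplit {
    String splitId();
}

/** Hypothetical base type: a split that may carry a checkpointed position. */
interface SketchFileSourceSplitBase extends SketchSourceSplit {
    Optional<Long> getCheckpointedPosition();
}

/** Minimal reader: next() returns null once the split is exhausted. */
interface SketchReader<T> {
    T next();
}

/** Single entry point; there is no separate restoreReader. */
interface SketchBulkFormat<T, SplitT extends SketchFileSourceSplitBase> {
    SketchReader<T> createReader(SplitT split);
}

/** Toy split over an in-memory list (assumption: position == record index). */
final class ListSplit implements SketchFileSourceSplitBase {
    private final String id;
    final List<String> records;
    private final Optional<Long> checkpointedPosition;

    ListSplit(String id, List<String> records, Optional<Long> checkpointedPosition) {
        this.id = id;
        this.records = records;
        this.checkpointedPosition = checkpointedPosition;
    }

    @Override
    public String splitId() {
        return id;
    }

    @Override
    public Optional<Long> getCheckpointedPosition() {
        return checkpointedPosition;
    }
}

final class ListFormat implements SketchBulkFormat<String, ListSplit> {
    @Override
    public SketchReader<String> createReader(ListSplit split) {
        // The seek happens here: start at the checkpointed position if present, else at 0.
        final int start = split.getCheckpointedPosition().orElse(0L).intValue();
        return new SketchReader<String>() {
            private int index = start;

            @Override
            public String next() {
                return index < split.records.size() ? split.records.get(index++) : null;
            }
        };
    }
}

final class CreateReaderSketch {
    /** Drains a reader created for the given split into a list. */
    static <T, S extends SketchFileSourceSplitBase> List<T> readAll(
            SketchBulkFormat<T, S> format, S split) {
        List<T> out = new ArrayList<>();
        SketchReader<T> reader = format.createReader(split);
        for (T record = reader.next(); record != null; record = reader.next()) {
            out.add(record);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c");
        ListFormat format = new ListFormat();
        // Fresh split: prints [a, b, c]
        System.out.println(readAll(format, new ListSplit("s", data, Optional.empty())));
        // Split restored from a checkpoint at index 2: prints [c]
        System.out.println(readAll(format, new ListSplit("s", data, Optional.of(2L))));
    }
}
```

With this shape the caller never needs to know whether a split is fresh or restored. The cost is that every split type has to expose the position, which is what the `FileSourceSplitBase` suggestion would formalize.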