[ 
https://issues.apache.org/jira/browse/FLINK-20174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233260#comment-17233260
 ] 

Steven Zhen Wu commented on FLINK-20174:
----------------------------------------

[~lzljs3620320] thanks a lot for sharing your thoughts.

Regarding the hostname, it would be derived from the path of the 
Iceberg DataFile. How to extract the hostname depends on the file system. I 
would imagine LocalityAwareSplitAssigner probably needs to take a 
HostnameExtractor function to extract the hostname from IcebergSourceSplit. I am 
wondering if the hostname should be a constructor arg for IcebergSourceSplit.
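
To make the HostnameExtractor idea concrete, here is a rough sketch. 
Everything in it is hypothetical: SplitStub stands in for 
IcebergSourceSplit, the HostnameExtractor interface shape is only a guess, 
and taking the URI authority as the hostname is an assumption that may hold 
for some file systems and not for others.

```java
import java.net.URI;
import java.util.Optional;
import java.util.function.Function;

final class HostnameExtractorSketch {

    /** Hypothetical stand-in for IcebergSourceSplit; it only carries a data file path. */
    static final class SplitStub {
        private final String dataFilePath;

        SplitStub(String dataFilePath) {
            this.dataFilePath = dataFilePath;
        }

        String dataFilePath() {
            return dataFilePath;
        }
    }

    /** Hypothetical extractor type: maps a split to a hostname when one can be derived. */
    interface HostnameExtractor<SplitT> extends Function<SplitT, Optional<String>> {}

    /** Assumption: the hostname is the host part of the file URI's authority. */
    static final HostnameExtractor<SplitStub> URI_HOST = split -> {
        try {
            // getHost() returns null when the URI has no host, e.g. file:///tmp/x
            return Optional.ofNullable(URI.create(split.dataFilePath()).getHost());
        } catch (IllegalArgumentException e) {
            // The path is not a syntactically valid URI, so no hostname is known.
            return Optional.empty();
        }
    };

    public static void main(String[] args) {
        System.out.println(URI_HOST.apply(
            new SplitStub("hdfs://namenode1:8020/warehouse/t/data/f.parquet")));
        System.out.println(URI_HOST.apply(new SplitStub("file:///tmp/f.parquet")));
    }
}
```

Passing the extractor to the assigner, rather than passing a hostname to 
the split constructor, would keep the file-system-specific logic out of the 
split itself.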

Regarding the fine-grained split, here are my concerns about splitting 
CombinedScanTask into fine-grained FileScanTasks.
* In the first try of the PoC, I tried a flatMap of CombinedScanTask into 
CombinedScanTasks (each with a single FileScanTask). If I remember correctly, 
DeleteFilter doesn't work in this case. DataIterator creates the InputFile map 
for the whole CombinedScanTask. Maybe there is a valid reason for that. I can 
double check on that with a unit test.
* One of the main reasons for having CombinedScanTask is to combine small 
files/splits into a decent size. Because readers pull one split at a time, 
avoiding small splits is good for throughput. 
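
For reference, the flatMap approach from the first bullet looks roughly like 
the sketch below. FileTask and CombinedTask are hypothetical stand-ins for 
Iceberg's FileScanTask and CombinedScanTask, not the real classes, and the 
sketch only shows the regrouping. It does not reproduce the DeleteFilter 
behavior.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

final class SplitCombinedTaskSketch {

    /** Hypothetical stand-in for Iceberg's FileScanTask. */
    static final class FileTask {
        final String path;

        FileTask(String path) {
            this.path = path;
        }
    }

    /** Hypothetical stand-in for Iceberg's CombinedScanTask. */
    static final class CombinedTask {
        final List<FileTask> files;

        CombinedTask(List<FileTask> files) {
            this.files = files;
        }
    }

    /** Regroups every file task into its own single-file combined task. */
    static List<CombinedTask> toSingleFileTasks(List<CombinedTask> combined) {
        return combined.stream()
            .flatMap(task -> task.files.stream())
            .map(file -> new CombinedTask(Collections.singletonList(file)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<CombinedTask> planned = Arrays.asList(
            new CombinedTask(Arrays.asList(new FileTask("a"), new FileTask("b"))),
            new CombinedTask(Collections.singletonList(new FileTask("c"))));
        // Each resulting task wraps exactly one file, so small files stay small splits.
        System.out.println(toSingleFileTasks(planned).size());
    }
}
```

The size trade-off from the second bullet is visible here: two planned tasks 
become three single-file tasks, so the reader pulls more and smaller splits.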


> Make BulkFormat more extensible
> -------------------------------
>
>                 Key: FLINK-20174
>                 URL: https://issues.apache.org/jira/browse/FLINK-20174
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>    Affects Versions: 1.12.0
>            Reporter: Steven Zhen Wu
>            Priority: Major
>
> Right now, BulkFormat has the generic `SplitT` type extending from 
> `FileSourceSplit`. We can make BulkFormat take the generic `SplitT` type 
> extending from `SourceSplit`. This way, IcebergSourceSplit doesn't have to 
> extend from `FileSourceSplit` and Iceberg source can reuse this BulkFormat 
> interface as [~lzljs3620320] suggested. This allows Iceberg source to take 
> advantage of the high-performance `ParquetVectorizedInputFormat` provided by Flink.  
> [~sewen] [~lzljs3620320] if you are on board with the change, I would be happy 
> to submit a PR. Since it is a breaking change, maybe we can only add it to 
> master branch after 1.12 release branch is cut?
> The other related question is about the two `createReader` and `restoreReader` 
> APIs. I understand the motivation. I am just wondering if the separation is 
> necessary. If the SplitT has the CheckpointedLocation, the seek operation can 
> be handled internally in `createReader`. We can also define an abstract 
> `FileSourceSplitBase` that adds a `getCheckpointedPosition` API to the 
> `SourceSplit`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
