Hi,

Regarding CSV and AvroParquet stream formats not supporting splits, some hints may be available from [1]. Personally, I think the main consideration is how a row format can find a reasonable split point, and how many splits are appropriate when slicing a file into more than one. For ORC, Parquet, and other columnar formats, a file can be further split internally along its RowGroups, Pages, etc. Row formats carry no such metadata, so we may not be able to find a suitable basis for splitting.
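To make the split-point problem concrete, here is a minimal, self-contained Java sketch (this is not Flink API; the class and method names are invented for illustration). It shows the usual heuristic for splitting delimited row data at an arbitrary byte offset — advance to the next newline — and why that heuristic is unreliable for CSV, where a quoted field may itself contain a newline:

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustrative sketch only (not Flink code): aligning a byte-offset
 * split to the next record boundary in newline-delimited data.
 */
public class SplitAlignment {

    /**
     * Advance a candidate split offset to just past the next '\n',
     * so a reader starting there begins on what looks like a record
     * boundary. Returns data.length if no newline follows.
     */
    static int alignToNextRecord(byte[] data, int offset) {
        int i = offset;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return i < data.length ? i + 1 : data.length;
    }

    public static void main(String[] args) {
        // Simple case: every '\n' really is a record boundary.
        byte[] simple = "a,b\nc,d\ne,f\n".getBytes(StandardCharsets.UTF_8);
        // A split at byte 5 (inside "c,d") aligns to byte 8, the start of "e,f".
        System.out.println(alignToNextRecord(simple, 5)); // 8

        // CSV case: the quoted field "x\ny" contains a newline, so the
        // newline found by the scan is NOT a record boundary — the
        // aligned offset lands in the middle of the first record.
        byte[] quoted = "a,\"x\ny\"\nc,d\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(alignToNextRecord(quoted, 2)); // 5, mid-record
    }
}
```

Handling the quoted case correctly would require scanning the file from the beginning to track quote state, which defeats the purpose of an independent split — whereas a Parquet footer lists RowGroup byte ranges up front, giving exact split points for free.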
[1] https://github.com/apache/flink/blob/9546f8243a24e7b45582b6de6702f819f1d73f97/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/reader/StreamFormat.java#L57

Best,
Ron

Chirag Dewan via user <user@flink.apache.org> wrote on Thu, Aug 17, 2023, 12:00:
> Hi,
> I am trying to collect files from HDFS in my DataStream job. I need to
> collect two types of files - CSV and Parquet.
>
> I understand that Flink supports both formats, but in streaming mode,
> Flink doesn't support splitting these formats. Splitting is only supported
> in the Table API.
>
> I wanted to understand the thought process around this: why is splitting
> not supported in the CSV and AvroParquet stream formats? As far as my
> understanding goes, splitting would work fine with HDFS blocks, and
> multiple blocks could be read in parallel.
>
> Maybe I am missing some fundamental aspect here.
>
> I would like to understand more if someone can point me in the right
> direction.
> Thanks