Hi,

Regarding CSV and AvroParquet stream formats not supporting splits, some
hints may be available in [1]. Personally, I think the main consideration
is how a row format can find a reasonable split point, and how many splits
are appropriate when slicing a file into more than one. Orc, Parquet, and
other columnar formats can be split further within a file according to
their RowGroup, Page, etc. Row formats, however, carry no such metadata,
so we may not be able to find a suitable basis for splitting.
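To make the split-point problem concrete, here is a minimal, hedged sketch
(not Flink code; the class and method names are made up for illustration) of
the classic trick used for newline-delimited row data: a reader assigned an
arbitrary byte range must first align itself to the next record boundary,
which is only possible because '\n' acts as a synchronization marker. Binary
row formats without such a marker have no equivalent basis for splitting.

```java
// Illustrative sketch only: aligning a byte-range split to a record
// boundary for a newline-delimited row format (e.g. CSV without quoted
// newlines). Names are hypothetical, not from Flink.
public class SplitAlignment {

    /**
     * Returns the offset of the first complete record at or after
     * splitStart. The first split (offset 0) is aligned by definition;
     * any later split skips the partial record it may have landed in,
     * which the previous split's reader is assumed to finish.
     */
    public static int alignToRecord(byte[] data, int splitStart) {
        if (splitStart == 0) {
            return 0; // first split always starts on a record boundary
        }
        if (data[splitStart - 1] == '\n') {
            return splitStart; // already aligned to a record start
        }
        for (int i = splitStart; i < data.length; i++) {
            if (data[i] == '\n') {
                return i + 1; // first byte after the delimiter
            }
        }
        return data.length; // no delimiter: tail belongs to previous split
    }

    public static void main(String[] args) {
        byte[] data = "a,1\nb,2\nc,3\n".getBytes();
        // A split starting mid-record (offset 5, inside "b,2") must
        // skip forward to the next record, "c,3", at offset 8.
        System.out.println(alignToRecord(data, 5));
        // A split starting exactly on a record boundary stays put.
        System.out.println(alignToRecord(data, 4));
    }
}
```

Note this only works when the record delimiter cannot appear inside a
field (quoted newlines in CSV already break it), which is one reason a
general row format cannot assume byte-range splits are safe.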


[1]
https://github.com/apache/flink/blob/9546f8243a24e7b45582b6de6702f819f1d73f97/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/reader/StreamFormat.java#L57

Best,
Ron

Chirag Dewan via user <user@flink.apache.org> 于2023年8月17日周四 12:00写道:

> Hi,
> I am trying to collect files from HDFS in my DataStream job. I need to
> collect two types of files - CSV and Parquet.
>
> I understand that Flink supports both formats, but in Streaming mode,
> Flink doesn't support splitting these formats. Splitting is only supported
> in the Table API.
>
> I wanted to understand the thought process around this and why splitting
> is not supported in CSV and AvroParquet Stream formats? As far as my
> understanding goes, splitting would work fine with HDFS blocks and multiple
> blocks can be read in parallel.
>
> Maybe I am missing some fundamental aspect about this.
>
> Would like to understand more if someone can point me in the right
> direction.
> Thanks
>
>
