
Regarding the CSV and AvroParquet stream formats not supporting splits, I
think some hints may be available from [1]. Personally, I think the main
consideration is how a row format can find a reasonable split point, and
how many splits are appropriate when slicing a file into more than one
piece. For ORC, Parquet, and other columnar formats, a file is already
subdivided internally by RowGroup, Page, etc., which gives natural split
boundaries. Row formats carry no such information, so there may be no
suitable basis for splitting.
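
To illustrate the split-point problem, here is a minimal sketch of how a
reader could consume one byte range of a newline-delimited file. It is not
Flink code: read_split and the sample data are hypothetical names made up
for this example, and it assumes that every '\n' ends a record.

```python
import io


def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    """Return the records whose first byte lies in the range [start, end).

    Hypothetical helper, not a Flink API. Assumes every b"\n" terminates
    a record, which is exactly the assumption a row format cannot always
    guarantee.
    """
    stream = io.BytesIO(data)
    if start > 0:
        # Step back one byte and discard up to the next newline. If the
        # byte before `start` is a newline, nothing useful is skipped;
        # otherwise the partial record belongs to the previous split.
        stream.seek(start - 1)
        stream.readline()
    records = []
    # A record is owned by the split in which it starts, so the last
    # record may run past `end`.
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break
        records.append(line.rstrip(b"\n"))
    return records


# Three arbitrary 10-byte ranges over a 30-byte file.
data = b"id,name\n1,alice\n2,bob\n3,carol\n"
for start, end in [(0, 10), (10, 20), (20, 30)]:
    print(start, end, read_split(data, start, end))
```

With plain lines, every record is read exactly once even though the range
boundaries fall in the middle of records. The scheme breaks as soon as a
quoted CSV field contains a newline: a reader starting mid-file cannot
tell whether a given '\n' ends a record or sits inside quotes, so it may
resume in the middle of a record.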



Chirag Dewan via user <user@flink.apache.org> wrote on Thu, Aug 17, 2023 at 12:00:

> Hi,
> I am trying to collect files from HDFS in my DataStream job. I need to
> collect two types of files - CSV and Parquet.
> I understand that Flink supports both formats, but in Streaming mode,
> Flink doesn't support splitting these formats. Splitting is only supported
> in the Table API.
> I wanted to understand the thought process around this and why splitting
> is not supported in CSV and AvroParquet Stream formats? As far as my
> understanding goes, splitting would work fine with HDFS blocks and multiple
> blocks can be read in parallel.
> Maybe I am missing some fundamental aspect about this.
> Would like to understand more if someone can point me in the right
> direction.
> Thanks
