Re: FileSource with Parquet Format - parallelism level

2021-12-15 Thread Arvid Heise
Hi Krzysztof, yes, you are correct if you use the new FileSource: "Please note that file blocks are only exposed by some file systems, such as HDFS. File systems that do not expose block information will not create multiple file splits per file, but keep the files as one source split." For o
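The behaviour quoted in this message (one split per block when the file system exposes blocks, otherwise one split per file) can be illustrated with a small conceptual model. This is only a sketch and not Flink's implementation: the `Split` class, the `enumerate_splits` function, and the single fixed `block_size` parameter are all hypothetical names and simplifications introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Split:
    """Hypothetical stand-in for a source split: a byte range of one file."""
    path: str
    offset: int
    length: int


def enumerate_splits(path: str, file_length: int,
                     block_size: Optional[int]) -> List[Split]:
    """Model of block-based split enumeration (assumption, not Flink code).

    block_size=None models a file system that exposes no block information;
    in that case the whole file is kept as a single split.
    """
    if block_size is None or block_size <= 0 or file_length <= block_size:
        return [Split(path, 0, file_length)]

    splits = []
    offset = 0
    while offset < file_length:
        # The last split may be shorter than a full block.
        length = min(block_size, file_length - offset)
        splits.append(Split(path, offset, length))
        offset += length
    return splits


if __name__ == "__main__":
    # File system with block information: several splits, readable in parallel.
    print(len(enumerate_splits("data.parquet", 300, 128)))
    # File system without block information: one split for the whole file.
    print(len(enumerate_splits("data.parquet", 300, None)))
```

In this model, the number of splits is an upper bound on how many readers can work on a single file at once, which is why the effective parallelism depends on the file system.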

Re: FileSource with Parquet Format - parallelism level

2021-12-14 Thread Krzysztof Chmielewski
Hi Arvid, thank you for your response. I did a bit more digging and noticed one thing; please correct me if I'm wrong. Whether the Parquet file will be read in parallel in fact depends on the underlying file system. If the file system supports file blocks then we will have spli

Re: FileSource with Parquet Format - parallelism level

2021-12-10 Thread Arvid Heise
Yes, Parquet files can be read in splits (=in parallel). Which enumerator is used is determined here [1]. [1] https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetVectorizedInputFormat.java#L170-L170 On Fri, Dec 10, 2021 at

Re: FileSource with Parquet Format - parallelism level

2021-12-10 Thread Krzysztof Chmielewski
Hi Roman, thank you. I'm familiar with FLIP-27 and I was analyzing the new File Source. From there I saw that there are two FileEnumerators: one that allows file splitting and another that does not, BlockSplittingRecursiveEnumerator and NonSplittingRecursiveEnumerator. I was wondering if BlockS

Re: FileSource with Parquet Format - parallelism level

2021-12-10 Thread Roman Khachatryan
Hi, Yes, file source does support DoP > 1. And in general, a single file can be read in parallel after FLIP-27. However, parallel reading of a single Parquet file is currently not supported AFAIK. Maybe Arvid or Fabian could shed more light here. Regards, Roman On Thu, Dec 9, 2021 at 12:03 PM K