Hi Krzysztof, yes you are correct if you use the new FileSource:
    Please note that file blocks are only exposed by some file systems, such as HDFS. File systems that do not expose block information will not create multiple file splits per file, but keep the files as one source split.

For the old API [1], most filesystems actually faked some block size (S3 was 32 MB by default, but it could be overridden with fs.local.block.size). @Stephan Ewen <se...@apache.org>, do you think it would make sense to add such a default block size to FileSource as a user option?

The S3 filesystem is in the distribution, please refer to [2]. I recommend the Hadoop S3 filesystem for your use case.

[1] https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/formats/parquet/
[2] https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/filesystems/overview/

On Tue, Dec 14, 2021 at 11:51 PM Krzysztof Chmielewski <krzysiek.chmielew...@gmail.com> wrote:

> Hi Arvid,
> thank you for your response.
>
> I did a little more digging and analyzing, and I noticed one thing.
> Please correct me if I'm wrong.
>
> Whether the Parquet file will be read in parallel in fact depends on the
> underlying file system. If the file system exposes file blocks, then we
> will have splits for individual blocks, and the Parquet file will be read
> in parallel.
>
> https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/BlockSplittingRecursiveEnumerator.java#L141
>
> An example is HDFS, but for the local file system we will have only one
> block, hence each file will be read by only one thread.
>
> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/fs/local/LocalFileStatus.java#L96
>
> Am I correct?
>
> Also, one additional question:
> Can the Flink File Source read files from AWS S3?
> Regards,
> Krzysztof Chmielewski
>
> On Fri, Dec 10, 2021 at 3:29 PM Arvid Heise <ar...@apache.org> wrote:
>
>> Yes, Parquet files can be read in splits (= in parallel). Which enumerator
>> is used is determined here [1].
>>
>> [1] https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetVectorizedInputFormat.java#L170-L170
>>
>> On Fri, Dec 10, 2021 at 11:44 AM Krzysztof Chmielewski <krzysiek.chmielew...@gmail.com> wrote:
>>
>>> Hi Roman,
>>> Thank you.
>>>
>>> I'm familiar with FLIP-27 and I was analyzing the new File Source.
>>>
>>> From there I saw that there are two FileEnumerators: one that allows
>>> file splitting and one that does not, BlockSplittingRecursiveEnumerator
>>> and NonSplittingRecursiveEnumerator. I was wondering if
>>> BlockSplittingRecursiveEnumerator can be used for a Parquet file.
>>>
>>> Also, does the Parquet format support reading a file in blocks by
>>> different threads? Do those blocks have to be "merged" later, or can I
>>> just read them row by row?
>>>
>>> Regards,
>>> Krzysztof Chmielewski
>>>
>>> On Fri, Dec 10, 2021 at 9:27 AM Roman Khachatryan <ro...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> Yes, the file source does support DoP > 1.
>>>> And in general, a single file can be read in parallel after FLIP-27.
>>>> However, parallel reading of a single Parquet file is currently not
>>>> supported AFAIK.
>>>>
>>>> Maybe Arvid or Fabian could shed more light here.
>>>>
>>>> Regards,
>>>> Roman
>>>>
>>>> On Thu, Dec 9, 2021 at 12:03 PM Krzysztof Chmielewski
>>>> <krzysiek.chmielew...@gmail.com> wrote:
>>>> >
>>>> > Hi,
>>>> > can I have a File DataStream Source that will work with the Parquet
>>>> > format and have a parallelism level higher than one?
>>>> >
>>>> > Is it possible to read a Parquet file in chunks by multiple threads?
>>>> >
>>>> > Regards,
>>>> > Krzysztof Chmielewski
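P.S. The split-per-block behavior discussed in the thread can be illustrated with a small standalone sketch. This is not Flink code; the class and method names (SplitSketch, splitsForBlocks) are hypothetical, and it only mimics the idea behind BlockSplittingRecursiveEnumerator: one source split per block that the file system reports. A file system that reports a single block spanning the whole file (as the local file system does) therefore yields one split and no read parallelism for that file.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models how a block-splitting enumerator could turn a
// file's reported block list into source splits. Real Flink enumerators do
// more (host awareness, recursion into directories), but the core mapping
// is one split per exposed block.
class SplitSketch {

    // Each long[] is {offset, length} of one block reported by the file system.
    static List<long[]> splitsForBlocks(List<long[]> blocks) {
        List<long[]> splits = new ArrayList<>();
        for (long[] block : blocks) {
            splits.add(new long[] {block[0], block[1]}); // one split per block
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileLen = 384L * 1024 * 1024;   // a 384 MB file
        long blockSize = 128L * 1024 * 1024; // HDFS-style 128 MB blocks

        // HDFS-like: the file system exposes three blocks -> three splits,
        // so three readers can consume the file in parallel.
        List<long[]> hdfsBlocks = new ArrayList<>();
        for (long off = 0; off < fileLen; off += blockSize) {
            hdfsBlocks.add(new long[] {off, Math.min(blockSize, fileLen - off)});
        }
        System.out.println(splitsForBlocks(hdfsBlocks).size()); // 3

        // Local-FS-like: one block covering the whole file -> one split,
        // hence a single reader per file.
        List<long[]> localBlocks = new ArrayList<>();
        localBlocks.add(new long[] {0, fileLen});
        System.out.println(splitsForBlocks(localBlocks).size()); // 1
    }
}
```

This matches what Krzysztof observed above: the degree of parallelism per file is bounded by the number of blocks the file system exposes, not by the source's configured parallelism.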