Hi Arvid,
Thank you for your response.

I did a bit more digging and noticed one thing.
Please correct me if I'm wrong.

Whether a Parquet file is read in parallel in fact depends on the
underlying file system.
If the file system supports file blocks, then we get one split per
block and the Parquet file is read in parallel.
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/BlockSplittingRecursiveEnumerator.java#L141

HDFS is one example. For the local file system, however, there is only one
block per file, hence each file is read by a single thread.
https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/fs/local/LocalFileStatus.java#L96
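
To illustrate my understanding, here is a minimal sketch in plain Java. It
is not Flink's actual code: the class BlockSplitSketch, the method
enumerateSplits and the 300 MB / 128 MB sizes are hypothetical. It only
models the idea that a file yields one split per file-system block, and a
single split when the file system reports the whole file as one block.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplitSketch {

    /** A byte range [offset, offset + length) that one reader thread could process. */
    public static final class Split {
        public final long offset;
        public final long length;

        public Split(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Simplified model of block-based splitting (an assumption, not Flink's API):
     * one split per block, or a single split if the block covers the whole file.
     */
    public static List<Split> enumerateSplits(long fileLength, long blockSize) {
        List<Split> splits = new ArrayList<>();
        if (blockSize <= 0 || blockSize >= fileLength) {
            // Only one block is reported: the file cannot be split.
            splits.add(new Split(0, fileLength));
            return splits;
        }
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last split may be shorter than a full block.
            splits.add(new Split(offset, Math.min(blockSize, fileLength - offset)));
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 300 * mb;

        // HDFS-like case: assumed 128 MB blocks, so several splits.
        System.out.println("block-based splits: " + enumerateSplits(fileLength, 128 * mb).size());

        // Local-file-like case: block size equals file length, so one split.
        System.out.println("local splits: " + enumerateSplits(fileLength, fileLength).size());
    }
}
```

With these assumed sizes, the first call yields three splits (128 MB,
128 MB and 44 MB) and the second yields one.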

Am I correct?

Also one additional question.
Can Flink File Source read files from AWS S3?

Regards,
Krzysztof Chmielewski



Fri, 10 Dec 2021 at 15:29 Arvid Heise <ar...@apache.org> wrote:

> Yes, Parquet files can be read in splits (=in parallel). Which enumerator
> is used is determined here [1].
>
> [1]
> https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetVectorizedInputFormat.java#L170-L170
>
> On Fri, Dec 10, 2021 at 11:44 AM Krzysztof Chmielewski <
> krzysiek.chmielew...@gmail.com> wrote:
>
>> Hi Roman,
>> Thank you.
>>
>> I'm familiar with FLIP-27 and I was analyzing the new File Source.
>>
>> From there I saw that there are two FileEnumerators: one that allows
>> file splitting and one that does not, namely
>> BlockSplittingRecursiveEnumerator and NonSplittingRecursiveEnumerator.
>> I was wondering if BlockSplittingRecursiveEnumerator can be used for
>> Parquet files.
>>
>> Also, does the Parquet format support reading a file in blocks by different
>> threads? Do those blocks have to be "merged" later, or can I just read them
>> row by row?
>>
>> Regards,
>> Krzysztof Chmielewski
>>
>> Fri, 10 Dec 2021 at 09:27 Roman Khachatryan <ro...@apache.org> wrote:
>>
>>> Hi,
>>>
>>> Yes, file source does support DoP > 1.
>>> And in general, a single file can be read in parallel after FLIP-27.
>>> However, parallel reading of a single Parquet file is currently not
>>> supported AFAIK.
>>>
>>> Maybe Arvid or Fabian could shed more light here.
>>>
>>> Regards,
>>> Roman
>>>
>>> On Thu, Dec 9, 2021 at 12:03 PM Krzysztof Chmielewski
>>> <krzysiek.chmielew...@gmail.com> wrote:
>>> >
>>> > Hi,
>>> > can I have a File DataStream Source that will work with Parquet Format
>>> and have parallelism level higher than one?
>>> >
>>> > Is it possible to read a Parquet file in chunks by multiple threads?
>>> >
>>> > Regards,
>>> > Krzysztof Chmielewski
>>>
>>
