Hi Krzysztof,

Yes, you are correct if you use the new FileSource. From the Javadoc:

* <p>Please note that file blocks are only exposed by some file systems, such as HDFS.
* File systems that do not expose block information will not create multiple file splits
* per file, but keep the files as one source split.


For the old API [1], most filesystems actually faked a block size (S3 was
32 MB by default, but it could be overridden with fs.local.block.size).

@Stephan Ewen <se...@apache.org>, do you think it would make sense to add
such a default block size to FileSource as a user option?

The S3 filesystem is included in the distribution; please refer to [2]. I recommend
the Hadoop S3 filesystem (flink-s3-fs-hadoop) for your use case.
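
For reference, putting this together with the new source could look roughly like the
sketch below. It is based on the 1.13/1.14-era API (constructor signatures have changed
in later releases), and the bucket, path, and schema are made-up examples:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.FileSourceSplit;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.ParquetColumnarRowInputFormat;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.IntType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.table.types.logical.VarCharType;
import org.apache.hadoop.conf.Configuration;

public class ParquetFromS3Example {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Projected schema of the Parquet files (example columns).
        RowType rowType =
                RowType.of(
                        new LogicalType[] {new IntType(), new VarCharType(VarCharType.MAX_LENGTH)},
                        new String[] {"id", "name"});

        ParquetColumnarRowInputFormat<FileSourceSplit> format =
                new ParquetColumnarRowInputFormat<>(
                        new Configuration(), // Hadoop configuration
                        rowType,
                        500,    // batch size
                        false,  // isUtcTimestamp
                        true);  // isCaseSensitive

        // An s3:// path works once the flink-s3-fs-hadoop plugin is enabled.
        FileSource<RowData> source =
                FileSource.forBulkFileFormat(format, new Path("s3://my-bucket/events/"))
                        .build();

        DataStream<RowData> stream =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "parquet-s3-source");

        stream.print();
        env.execute("parquet-from-s3");
    }
}

Note that even with the block-splitting enumerator, multiple splits per file are only
created if the file system reports block locations, as described in the Javadoc above.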

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/formats/parquet/
[2]
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/filesystems/overview/

On Tue, Dec 14, 2021 at 11:51 PM Krzysztof Chmielewski <
krzysiek.chmielew...@gmail.com> wrote:

> Hi Arvid,
> thank you for your response.
>
> I did a little bit more digging and analyzing, and I noticed one thing.
> Please correct me if I'm wrong.
>
> Whether the Parquet file will be read in parallel in fact depends on the
> underlying file system.
> If the file system exposes file blocks, then we will have splits for
> individual blocks and the Parquet file will be read in parallel.
>
> https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/enumerate/BlockSplittingRecursiveEnumerator.java#L141
>
> An example is HDFS, but for the local file system we will have only one block,
> hence each file will be read by only one thread.
>
> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/fs/local/LocalFileStatus.java#L96
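>
> For example, one can check what block information a file system reports with
> Flink's FileSystem API (a quick sketch; the path is made up):
>
> import org.apache.flink.core.fs.BlockLocation;
> import org.apache.flink.core.fs.FileStatus;
> import org.apache.flink.core.fs.FileSystem;
> import org.apache.flink.core.fs.Path;
>
> public class BlockInfoCheck {
>     public static void main(String[] args) throws Exception {
>         Path path = new Path("file:///tmp/data/part-0.parquet");
>         FileSystem fs = path.getFileSystem();
>         FileStatus status = fs.getFileStatus(path);
>
>         // The local file system reports a single block spanning the whole file,
>         // so the block-splitting enumerator creates only one split per file.
>         BlockLocation[] blocks =
>                 fs.getFileBlockLocations(status, 0, status.getLen());
>         System.out.println("blocks reported: " + blocks.length);
>     }
> }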
>
> Am I correct?
>
> Also, one additional question.
> Can the Flink File Source read files from AWS S3?
>
> Regards,
> Krzysztof Chmielewski
>
>
>
> pt., 10 gru 2021 o 15:29 Arvid Heise <ar...@apache.org> napisał(a):
>
>> Yes, Parquet files can be read in splits (=in parallel). Which enumerator
>> is used is determined here [1].
>>
>> [1]
>> https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/ParquetVectorizedInputFormat.java#L170-L170
>>
>> On Fri, Dec 10, 2021 at 11:44 AM Krzysztof Chmielewski <
>> krzysiek.chmielew...@gmail.com> wrote:
>>
>>> Hi Roman,
>>> Thank you.
>>>
>>> I'm familiar with FLIP-27 and I was analyzing the new File Source.
>>>
>>> From there I saw that there are two FileEnumerators: one that allows
>>> for file splits and one that does not, BlockSplittingRecursiveEnumerator
>>> and NonSplittingRecursiveEnumerator.
>>> I was wondering if BlockSplittingRecursiveEnumerator can be used for
>>> Parquet files.
>>>
>>> Actually, does the Parquet format support reading a file in blocks by
>>> different threads? Do those blocks have to be "merged" later, or can I just
>>> read them row by row?
>>>
>>> Regards,
>>> Krzysztof Chmielewski
>>>
>>> pt., 10 gru 2021 o 09:27 Roman Khachatryan <ro...@apache.org>
>>> napisał(a):
>>>
>>>> Hi,
>>>>
>>>> Yes, the file source does support a DoP (degree of parallelism) > 1.
>>>> And in general, a single file can be read in parallel after FLIP-27.
>>>> However, parallel reading of a single Parquet file is currently not
>>>> supported AFAIK.
>>>>
>>>> Maybe Arvid or Fabian could shed more light here.
>>>>
>>>> Regards,
>>>> Roman
>>>>
>>>> On Thu, Dec 9, 2021 at 12:03 PM Krzysztof Chmielewski
>>>> <krzysiek.chmielew...@gmail.com> wrote:
>>>> >
>>>> > Hi,
>>>> > can I have a File DataStream Source that will work with Parquet
>>>> Format and have parallelism level higher than one?
>>>> >
>>>> > Is it possible to read  Parquet  file in chunks by multiple threads?
>>>> >
>>>> > Regards,
>>>> > Krzysztof Chmielewski
>>>>
>>>
