Re: Reading FileSource Files in a particular order

Caizhi Weng Mon, 14 Mar 2022 20:04:20 -0700

Hi!

Are you running a batch job or a streaming job? For batch jobs, just use the
ORDER BY keyword in SQL to sort the records. For streaming jobs, I'm afraid
it is hard to do so. A custom FileEnumerator might help. However, if the
parallelism of your file source is greater than one, the parallel reader
subtasks may read files at different speeds, so the output of the file
source can end up out of order again.
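
To illustrate the ordering step such a custom FileEnumerator would need, here
is a minimal sketch in plain Java. It does not use the Flink API: the class
SplitOrdering, the method sortByPath, and the gs://archive/... paths are all
hypothetical names made up for this example. It also assumes the folder names
are zero-padded, so that lexicographic order matches chronological order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Flink: orders file paths so that
// time-partitioned folders come out oldest first.
final class SplitOrdering {

    // Returns a new list sorted lexicographically by full path.
    // Assumption: folder names are zero-padded (e.g. dt=2022-03-14/hr=09),
    // so lexicographic order equals chronological order.
    static List<String> sortByPath(List<String> paths) {
        List<String> sorted = new ArrayList<>(paths);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical archive layout, listed in arbitrary order.
        List<String> paths = Arrays.asList(
                "gs://archive/dt=2022-03-14/hr=10/part-0",
                "gs://archive/dt=2022-03-13/hr=23/part-0",
                "gs://archive/dt=2022-03-14/hr=09/part-0");
        for (String p : sortByPath(paths)) {
            System.out.println(p);
        }
    }
}
```

In an actual enumerator the same comparison would presumably be applied to
the paths of the enumerated splits rather than to strings. As noted above,
this only fixes the order in which splits are produced; with a parallelism
greater than one, the readers can still emit records out of order.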


Kevin Lam <kevin....@shopify.com> wrote on Mon, Mar 14, 2022 at 22:40:

> Hi all,
>
> We're interested in being able to use a FileSource
> <https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/org/apache/flink/connector/file/src/FileSource.html>
> to read from a Google Cloud Storage (GCS) archive of messages from a Kafka
> topic, roughly in order.
>
> Our GCS archive is partitioned into folders by time. However, when we read
> it using a FileSource, the messages are processed in a random order. We'd
> like to be able to control what order the files are read in, and take
> advantage of the clear ordering our GCS archive provides.
>
> What is the best way to achieve this? Would it be possible to write a
> custom FileEnumerator
> <https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/org/apache/flink/connector/file/src/enumerate/FileEnumerator.html>
> that sorts the directories and returns the splits in order?
>
> Any help would be greatly appreciated!
>
> Thanks,
> Kevin
>
