Hi! Are you running a batch job or a streaming job? For a batch job, just use an ORDER BY clause in SQL to sort the records. For a streaming job, I'm afraid this is hard to do. A custom FileEnumerator might help. However, if the parallelism of your file system source is greater than one, different parallel subtasks may read files at different speeds, so the output of the file source can end up out of order again.
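To make the FileEnumerator idea a bit more concrete, here is a minimal sketch of just the ordering step such an enumerator would need. It is plain Java and does not call any Flink API. The class name `SplitOrdering`, the helper `sortByPath`, and the `gs://archive/...` paths are all hypothetical, and it assumes the time-partitioned folder names are zero-padded so that lexicographic order matches chronological order. In a real custom enumerator the same comparison would be applied to the paths of the enumerated splits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SplitOrdering {

    /**
     * Returns a new list with the given paths sorted lexicographically.
     * The input list is left untouched.
     * Assumption: partition folders are zero-padded (e.g. hour=01, not hour=1),
     * so lexicographic order equals time order.
     */
    public static List<String> sortByPath(List<String> paths) {
        List<String> sorted = new ArrayList<>(paths);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical listing, in the arbitrary order a file system might return it.
        List<String> listed = List.of(
                "gs://archive/dt=2022-03-14/hour=13/part-0",
                "gs://archive/dt=2022-03-13/hour=23/part-0",
                "gs://archive/dt=2022-03-14/hour=01/part-0");

        // Print the paths in the order a sorting enumerator would hand them out.
        sortByPath(listed).forEach(System.out::println);
    }
}
```

Note that this only fixes the order in which files are listed. With a source parallelism greater than one, the files are still read concurrently, so records from later files can be emitted before records from earlier ones.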
On Mon, Mar 14, 2022 at 22:40, Kevin Lam <kevin....@shopify.com> wrote:

> Hi all,
>
> We're interested in being able to use a FileSource
> <https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/org/apache/flink/connector/file/src/FileSource.html>
> to read from a Google Cloud Storage (GCS) archive of messages from a Kafka
> topic, roughly in order.
>
> Our GCS archive is partitioned into folders by time, however, when we read
> it using a FileSource, the messages are processed in a random order. We'd
> like to be able to control what order the files are read in, and take
> advantage of the clear ordering our GCS archive provides.
>
> What is the best way to achieve this? Would it be possible to write a
> custom FileEnumerator
> <https://nightlies.apache.org/flink/flink-docs-release-1.14/api/java/org/apache/flink/connector/file/src/enumerate/FileEnumerator.html>
> that sorts the directories and returns the splits in order?
>
> Any help would be greatly appreciated!
>
> Thanks,
> Kevin