Hi devs,

While going through the various issues on metadata log growth, I've found
that it's a problem not only for the sink but also for the source.
Unlike the sink metadata log, whose entries must remain available to readers,
the source metadata log is only used by the streaming query restarting
from the checkpoint, hence in theory it only needs to remember the minimal
set of entries that prevents processing the same file multiple times.

This is not applied to the file stream source, and I think it's because of
the existence of the "latestFirst" option, which I haven't seen in any other
source. The option reads files in "backward" order, which means
Spark can read the oldest file and the latest file together in a micro-batch,
so it ends up having to remember all files previously read. The option can
be changed on query restart, so even if the query is started with
"latestFirst" set to false, it's not safe to minimize the entries to
remember, as the option can later be changed to true and then we'd
read files again.
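
To make the problem concrete, here is a rough sketch in plain Python. It is
not Spark's implementation: the SeenFiles class, its method names, and the
timestamps are hypothetical, and "enforce_cutoff=False" is my way of
modelling a read path where the age limit is not applied. It shows that
purging old entries is only safe while files older than the purge cutoff
are also excluded from reading.

```python
class SeenFiles:
    """Hypothetical tracker of processed files (path -> modification time)."""

    def __init__(self, max_age_ms):
        self.max_age_ms = max_age_ms
        self.entries = {}
        self.latest_ts = 0
        self.purge_cutoff = 0

    def add(self, path, ts):
        """Record a processed file and advance the latest timestamp seen."""
        self.entries[path] = ts
        self.latest_ts = max(self.latest_ts, ts)

    def purge(self):
        """Drop entries older than (latest - max age); return how many were dropped."""
        self.purge_cutoff = self.latest_ts - self.max_age_ms
        before = len(self.entries)
        self.entries = {p: t for p, t in self.entries.items()
                        if t >= self.purge_cutoff}
        return before - len(self.entries)

    def is_new(self, path, ts, enforce_cutoff=True):
        """A file is new if it is not remembered and (optionally) not too old."""
        if enforce_cutoff and ts < self.purge_cutoff:
            return False
        return path not in self.entries


seen = SeenFiles(max_age_ms=100)
seen.add("a", 10)
seen.add("b", 500)
seen.purge()                                        # forgets "a"
print(seen.is_new("a", 10))                         # False: below the cutoff
print(seen.is_new("a", 10, enforce_cutoff=False))   # True: "a" would be re-read
```

Once the cutoff is not enforced, the purged file "a" looks like a new file,
which is why the entries cannot be dropped today.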

I'm seeing two approaches here:

1) apply "retention" - unlike "maxFileAge", the option would apply to
latestFirst as well. That is, if the retention is set to 7 days,
files older than 7 days would never be read in any way. With this approach
we can at least get rid of entries which are older than the retention. The
issue is how to play nicely with the existing "maxFileAge", as it behaves
similarly to the retention, though it's ignored when latestFirst is
turned on. (Either change the semantics of "maxFileAge", or leave it as a
"soft retention" and introduce another option.)

(This approach is being proposed under SPARK-17604, and a PR is available -
https://github.com/apache/spark/pull/28422)
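
As a rough illustration of the retention idea (plain Python, not the code in
the PR; the function names eligible_files and compact_log and the data are
made up for this example), the same cutoff is applied both when choosing
files to read and when trimming the remembered entries, regardless of the
read order:

```python
DAY_MS = 24 * 60 * 60 * 1000


def eligible_files(listing, seen, now_ms, retention_ms, latest_first=False):
    """Files within retention and not yet processed, in the requested order."""
    cutoff = now_ms - retention_ms
    fresh = [(p, ts) for p, ts in listing if ts >= cutoff and p not in seen]
    return sorted(fresh, key=lambda f: f[1], reverse=latest_first)


def compact_log(seen, now_ms, retention_ms):
    """Keep only the entries that the retention cutoff could still admit."""
    cutoff = now_ms - retention_ms
    return {p: ts for p, ts in seen.items() if ts >= cutoff}


now = 30 * DAY_MS
listing = [("old", 20 * DAY_MS), ("mid", 25 * DAY_MS), ("new", 29 * DAY_MS)]
seen = {"ancient": 1 * DAY_MS, "mid": 25 * DAY_MS}

# With a 7-day retention, "old" is never read, even in backward order.
print(eligible_files(listing, seen, now, 7 * DAY_MS, latest_first=True))
# "ancient" can be dropped, since it can never be admitted again.
print(compact_log(seen, now, 7 * DAY_MS))
```

Because files older than the retention are never candidates, forgetting their
entries cannot cause them to be read again.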

2) replace the "latestFirst" option with alternatives which no longer read in
"backward" order - this doesn't mean we have to read all files to move
forward. As we do with Kafka, a start offset can be provided, ideally as a
timestamp, and Spark will read from that timestamp in forward order.
This doesn't cover all use cases of "latestFirst", but "latestFirst"
doesn't seem natural with the concepts of SS (think about watermarks), so
I'd prefer to support alternatives instead of struggling with "latestFirst".
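
A rough sketch of what forward-only reading from a start timestamp could look
like, in plain Python. Nothing here is an existing Spark option or API: the
next_batch function and its parameters are hypothetical, and the sketch
assumes files never show up later with a modification time older than what
has already been read. Under that assumption the state to remember is only
the current timestamp plus the paths already read at exactly that timestamp:

```python
def next_batch(listing, start_ts_ms, done_at_start, max_files):
    """Return (paths, new_start_ts, new_done_at_start) reading in forward order."""
    pending = sorted(
        (ts, p) for p, ts in listing
        if ts > start_ts_ms or (ts == start_ts_ms and p not in done_at_start)
    )
    batch = pending[:max_files]
    if not batch:
        return [], start_ts_ms, done_at_start
    new_start = batch[-1][0]
    # Remember only the files sharing the newest timestamp, to handle ties.
    new_done = {p for ts, p in batch if ts == new_start}
    if new_start == start_ts_ms:
        new_done |= done_at_start
    return [p for _, p in batch], new_start, new_done


listing = [("old", 50), ("a", 100), ("b", 200), ("c", 200), ("d", 300)]
state = (100, set())   # user-provided start timestamp, nothing read yet
while True:
    paths, ts, done = next_batch(listing, state[0], state[1], max_files=2)
    if not paths:
        break
    print(paths)
    state = (ts, done)
```

The file "old" is before the start timestamp and is never read, and no
per-file history older than the current timestamp has to be kept.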

I'd like to hear your opinions.

Thanks,
Jungtaek Lim (HeartSaVioR)
