I had a doubt regarding watermarking w.r.t streaming with FileSource. Will really appreciate it if somebody can explain this behavior.
Consider a filesystem with a root folder containing date wise sub folders such as *D1* , *D2*, … and so on. Each of these date folders further has 24 sub-folders inside corresponding to data generated for each hour, i:e, *hr=0, hr=1, hr=2*,… and so on. Each hourly folder has only one file inside it. So there will be 24 files total. The file structure will look something like this: - - *ROOT FOLDER* - - D1 - - hr=0 - - somefile.txt - - hr=1 - - somefile.txt - … (more hour folders) - - D2 - - hr=0 - - somefile.txt - - hr=1 - - somefile.txt - … (more hour folders) - …(more date folders) Let’s assume we are running a Flink job with a File Source and parallelism of 15. Let’s say, one task manager has one slot. Also, let’s assume we want one split per file (NonSplittingRecursiveEnumerator is used) Also, let’s assume a case where each task manager picks up one file split in the beginning. For the very sake of simplicity, let’s also assume the splits are generated and picked in chronological order. So files from *hr=0* till *hr=14* will be picked up respectively by the 15 task managers. Total 15 files are picked for processing. Now, even if all the task managers finish reading their files at the same or different time, the next file split, i:e *hr=15* can be picked up by any task manager. Unless otherwise the task manager which also processed *hr=14* file picks this *hr=15* file as well, rows from this file will always be dropped by other task managers unless the watermark interval is really huge, like 1 hour+. Am I thinking about this correctly ? Is the solution then to keep a really big watermark interval for a FileSource ? Or is there an idiomatic pattern to solve these kinds of problems with FileSource ? -- *Regards,* *Meghajit*