Hi devs,

I'm seeing more and more Structured Streaming end users encounter metadata issues with the file stream source and sink. These have been known issues, and there are even long-standing JIRA issues reported for them; end users reported them again on the user@ mailing list in April:
* Spark Structure Streaming | FileStreamSourceLog not deleting list of input files | Spark -2.4.0 [1]
* [Structured Streaming] Checkpoint file compact file grows big [2]

(A rough sketch of the query shape behind these reports is at the end of this mail.)

I've proposed various improvements in this area (see my PRs [3]), but they have suffered from a lack of interest and reviews. I feel the issue is critical (and under-estimated) because:

1. It's one of the "built-in" data sources maintained by the Spark community. (End users may judge the state of the project/area by the quality of its built-in data sources, because those are what they start with.)
2. It's the only built-in data source which provides "end-to-end exactly-once" in Structured Streaming.

I'd hope to see us address these issues so that end users can live with the built-in data source. (It may not need to be perfect, but it should at least be reasonable for long-running streaming workloads.)

I know there are a couple of alternatives, but I don't think a newcomer would start from there. End users may just go looking for alternatives - not an alternative data source, but an alternative stream processing framework.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
2. https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
3. https://github.com/apache/spark/pulls/HeartSaVioR
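
P.S. For anyone who hasn't hit these issues yet, here is a minimal sketch of the kind of long-running file-source-to-file-sink query where they show up. Paths, schema, and the trigger interval are placeholders I made up for illustration, not taken from the reports above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object FileToFileQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("file-source-to-file-sink")
      .getOrCreate()

    // File source: every discovered input file is recorded in the source log
    // kept under <checkpointLocation>/sources/0.
    val events = spark.readStream
      .schema("id LONG, payload STRING")   // placeholder schema; file sources need one up front
      .json("/data/in")                    // placeholder input directory

    // File sink: every written output file is recorded in the sink log under
    // <path>/_spark_metadata, which batch readers consult to see only
    // committed files (this is what gives end-to-end exactly-once).
    val query = events.writeStream
      .format("parquet")
      .option("path", "/data/out")                      // placeholder output directory
      .option("checkpointLocation", "/data/checkpoint") // placeholder checkpoint location
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()

    query.awaitTermination()
  }
}

Both logs are periodically compacted into a single compact file, and the compact file retains all past entries, so it keeps growing over the lifetime of the query - that growth is what [1] and [2] report.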