Bump, plus one more issue I fixed (by chance, a relevant report came in on the user mailing list recently):

* [SPARK-30462][SS] Streamline the logic on file stream source and sink to
  avoid memory issue [1]

The patch stabilizes the driver's memory usage when working with a huge
metadata log, which previously threw an OOME.

1. https://github.com/apache/spark/pull/28904

On Sun, Jun 14, 2020 at 4:14 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Bump again - hope to get some traction, because these issues are either
> long-standing problems or noticeable improvements (each PR has numbers/UI
> graphs to show the improvement).
>
> Fixed long-standing problems:
>
> * [SPARK-17604][SS] FileStreamSource: provide a new option to have
>   retention on input files [1]
> * [SPARK-27188][SS] FileStreamSink: provide a new option to have retention
>   on output files [2]
>
> There's no logic to control the size of the metadata for file stream
> source and file stream sink, and this affects end users who run a
> streaming query with many input/output files over the long run. Both
> patches address metadata growing incrementally over time. As its issue
> number suggests, SPARK-17604 is a fairly old problem. At least three
> related issues have been reported on SPARK-27188.
>
> Improvements:
>
> * [SPARK-30866][SS] FileStreamSource: Cache fetched list of files beyond
>   maxFilesPerTrigger as unread files [3]
> * [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log
>   twice if the query restarts from compact batch [4]
> * [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with
>   LZ4 compression on FileStream(Source/Sink)Log [5]
>
> The patches above improve performance under the conditions described in
> each PR. Notably, SPARK-30946 makes compaction much faster (~10x) on every
> compact batch, and it also reduces the size of the compact batch log file
> (~30% of current).
>
> 1. https://github.com/apache/spark/pull/28422
> 2. https://github.com/apache/spark/pull/28363
> 3. https://github.com/apache/spark/pull/27620
> 4. https://github.com/apache/spark/pull/27649
> 5. https://github.com/apache/spark/pull/27694
>
> On Fri, May 22, 2020 at 12:50 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
>> Worth noting that I got similar questions from the local community as
>> well. These reporters didn't hit an edge case; they encountered a
>> critical issue in the normal running of a streaming query.
>>
>> On Fri, May 8, 2020 at 4:49 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>
>>> (bump to expose the discussion to more readers)
>>>
>>> On Mon, May 4, 2020 at 5:45 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi devs,
>>>>
>>>> I'm seeing more and more Structured Streaming end users encounter the
>>>> metadata issues on file stream source and sink. These are known
>>>> issues, and there are even long-standing JIRA issues for them, yet end
>>>> users reported them again on the user@ mailing list in April.
>>>>
>>>> * Spark Structure Streaming | FileStreamSourceLog not deleting list of
>>>>   input files | Spark -2.4.0 [1]
>>>> * [Structured Streaming] Checkpoint file compact file grows big [2]
>>>>
>>>> I've proposed various improvements in this area (see my PRs [3]), but
>>>> they have suffered from a lack of interest/reviews. I feel the issue
>>>> is critical (and underestimated) because...
>>>>
>>>> 1. It's one of the "built-in" data sources maintained by the Spark
>>>>    community. (End users may judge the state of the project/area by
>>>>    the quality of the built-in data sources, because that's what they
>>>>    would start with.)
>>>> 2. It's the only built-in data source that provides "end-to-end
>>>>    exactly-once" in Structured Streaming.
>>>>
>>>> I hope to see us address these issues so that end users can live with
>>>> the built-in data source. (It need not be perfect, but it should at
>>>> least be reasonable for long-running streaming workloads.) I know
>>>> there are a couple of alternatives, but I don't think newcomers would
>>>> start there. End users may just look for alternatives - not an
>>>> alternative data source, but an alternative stream processing
>>>> framework.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> 1. https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>>>> 2. https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>>>> 3. https://github.com/apache/spark/pulls/HeartSaVioR