Re: Handling user-facing metadata issues on file stream source & sink

Jungtaek Lim Thu, 25 Jun 2020 21:30:33 -0700

Bump, plus one more issue I've fixed (by chance, a relevant report came in
on the user mailing list recently):


* [SPARK-30462][SS] Streamline the logic on file stream source and sink to
avoid memory issue [1]

The patch stabilizes the driver's memory usage when working with a huge
metadata log, which previously caused an OutOfMemoryError (OOME).

1. https://github.com/apache/spark/pull/28904

On Sun, Jun 14, 2020 at 4:14 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

> Bump again - hoping to get some traction, because these issues are either
> long-standing problems or noticeable improvements (each PR includes numbers
> or a UI graph showing the improvement).
>
> Fixed long-standing problems:
>
> * [SPARK-17604][SS] FileStreamSource: provide a new option to have
> retention on input files [1]
> * [SPARK-27188][SS] FileStreamSink: provide a new option to have retention
> on output files [2]
>
> There is currently no logic to control the size of the metadata for the
> file stream source and file stream sink, which affects end users who run
> long-running streaming queries with many input or output files. Both
> patches address metadata that grows incrementally over time. As its issue
> number suggests, SPARK-17604 is a fairly old problem. At least three
> relevant issues have been reported for SPARK-27188.
>
> Improvements:
>
> * [SPARK-30866][SS] FileStreamSource: Cache fetched list of files beyond
> maxFilesPerTrigger as unread files [3]
> * [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log
> twice if the query restarts from compact batch [4]
> * [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with
> LZ4 compression on FileStream(Source/Sink)Log [5]
>
> The above patches improve performance under the conditions described in
> each PR. Notably, SPARK-30946 makes compaction considerably faster (~10x)
> for every compact batch, and it also reduces the size of the compact batch
> log file (~30% of current).
>
> 1. https://github.com/apache/spark/pull/28422
> 2. https://github.com/apache/spark/pull/28363
> 3. https://github.com/apache/spark/pull/27620
> 4. https://github.com/apache/spark/pull/27649
> 5. https://github.com/apache/spark/pull/27694
>
>
> On Fri, May 22, 2020 at 12:50 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Worth noting that I got similar questions from my local community as well.
>> These reporters didn't hit an edge case; they encountered the critical
>> issue during the normal running of a streaming query.
>>
>> On Fri, May 8, 2020 at 4:49 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> (bump to expose the discussion to more readers)
>>>
>>> On Mon, May 4, 2020 at 5:45 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi devs,
>>>>
>>>> I'm seeing more and more Structured Streaming end users encounter
>>>> metadata issues on the file stream source and sink. These are known
>>>> issues, and there are even long-standing JIRA issues for them, yet end
>>>> users reported them again on the user@ mailing list in April.
>>>>
>>>> * Spark Structure Streaming | FileStreamSourceLog not deleting list of
>>>> input files | Spark -2.4.0 [1]
>>>> * [Structured Streaming] Checkpoint file compact file grows big [2]
>>>>
>>>> I've proposed various improvements in this area (see my PRs [3]), but
>>>> they have suffered from a lack of interest and reviews. I feel the issue
>>>> is critical (and underestimated) because...
>>>>
>>>> 1. It's one of the "built-in" data sources maintained by the Spark
>>>> community. (End users may judge the state of the project/area by the
>>>> quality of the built-in data sources, because that's what they would
>>>> start with.)
>>>> 2. It's the only built-in data source that provides "end-to-end
>>>> exactly-once" in Structured Streaming.
>>>>
>>>> I'd hope to see us address such issues so that end users can live with
>>>> the built-in data source. (It may not need to be perfect, but it should
>>>> at least be reasonable for long-running streaming workloads.) I know
>>>> there are a couple of alternatives, but I don't think newcomers would
>>>> start from there. End users may just try to find alternatives - not an
>>>> alternative data source, but an alternative stream processing framework.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> 1.
>>>> https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>>>> 2.
>>>> https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>>>> 3. https://github.com/apache/spark/pulls/HeartSaVioR
>>>>
>>>
