Nice blog! Thanks for sharing, Etienne! Let's try to raise this discussion again after the 3.1 release. I do think more committers/contributors had realized the issue of global watermark per SPARK-24634 <https://issues.apache.org/jira/browse/SPARK-24634> and SPARK-33259 <https://issues.apache.org/jira/browse/SPARK-33259>.
Leaving some thoughts on my end: 1. Compatibility: The per-operation watermark should be compatible with the original global one when there are no multi-aggregations. 2. Versioning: If we need to change checkpoints' format, versioning info should be added for the first time. 3. Fix more things together: We'd better fix more issues(e.g. per-operation output mode for multi-aggregations) together, which would require versioning changes in the same Spark version. Best, Yuanjian Etienne Chauchot <echauc...@apache.org> 于2020年11月26日周四 下午5:29写道: > Hi, > > Regarding this subject I wrote a blog article that gives details about the > watermark architecture proposal that was discussed in the design doc and in > the PR: > > > https://echauchot.blogspot.com/2020/11/watermark-architecture-proposal-for.html > > Best > > Etienne > On 29/09/2020 03:24, Yuanjian Li wrote: > > Thanks for the great discussion! > > Also interested in this feature and did some investigation before. As Arun > mentioned, similar to the "update" mode, the "complete" mode also needs > more design. We might need an operation level output mode for the complete > mode support. That is to say, if we use "complete" mode for every > aggregation operators, the wrong result will return. > > SPARK-26655 would be a good start, which only considers about "append" > mode. Maybe we need more discussion on the watermark interface. I will take > a close look at the doc and PR. Hope we will have the first version with > limitations and fix/remove them gradually. > > Best, > Yuanjian > > Jungtaek Lim <kabhwan.opensou...@gmail.com> 于2020年9月26日周六 上午10:31写道: > >> Thanks Etienne! Yeah I forgot to say nice talking with you again. And >> sorry I forgot to send the reply (was in draft). >> >> Regarding investment in SS, well, unfortunately I don't know - I'm just >> an individual. There might be various reasons to do so, most probably >> "priority" among the stuff. There's not much I could change. >> >> I agree the workaround is sub-optimal, but unless I see sufficient >> support in the community I probably couldn't make it go forward. I'll just >> say there's an elephant in the room - as the project goes forward for more >> than 10 years, backward compatibility is a top priority concern in the >> project, even across the major versions along the features/APIs. It is >> great for end users to migrate the version easily, but also blocks devs to >> fix the bad design once it ships. I'm the one complaining about these >> issues in the dev list, and I don't see willingness to correct them. >> >> >> On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot <echauc...@apache.org> >> wrote: >> >>> Hi Jungtaek Lim, >>> >>> Nice to hear from you again since last time we talked :) and congrats on >>> becoming a Spark committer in the meantime ! (if I'm not mistaking you were >>> not at the time) >>> >>> I totally agree with what you're saying on merging structural parts of >>> Spark without having a broader consensus. What I don't understand is why >>> there is not more investment in SS. Especially because in another thread >>> the community is discussing about deprecating the regular DStream streaming >>> framework. >>> >>> Is the orientation of Spark now mostly batch ? >>> >>> PS: yeah I saw your update on the doc when I took a look at 3.0 preview >>> 2 searching for this particular feature. And regarding the workaround, I'm >>> not sure it meets my needs as it will add delays and also may mess up with >>> watermarks. >>> >>> Best >>> >>> Etienne Chauchot >>> >>> >>> On 04/09/2020 08:06, Jungtaek Lim wrote: >>> >>> Unfortunately I don't see enough active committers working on Structured >>> Streaming; I don't expect major features/improvements can be brought in >>> this situation. >>> >>> Technically I can review and merge the PR on major improvements in SS, >>> but that depends on how huge the proposal is changing. If the proposal >>> brings conceptual change, being reviewed by a committer wouldn't still be >>> enough. >>> >>> So that's not due to the fact we think it's worthless. (That might be >>> only me though.) I'd understand as there's not much investment on SS. >>> There's also a known workaround for multiple aggregations (I've documented >>> in the SS guide doc, in "Limitation of global watermark" section), though I >>> totally agree the workaround is bad. >>> >>> On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot <echauc...@apache.org> >>> wrote: >>> >>>> Hi all, >>>> >>>> I'm also very interested in this feature but the PR is open since >>>> January 2019 and was not updated. It raised a design discussion around >>>> watermarks and a design doc was written ( >>>> https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1). >>>> We also commented this design but no matter what it seems that the subject >>>> is still stale. >>>> >>>> Is there any interest in the community in delivering this feature or is >>>> it considered worthless ? If the latter, can you explain why ? >>>> >>>> Best >>>> >>>> Etienne >>>> On 22/05/2019 03:38, 张万新 wrote: >>>> >>>> Thanks, I'll check it out. >>>> >>>> Arun Mahadevan <ar...@apache.org> 于 2019年5月21日周二 01:31写道: >>>> >>>>> Heres the proposal for supporting it in "append" mode - >>>>> https://github.com/apache/spark/pull/23576. You could see if it >>>>> addresses your requirement and post your feedback in the PR. >>>>> For "update" mode its going to be much harder to support this without >>>>> first adding support for "retractions", otherwise we would end up with >>>>> wrong results. >>>>> >>>>> - Arun >>>>> >>>>> >>>>> On Mon, 20 May 2019 at 01:34, Gabor Somogyi <gabor.g.somo...@gmail.com> >>>>> wrote: >>>>> >>>>>> There is PR for this but not yet merged. >>>>>> >>>>>> On Mon, May 20, 2019 at 10:13 AM 张万新 <kevinzwx1...@gmail.com> wrote: >>>>>> >>>>>>> Hi there, >>>>>>> >>>>>>> I'd like to know what's the root reason why multiple aggregations on >>>>>>> streaming dataframe is not allowed since it's a very useful feature, and >>>>>>> flink has supported it for a long time. >>>>>>> >>>>>>> Thanks. >>>>>>> >>>>>>