Nice blog! Thanks for sharing, Etienne!

Let's try to raise this discussion again after the 3.1 release. I do think
more committers/contributors have realized the issue with the global
watermark, per SPARK-24634
<https://issues.apache.org/jira/browse/SPARK-24634> and SPARK-33259
<https://issues.apache.org/jira/browse/SPARK-33259>.

Leaving some thoughts on my end:
1. Compatibility: The per-operator watermark should be compatible with the
original global one when there are no multiple aggregations.
2. Versioning: If we need to change the checkpoint format, versioning info
should be added for the first time.
3. Fix more things together: We'd better fix related issues (e.g., a
per-operator output mode for multiple aggregations) together, so that all
the format/versioning changes land in the same Spark version.
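To make point 1 concrete, here is a toy model of watermark propagation
(plain Python, not Spark code; the delay and timestamps are made-up
values). With a single aggregation, a per-operator watermark computed from
the operator's inputs coincides with the global watermark, while a
downstream operator in a chain can lag behind it:

```python
# Toy model: a watermark is min(max event time seen per input) - delay.
DELAY = 10

def watermark(max_event_times, delay=DELAY):
    """Watermark derived from the max event time seen on each input."""
    return min(max_event_times) - delay

# Single aggregation: the operator's inputs are the source partitions,
# so the per-operator and global watermarks coincide.
source_max_times = [100, 95, 120]
global_wm = watermark(source_max_times)
per_op_wm = watermark(source_max_times)
assert global_wm == per_op_wm == 85

# Chained aggregations: the second operator consumes the first operator's
# output, whose event times lag behind the sources, so its watermark
# trails the global one. This is the case a single global watermark
# cannot represent.
first_op_output_max_times = [90, 88]
second_op_wm = watermark(first_op_output_max_times)
assert second_op_wm == 78  # trails global_wm (85)
```

This is only a sketch of the compatibility argument, not a proposal for
the actual interface.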

Best,
Yuanjian


Etienne Chauchot <echauc...@apache.org> wrote on Thu, Nov 26, 2020, at 5:29 PM:

> Hi,
>
> Regarding this subject, I wrote a blog article that details the watermark
> architecture proposal discussed in the design doc and in the PR:
>
>
> https://echauchot.blogspot.com/2020/11/watermark-architecture-proposal-for.html
>
> Best
>
> Etienne
> On 29/09/2020 03:24, Yuanjian Li wrote:
>
> Thanks for the great discussion!
>
> Also interested in this feature; I did some investigation before. As Arun
> mentioned, similar to the "update" mode, the "complete" mode also needs
> more design. We might need an operator-level output mode to support the
> complete mode; that is to say, if we use "complete" mode for every
> aggregation operator, we will get wrong results.
>
> SPARK-26655 would be a good start, as it only considers "append" mode.
> Maybe we need more discussion on the watermark interface. I will take a
> close look at the doc and PR. I hope we can ship a first version with
> limitations and then fix/remove them gradually.
>
> Best,
> Yuanjian
>
> Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote on Sat, Sep 26, 2020, at 10:31 AM:
>
>> Thanks Etienne! Yeah, I forgot to say it was nice talking with you
>> again. And sorry I forgot to send the reply (it was sitting in drafts).
>>
>> Regarding investment in SS, unfortunately I don't know - I'm just an
>> individual. There might be various reasons, most probably "priority"
>> among other things. There's not much I can change.
>>
>> I agree the workaround is sub-optimal, but unless I see sufficient
>> support in the community I probably can't push it forward. I'll just say
>> there's an elephant in the room: as the project has moved forward for
>> more than 10 years, backward compatibility has become a top-priority
>> concern, even for features/APIs across major versions. That is great for
>> end users, who can migrate versions easily, but it also blocks devs from
>> fixing a bad design once it ships. I'm the one complaining about these
>> issues on the dev list, and I don't see willingness to correct them.
>>
>>
>> On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot <echauc...@apache.org>
>> wrote:
>>
>>> Hi Jungtaek Lim,
>>>
>>> Nice to hear from you again since the last time we talked :) and
>>> congrats on becoming a Spark committer in the meantime! (If I'm not
>>> mistaken, you weren't one at the time.)
>>>
>>> I totally agree with what you're saying about merging structural parts
>>> of Spark without a broader consensus. What I don't understand is why
>>> there is not more investment in SS, especially since in another thread
>>> the community is discussing deprecating the regular DStream streaming
>>> framework.
>>>
>>> Is the orientation of Spark now mostly batch?
>>>
>>> PS: yeah, I saw your update on the doc when I took a look at the 3.0
>>> preview 2 searching for this particular feature. And regarding the
>>> workaround, I'm not sure it meets my needs, as it will add delays and
>>> may also interfere with the watermarks.
>>>
>>> Best
>>>
>>> Etienne Chauchot
>>>
>>>
>>> On 04/09/2020 08:06, Jungtaek Lim wrote:
>>>
>>> Unfortunately I don't see enough active committers working on
>>> Structured Streaming; I don't expect major features/improvements can
>>> land in this situation.
>>>
>>> Technically I can review and merge a PR for a major improvement in SS,
>>> but that depends on how big a change the proposal makes. If the
>>> proposal brings a conceptual change, review by a single committer still
>>> wouldn't be enough.
>>>
>>> So it's not that we think it's worthless. (That might be only me,
>>> though.) I'd read it as there simply not being much investment in SS.
>>> There's also a known workaround for multiple aggregations (I've
>>> documented it in the SS guide, in the "Limitation of global watermark"
>>> section), though I totally agree the workaround is bad.
>>>
>>> On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot <echauc...@apache.org>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm also very interested in this feature, but the PR has been open
>>>> since January 2019 without updates. It raised a design discussion
>>>> around watermarks, and a design doc was written (
>>>> https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
>>>> We also commented on this design, but whatever we do, the subject
>>>> seems to remain stale.
>>>>
>>>> Is there any interest in the community in delivering this feature, or
>>>> is it considered worthless? If the latter, can you explain why?
>>>>
>>>> Best
>>>>
>>>> Etienne
>>>> On 22/05/2019 03:38, 张万新 wrote:
>>>>
>>>> Thanks, I'll check it out.
>>>>
>>>> Arun Mahadevan <ar...@apache.org> wrote on Tue, May 21, 2019, at 01:31:
>>>>
>>>>> Here's the proposal for supporting it in "append" mode:
>>>>> https://github.com/apache/spark/pull/23576. You could see if it
>>>>> addresses your requirement and post your feedback in the PR.
>>>>> For "update" mode it's going to be much harder to support this
>>>>> without first adding support for "retractions"; otherwise we would
>>>>> end up with wrong results.
>>>>>
>>>>> - Arun
>>>>>
>>>>>
>>>>> On Mon, 20 May 2019 at 01:34, Gabor Somogyi <gabor.g.somo...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> There is a PR for this, but it is not yet merged.
>>>>>>
>>>>>> On Mon, May 20, 2019 at 10:13 AM 张万新 <kevinzwx1...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I'd like to know the root reason why multiple aggregations on a
>>>>>>> streaming dataframe are not allowed, since it's a very useful
>>>>>>> feature and Flink has supported it for a long time.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>

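As a footnote to Arun's point above about "update" mode and retractions:
here is a minimal, framework-agnostic sketch (plain Python, not Spark
code; the data is made up) of why chaining a second aggregation onto
update-mode output gives wrong results unless earlier emissions are
retracted:

```python
# Simulate two chained aggregations where the first emits running
# ("update" mode) results. Without retracting earlier emissions, the
# downstream aggregation double-counts.

events = ["a", "a", "b"]  # upstream rows, keyed by value

# First aggregation: running count per key, emitted after every row.
emitted = []
counts = {}
for key in events:
    counts[key] = counts.get(key, 0) + 1
    emitted.append((key, counts[key]))  # update-mode emission

# Second aggregation naively sums the emitted counts per key.
downstream = {}
for key, c in emitted:
    downstream[key] = downstream.get(key, 0) + c

# Correct result is {"a": 2, "b": 1}, but the naive chain yields
# {"a": 3, "b": 1} because the intermediate ("a", 1) was never retracted.
assert counts == {"a": 2, "b": 1}
assert downstream == {"a": 3, "b": 1}
```

With retraction support, the first operator would also emit ("a", -1)
before ("a", 2), letting the downstream state converge to the correct
answer; that is the machinery the "update" mode discussion refers to.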