Re: What's the root cause of not supporting multiple aggregations in structured streaming?

Etienne Chauchot Thu, 26 Nov 2020 01:29:50 -0800

Hi,

Regarding this subject I wrote a blog article that gives details aboutthe watermark architecture proposal that was discussed in the design docand in the PR:


https://echauchot.blogspot.com/2020/11/watermark-architecture-proposal-for.html

Best

Etienne

On 29/09/2020 03:24, Yuanjian Li wrote:

Thanks for the great discussion!

Also interested in this feature and did some investigation before. AsArun mentioned, similar to the "update" mode, the "complete" mode alsoneeds more design. We might need an operation level output mode forthe complete mode support. That is to say, if we use "complete" modefor every aggregation operators, the wrong result will return.

SPARK-26655 would be a good start, which only considers about "append"mode. Maybe we need more discussion on the watermark interface. I willtake a close look at the doc and PR. Hope we will have the firstversion with limitations and fix/remove them gradually.


Best,
Yuanjian

Jungtaek Lim <kabhwan.opensou...@gmail.com<mailto:kabhwan.opensou...@gmail.com>> 于2020年9月26日周六上午10:31写道：


    Thanks Etienne! Yeah I forgot to say nice talking with you again.
    And sorry I forgot to send the reply (was in draft).

    Regarding investment in SS, well, unfortunately I don't know - I'm
    just an individual. There might be various reasons to do so, most
    probably "priority" among the stuff. There's not much I could change.

    I agree the workaround is sub-optimal, but unless I see sufficient
    support in the community I probably couldn't make it go forward.
    I'll just say there's an elephant in the room - as the project
    goes forward for more than 10 years, backward compatibility is a
    top priority concern in the project, even across the major
    versions along the features/APIs. It is great for end users to
    migrate the version easily, but also blocks devs to fix the bad
    design once it ships. I'm the one complaining about these issues
    in the dev list, and I don't see willingness to correct them.


    On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot
    <echauc...@apache.org <mailto:echauc...@apache.org>> wrote:

        Hi Jungtaek Lim,

        Nice to hear from you again since last time we talked :) and
        congrats on becoming a Spark committer in the meantime ! (if
        I'm not mistaking you were not at the time)

        I totally agree with what you're saying on merging structural
        parts of Spark without having a broader consensus. What I
        don't understand is why there is not more investment in SS.
        Especially because in another thread the community is
        discussing about deprecating the regular DStream streaming
        framework.

        Is the orientation of Spark now mostly batch ?

        PS: yeah I saw your update on the doc when I took a look at
        3.0 preview 2 searching for this particular feature. And
        regarding the workaround, I'm not sure it meets my needs as it
        will add delays and also may mess up with watermarks.

        Best

        Etienne Chauchot


        On 04/09/2020 08:06, Jungtaek Lim wrote:

Unfortunately I don't see enough active committers working on
Structured Streaming; I don't expect major
features/improvements can be brought in this situation.

Technically I can review and merge the PR on major
improvements in SS, but that depends on how huge the proposal
is changing. If the proposal brings conceptual change, being
reviewed by a committer wouldn't still be enough.

So that's not due to the fact we think it's worthless. (That
might be only me though.) I'd understand as there's not much
investment on SS. There's also a known workaround for
multiple aggregations (I've documented in the SS guide doc,
in "Limitation of global watermark" section), though I
totally agree the workaround is bad.

On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot
<echauc...@apache.org <mailto:echauc...@apache.org>> wrote:

Hi all,

I'm also very interested in this feature but the PR is
open since January 2019 and was not updated. It raised a
design discussion around watermarks and a design doc was
written

(https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
We also commented this design but no matter what it seems
that the subject is still stale.

Is there any interest in the community in delivering this
feature or is it considered worthless ? If the latter,
can you explain why ?

Best

Etienne

On 22/05/2019 03:38, 张万新 wrote:

            Thanks, I'll check it out.

            Arun Mahadevan <ar...@apache.org
            <mailto:ar...@apache.org>> 于 2019年5月21日周二 01:31写道：

                Heres the proposal for supporting it in "append"
                mode - https://github.com/apache/spark/pull/23576.
                You could see if it addresses your requirement and
                post your feedback in the PR.
                For "update" mode its going to be much harder to
                support this without first adding support for
                "retractions", otherwise we would end up with wrong
                results.

                - Arun


                On Mon, 20 May 2019 at 01:34, Gabor Somogyi
                <gabor.g.somo...@gmail.com
                <mailto:gabor.g.somo...@gmail.com>> wrote:

                    There is PR for this but not yet merged.

                    On Mon, May 20, 2019 at 10:13 AM 张万新
                    <kevinzwx1...@gmail.com
                    <mailto:kevinzwx1...@gmail.com>> wrote:

                        Hi there,

                        I'd like to know what's the root reason why
                        multiple aggregations on streaming dataframe
                        is not allowed since it's a very useful
                        feature, and flink has supported it for a
                        long time.

                        Thanks.

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

Reply via email to