Re: [DISCUSS] FLIP-188: Introduce Built-in Dynamic Table Storage

Jingsong Li Thu, 06 Jan 2022 23:30:31 -0800

Hi Timo,

I think we can consider exposing to DataStream users in the future, if
the API definition is clear after.
I am fine with `flink-table-store` too.
But I tend to prefer shorter and clearer name:
`flink-store`.


I think I can create a separate thread to vote.

Looking forward to your thoughts!

Best,
Jingsong


On Thu, Dec 30, 2021 at 9:48 PM Timo Walther <[email protected]> wrote:
>
> +1 for a separate repository. And also +1 for finding a good name.
>
> `flink-warehouse` would be definitely a good marketing name but I agree
> that we should not start marketing for code bases. Are we planning to
> make this storage also available to DataStream API users? If not, I
> would also vote for `flink-managed-table` or better: `flink-table-store`
>
> Thanks,
> Timo
>
>
>
> On 29.12.21 07:58, Jingsong Li wrote:
> > Thanks Till for your suggestions.
> >
> > Personally, I like flink-warehouse, this is what we want to convey to
> > the user, but it indicates a bit too much scope.
> >
> > How about just calling it flink-store?
> > Simply to convey an impression: this is flink's store project,
> > providing a built-in store for the flink compute engine, which can be
> > used by flink-table as well as flink-datastream.
> >
> > Best,
> > Jingsong
> >
> > On Tue, Dec 28, 2021 at 5:15 PM Till Rohrmann <[email protected]> wrote:
> >>
> >> Hi Jingsong,
> >>
> >> I think that developing flink-dynamic-storage as a separate sub project is
> >> a very good idea since it allows us to move a lot faster and decouple
> >> releases from Flink. Hence big +1.
> >>
> >> Do we want to name it flink-dynamic-storage or shall we use a more
> >> descriptive name? dynamic-storage sounds a bit generic to me and I wouldn't
> >> know that this has something to do with letting Flink manage your tables
> >> and their storage. I don't have a very good idea but maybe we can call it
> >> flink-managed-tables, flink-warehouse, flink-olap or so.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Tue, Dec 28, 2021 at 9:49 AM Martijn Visser <[email protected]>
> >> wrote:
> >>
> >>> Hi Jingsong,
> >>>
> >>> That sounds promising! +1 from my side to continue development under
> >>> flink-dynamic-storage as a Flink subproject. I think having a more 
> >>> in-depth
> >>> interface will benefit everyone.
> >>>
> >>> Best regards,
> >>>
> >>> Martijn
> >>>
> >>> On Tue, 28 Dec 2021 at 04:23, Jingsong Li <[email protected]> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> After some experimentation, we felt no problem putting the dynamic
> >>>> storage outside of flink, and it also allowed us to design the
> >>>> interface in more depth.
> >>>>
> >>>> What do you think? If there is no problem, I am asking for PMC's help
> >>>> here: we want to propose flink-dynamic-storage as a flink subproject,
> >>>> and we want to build the project under apache.
> >>>>
> >>>> Best,
> >>>> Jingsong
> >>>>
> >>>>
> >>>> On Wed, Nov 24, 2021 at 8:10 PM Jingsong Li <[email protected]>
> >>>> wrote:
> >>>>>
> >>>>> Hi Stephan,
> >>>>>
> >>>>> Thanks for your reply.
> >>>>>
> >>>>> Data never expires automatically.
> >>>>>
> >>>>> If there is a need for data retention, the user can choose one of the
> >>>>> following options:
> >>>>> - In the SQL for querying the managed table, users filter the data by
> >>>> themselves
> >>>>> - Define the time partition, and users can delete the expired
> >>>>> partition by themselves. (DROP PARTITION ...)
> >>>>> - In the future version, we will support the "DELETE FROM" statement,
> >>>>> users can delete the expired data according to the conditions.
> >>>>>
> >>>>> So to answer your question:
> >>>>>
> >>>>>> Will the VMQ send retractions so that the data will be removed from
> >>>> the table (via compactions)?
> >>>>>
> >>>>> The current implementation is not sending retraction, which I think
> >>>>> theoretically should be sent, currently the user can filter by
> >>>>> subsequent conditions.
> >>>>> And yes, the subscriber would not see strictly a correct result. I
> >>>>> think this is something we can improve for Flink SQL.
> >>>>>
> >>>>>> Do we want time retention semantics handled by the compaction?
> >>>>>
> >>>>> Currently, no, Data never expires automatically.
> >>>>>
> >>>>>> Do we want to declare those types of queries "out of scope" initially?
> >>>>>
> >>>>> I think we want users to be able to use three options above to
> >>>>> accomplish their requirements.
> >>>>>
> >>>>> I will update FLIP to make the definition clearer and more explicit.
> >>>>>
> >>>>> Best,
> >>>>> Jingsong
> >>>>>
> >>>>> On Wed, Nov 24, 2021 at 5:01 AM Stephan Ewen <[email protected]>
> >>>> wrote:
> >>>>>>
> >>>>>> Thanks for digging into this.
> >>>>>> Regarding this query:
> >>>>>>
> >>>>>> INSERT INTO the_table
> >>>>>>    SELECT window_end, COUNT(*)
> >>>>>>      FROM (TUMBLE(TABLE interactions, DESCRIPTOR(ts), INTERVAL '5'
> >>>> MINUTES))
> >>>>>> GROUP BY window_end
> >>>>>>    HAVING now() - window_end <= INTERVAL '14' DAYS;
> >>>>>>
> >>>>>> I am not sure I understand what the conclusion is on the data
> >>>> retention question, where the continuous streaming SQL query has 
> >>>> retention
> >>>> semantics. I think we would need to answer the following questions (I 
> >>>> will
> >>>> call the query that computed the managed table the "view materializer
> >>>> query" - VMQ).
> >>>>>>
> >>>>>> (1) I guess the VMQ will send no updates for windows beyond the
> >>>> "retention period" is over (14 days), as you said. That makes sense.
> >>>>>>
> >>>>>> (2) Will the VMQ send retractions so that the data will be removed
> >>>> from the table (via compactions)?
> >>>>>>    - if yes, this seems semantically better for users, but it will be
> >>>> expensive to keep the timers for retractions.
> >>>>>>    - if not, we can still solve this by adding filters to queries
> >>>> against the managed table, as long as these queries are in Flink.
> >>>>>>    - any subscriber to the changelog stream would not see strictly a
> >>>> correct result if we are not doing the retractions
> >>>>>>
> >>>>>> (3) Do we want time retention semantics handled by the compaction?
> >>>>>>    - if we say that we lazily apply the deletes in the queries that
> >>>> read the managed tables, then we could also age out the old data during
> >>>> compaction.
> >>>>>>    - that is cheap, but it might be too much of a special case to be
> >>>> very relevant here.
> >>>>>>
> >>>>>> (4) Do we want to declare those types of queries "out of scope"
> >>>> initially?
> >>>>>>    - if yes, how many users are we affecting? (I guess probably not
> >>>> many, but would be good to hear some thoughts from others on this)
> >>>>>>    - should we simply reject such queries in the optimizer as "not
> >>>> possible to support in managed tables"? I would suggest that, always 
> >>>> better
> >>>> to tell users exactly what works and what not, rather than letting them 
> >>>> be
> >>>> surprised in the end. Users can still remove the HAVING clause if they 
> >>>> want
> >>>> the query to run, and that would be better than if the VMQ just silently
> >>>> ignores those semantics.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Stephan
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Best, Jingsong Lee
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Best, Jingsong Lee
> >>>>
> >>>
> >
> >
> >
>


--
Best, Jingsong Lee

Re: [DISCUSS] FLIP-188: Introduce Built-in Dynamic Table Storage

Reply via email to