Hi Jingsong,

I think that developing flink-dynamic-storage as a separate subproject is a very good idea, since it allows us to move a lot faster and decouples its releases from Flink's. Hence a big +1.

Do we want to name it flink-dynamic-storage or shall we use a more descriptive name? dynamic-storage sounds a bit generic to me and I wouldn't know that this has something to do with letting Flink manage your tables and their storage. I don't have a very good idea but maybe we can call it flink-managed-tables, flink-warehouse, flink-olap or so.

Cheers,
Till

On Tue, Dec 28, 2021 at 9:49 AM Martijn Visser <mart...@ververica.com> wrote:

> Hi Jingsong,
>
> That sounds promising! +1 from my side to continue development under
> flink-dynamic-storage as a Flink subproject. I think having a more in-depth
> interface will benefit everyone.
>
> Best regards,
>
> Martijn
>
> On Tue, 28 Dec 2021 at 04:23, Jingsong Li <jingsongl...@gmail.com> wrote:
>
>> Hi all,
>>
>> After some experimentation, we felt no problem putting the dynamic
>> storage outside of flink, and it also allowed us to design the
>> interface in more depth.
>>
>> What do you think? If there is no problem, I am asking for PMC's help
>> here: we want to propose flink-dynamic-storage as a flink subproject,
>> and we want to build the project under apache.
>>
>> Best,
>> Jingsong
>>
>>
>> On Wed, Nov 24, 2021 at 8:10 PM Jingsong Li <jingsongl...@gmail.com> wrote:
>> >
>> > Hi Stephan,
>> >
>> > Thanks for your reply.
>> >
>> > Data never expires automatically.
>> >
>> > If there is a need for data retention, the user can choose one of the
>> > following options:
>> > - In the SQL for querying the managed table, users filter the data by
>> >   themselves
>> > - Define the time partition, and users can delete the expired
>> >   partition by themselves. (DROP PARTITION ...)
>> > - In the future version, we will support the "DELETE FROM" statement,
>> >   users can delete the expired data according to the conditions.
>> >
>> > So to answer your question:
>> >
>> > > Will the VMQ send retractions so that the data will be removed from
>> > > the table (via compactions)?
>> >
>> > The current implementation is not sending retraction, which I think
>> > theoretically should be sent, currently the user can filter by
>> > subsequent conditions.
>> > And yes, the subscriber would not see strictly a correct result. I
>> > think this is something we can improve for Flink SQL.
>> >
>> > > Do we want time retention semantics handled by the compaction?
>> >
>> > Currently, no, Data never expires automatically.
>> >
>> > > Do we want to declare those types of queries "out of scope" initially?
>> >
>> > I think we want users to be able to use three options above to
>> > accomplish their requirements.
>> >
>> > I will update FLIP to make the definition clearer and more explicit.
>> >
>> > Best,
>> > Jingsong
>> >
>> > On Wed, Nov 24, 2021 at 5:01 AM Stephan Ewen <ewenstep...@gmail.com> wrote:
>> > >
>> > > Thanks for digging into this.
>> > > Regarding this query:
>> > >
>> > > INSERT INTO the_table
>> > > SELECT window_end, COUNT(*)
>> > > FROM (TUMBLE(TABLE interactions, DESCRIPTOR(ts), INTERVAL '5' MINUTES))
>> > > GROUP BY window_end
>> > > HAVING now() - window_end <= INTERVAL '14' DAYS;
>> > >
>> > > I am not sure I understand what the conclusion is on the data
>> > > retention question, where the continuous streaming SQL query has
>> > > retention semantics. I think we would need to answer the following
>> > > questions (I will call the query that computed the managed table the
>> > > "view materializer query" - VMQ).
>> > >
>> > > (1) I guess the VMQ will send no updates for windows beyond the
>> > > "retention period" is over (14 days), as you said. That makes sense.
>> > >
>> > > (2) Will the VMQ send retractions so that the data will be removed
>> > > from the table (via compactions)?
>> > >   - if yes, this seems semantically better for users, but it will be
>> > >     expensive to keep the timers for retractions.
>> > >   - if not, we can still solve this by adding filters to queries
>> > >     against the managed table, as long as these queries are in Flink.
>> > >   - any subscriber to the changelog stream would not see strictly a
>> > >     correct result if we are not doing the retractions
>> > >
>> > > (3) Do we want time retention semantics handled by the compaction?
>> > >   - if we say that we lazily apply the deletes in the queries that
>> > >     read the managed tables, then we could also age out the old data
>> > >     during compaction.
>> > >   - that is cheap, but it might be too much of a special case to be
>> > >     very relevant here.
>> > >
>> > > (4) Do we want to declare those types of queries "out of scope"
>> > > initially?
>> > >   - if yes, how many users are we affecting? (I guess probably not
>> > >     many, but would be good to hear some thoughts from others on this)
>> > >   - should we simply reject such queries in the optimizer as "not
>> > >     possible to support in managed tables"? I would suggest that,
>> > >     always better to tell users exactly what works and what not,
>> > >     rather than letting them be surprised in the end. Users can still
>> > >     remove the HAVING clause if they want the query to run, and that
>> > >     would be better than if the VMQ just silently ignores those
>> > >     semantics.
>> > >
>> > > Thanks,
>> > > Stephan
>> > >
>> >
>> >
>> > --
>> > Best, Jingsong Lee
>>
>>
>>
>> --
>> Best, Jingsong Lee
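
For anyone trying to picture the trade-off in points (2) and (3) of Stephan's message, here is a rough sketch in plain Python. It is not Flink code and does not reflect how flink-dynamic-storage is implemented; `ManagedTable`, `WindowCount`, `scan`, `raw_changelog` and `compact` are hypothetical names made up for the illustration. It assumes no retractions are sent, so readers filter expired rows lazily, and compaction may later drop the same rows physically.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Mirrors INTERVAL '14' DAYS from the quoted query (assumption for this sketch).
RETENTION = timedelta(days=14)


@dataclass(frozen=True)
class WindowCount:
    window_end: datetime
    count: int


def expired(row: WindowCount, now: datetime) -> bool:
    """A row is expired once it no longer satisfies the HAVING condition."""
    return now - row.window_end > RETENTION


class ManagedTable:
    """Hypothetical stand-in for a managed table; not a Flink API."""

    def __init__(self) -> None:
        self._rows: list[WindowCount] = []

    def insert(self, row: WindowCount) -> None:
        self._rows.append(row)

    def scan(self, now: datetime) -> list[WindowCount]:
        # Point (2), "if not" branch: no retraction was written, so the
        # reading query applies the retention filter itself.
        return [r for r in self._rows if not expired(r, now)]

    def raw_changelog(self) -> list[WindowCount]:
        # A changelog subscriber that does not filter still sees expired
        # rows, which is why its result is not strictly correct.
        return list(self._rows)

    def compact(self, now: datetime) -> int:
        # Point (3): compaction ages out rows that filtered reads already
        # hide. Returns the number of rows physically removed.
        before = len(self._rows)
        self._rows = [r for r in self._rows if not expired(r, now)]
        return before - len(self._rows)
```

With one row 20 days old and one row 1 day old, `scan` returns only the recent row, `raw_changelog` still returns both, and `compact` removes the old one so that both views agree afterwards.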