Hi Stephan,

Thanks for your reply.

Data never expires automatically. If data retention is needed, the user can
choose one of the following options:

- Filter the data in the SQL query that reads the managed table.
- Define a time partition and delete expired partitions manually
  (DROP PARTITION ...).
- In a future version, we will support the "DELETE FROM" statement, so users
  can delete expired data according to conditions.

So, to answer your questions:

> Will the VMQ send retractions so that the data will be removed from the table
> (via compactions)?

The current implementation does not send retractions. I think that, in theory,
they should be sent; for now, users can apply filter conditions in subsequent
queries. And yes, a subscriber would not see a strictly correct result. I think
this is something we can improve for Flink SQL.

> Do we want time retention semantics handled by the compaction?

Currently, no. Data never expires automatically.

> Do we want to declare those types of queries "out of scope" initially?

I think we want users to be able to use the three options above to meet their
requirements.

I will update the FLIP to make the definition clearer and more explicit.

Best,
Jingsong

On Wed, Nov 24, 2021 at 5:01 AM Stephan Ewen <ewenstep...@gmail.com> wrote:
>
> Thanks for digging into this.
> Regarding this query:
>
> INSERT INTO the_table
> SELECT window_end, COUNT(*)
> FROM (TUMBLE(TABLE interactions, DESCRIPTOR(ts), INTERVAL '5' MINUTES))
> GROUP BY window_end
> HAVING now() - window_end <= INTERVAL '14' DAYS;
>
> I am not sure I understand what the conclusion is on the data retention
> question, where the continuous streaming SQL query has retention semantics. I
> think we would need to answer the following questions (I will call the query
> that computed the managed table the "view materializer query" - VMQ).
>
> (1) I guess the VMQ will send no updates for windows beyond the "retention
> period" is over (14 days), as you said. That makes sense.
>
> (2) Will the VMQ send retractions so that the data will be removed from the
> table (via compactions)?
> - if yes, this seems semantically better for users, but it will be
> expensive to keep the timers for retractions.
> - if not, we can still solve this by adding filters to queries against the
> managed table, as long as these queries are in Flink.
> - any subscriber to the changelog stream would not see strictly a correct
> result if we are not doing the retractions
>
> (3) Do we want time retention semantics handled by the compaction?
> - if we say that we lazily apply the deletes in the queries that read the
> managed tables, then we could also age out the old data during compaction.
> - that is cheap, but it might be too much of a special case to be very
> relevant here.
>
> (4) Do we want to declare those types of queries "out of scope" initially?
> - if yes, how many users are we affecting? (I guess probably not many, but
> would be good to hear some thoughts from others on this)
> - should we simply reject such queries in the optimizer as "not possible to
> support in managed tables"? I would suggest that, always better to tell users
> exactly what works and what not, rather than letting them be surprised in the
> end. Users can still remove the HAVING clause if they want the query to run,
> and that would be better than if the VMQ just silently ignores those
> semantics.
>
> Thanks,
> Stephan
>

--
Best, Jingsong Lee