Hi Becket, Thanks for the response.
1. I wasn’t saying that a materialised view must be mutable or not. The same thing applies to caches as well. On the contrary, I would expect more consistency and more automatic updates from something called a “cache” than from something called a “materialised view”. In other words, IMO most caches do not serve you invalid/outdated data and they handle updates on their own.

2. I don’t think that having two very similar concepts of `materialized view` and `cache` in the future is a good idea. It would be confusing for the users. I think it could be handled by variations/overloading of the materialised view concept. We could start with:

`MaterializedTable materialize()` - immutable, session-scoped lifetime (basically the same semantics as you are proposing).

And then in the future (if ever) build on top of that/expand it with:

`MaterializedTable materialize(refreshTime=…)` or
`MaterializedTable materialize(refreshHook=…)`

Or with cross-session support:

`MaterializedTable materializeInto(connector=…)` or
`MaterializedTable materializeInto(tableFactory=…)`

I’m not saying that we should implement cross-session support or refreshing now, or even in the near future. I’m just arguing that naming the current immutable, session-scoped method `materialize()` is more future proof and more consistent with SQL (on which, after all, the Table API is heavily based).

3. Even if we agree on naming it `cache()`, I would still insist on `cache()` returning a `CachedTable` handle, to avoid implicit behaviours/side effects and to give both us and the users more flexibility (a rough sketch of such a handle-based API is at the very bottom of this mail, below the quoted thread).

Piotrek

> On 29 Nov 2018, at 06:20, Becket Qin <becket....@gmail.com> wrote:
>
> Just to add a little bit, the materialized view is probably more similar to the persistent() brought up earlier in the thread. So it is usually cross session and could be used in a larger scope. For example, a materialized view created by user A may be visible to user B. It is probably something we want to have in the future. I'll put it in the future work section.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <becket....@gmail.com> wrote:
>
>> Hi Piotrek,
>>
>> Thanks for the explanation.
>>
>> Right now we are mostly thinking of the cached table as immutable. I can see the Materialized view would be useful in the future. That said, I think a simple cache mechanism is probably still needed. So to me, cache() and materialize() should be two separate methods as they address different needs. Materialize() is a higher level concept usually implying periodical update, while cache() has a much simpler semantic. For example, one may create a materialized view and use the cache() method in the materialized view creation logic. So that during the materialized view update, they do not need to worry about the case that the cached table is also changed. Maybe under the hood, materialize() and cache() could share some mechanism, but I think a simple cache() method would be handy in a lot of cases.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <pi...@data-artisans.com> wrote:
>>
>>> Hi Becket,
>>>
>>>> Is there any extra thing user can do on a MaterializedTable that they cannot do on a Table?
>>>
>>> Maybe not in the initial implementation, but various DBs offer different ways to “refresh” the materialised view. Hooks, triggers, timers, manually etc. Having `MaterializedTable` would help us to handle that in the future.
>>>
>>>> After users call *table.cache(), *users can just use that table and do anything that is supported on a Table, including SQL.
>>>
>>> This is some implicit behaviour with side effects. Imagine if a user has a long and complicated program that touches table `b` multiple times, maybe scattered around different methods. If he modifies his program by inserting in one place
>>>
>>> b.cache()
>>>
>>> this implicitly alters the semantics and behaviour of his code all over the place, maybe in ways that might cause problems. For example, what if the underlying data is changing?
>>>
>>> Having invisible side effects is also not very clean, for example think about something like this (but more complicated):
>>>
>>> Table b = ...;
>>>
>>> if (some_condition) {
>>>     processTable1(b)
>>> } else {
>>>     processTable2(b)
>>> }
>>>
>>> // do more stuff with b
>>>
>>> And the user adds a `b.cache()` call to only one of the `processTable1` or `processTable2` methods.
>>>
>>> On the other hand,
>>>
>>> Table materialisedB = b.materialize()
>>>
>>> avoids (at least some of) the side effect issues and forces the user to explicitly use `materialisedB` where it’s appropriate, and forces the user to think about what it actually means. And if something doesn’t work in the end for the user, he will know what he has changed instead of blaming Flink for some “magic” underneath. In the above example, after materialising `b` in only one of the methods, he should/would realise the issue when handling the `MaterializedTable` return value of that method.
>>>
>>> I guess it comes down to personal preference whether you like things to be implicit or not. The more of a power user someone is, probably the more likely he is to like/understand implicit behaviour. And we as Table API designers are the most power users out there, so I would proceed with caution (so that we do not end up in the crazy Perl realm with its lovely implicit method arguments ;) <https://stackoverflow.com/a/14922656/8149051>)
>>>
>>>> Table API to also support non-relational processing cases, cache() might be slightly better.
>>>
>>> I think even such an extended Table API could benefit from sticking to/being consistent with SQL, where both SQL and the Table API are basically the same.
>>>
>>> One more thing: `MaterializedTable materialize()` could be more powerful/flexible, allowing the user to operate both on the materialised and the non-materialised view at the same time for whatever reasons (underlying data changing/better optimisation opportunities after pushing down more filters etc.). For example:
>>>
>>> Table b = …;
>>>
>>> MaterializedTable mb = b.materialize();
>>>
>>> val min = mb.min();
>>> val max = mb.max();
>>>
>>> val user42 = b.filter('userId = 42);
>>>
>>> could be more efficient compared to `b.cache()` if `filter('userId = 42)` allows for much more aggressive optimisations.
>>>
>>> Piotrek
>>>
>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <fhue...@gmail.com> wrote:
>>>>
>>>> I'm not suggesting to add support for Ignite. This was just an example. Plasma and Arrow sound interesting, too.
>>>> For the sake of this proposal, it would be up to the user to implement a TableFactory and corresponding TableSource / TableSink classes to persist and read the data.
>>>>
>>>> On Mon, 26 Nov 2018 at 12:06, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>
>>>>> What about also adding Apache Plasma + Arrow as an alternative to Apache Ignite?
>>>>> [1]
>>>>> https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
>>>>>
>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <fhue...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for the proposal!
>>>>>>
>>>>>> To summarize, you propose a new method Table.cache(): Table that will trigger a job and write the result into some temporary storage as defined by a TableFactory.
>>>>>> The cache() call blocks while the job is running and eventually returns a Table object that represents a scan of the temporary table.
>>>>>> When the "session" is closed (closing to be defined?), the temporary tables are all dropped.
>>>>>>
>>>>>> I think this behavior makes sense and is a good first step towards more interactive workloads.
>>>>>> However, its performance suffers from writing to and reading from external systems.
>>>>>> I think this is OK for now. Changes that would significantly improve the situation (i.e., pinning data in-memory across jobs) would have large impacts on many components of Flink.
>>>>>> Users could use in-memory filesystems or storage grids (Apache Ignite) to mitigate some of the performance effects.
>>>>>>
>>>>>> Best, Fabian
>>>>>>
>>>>>> On Mon, 26 Nov 2018 at 03:38, Becket Qin <becket....@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for the explanation, Piotrek.
>>>>>>>
>>>>>>> Is there any extra thing user can do on a MaterializedTable that they cannot do on a Table? After users call *table.cache(), *users can just use that table and do anything that is supported on a Table, including SQL.
>>>>>>>
>>>>>>> Naming wise, either cache() or materialize() sounds fine to me. cache() is a bit more general than materialize(). Given that we are enhancing the Table API to also support non-relational processing cases, cache() might be slightly better.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jiangjie (Becket) Qin
>>>>>>>
>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <pi...@data-artisans.com> wrote:
>>>>>>>
>>>>>>>> Hi Becket,
>>>>>>>>
>>>>>>>> Oops, sorry I didn’t notice that you intend to reuse the existing `TableFactory`. I don’t know why, but I assumed that you wanted to provide an alternative way of writing the data.
>>>>>>>>
>>>>>>>> Now that I hopefully understand the proposal, maybe we could rename `cache()` to
>>>>>>>>
>>>>>>>> void materialize()
>>>>>>>>
>>>>>>>> or, going a step further,
>>>>>>>>
>>>>>>>> MaterializedTable materialize()
>>>>>>>> MaterializedTable createMaterializedView()
>>>>>>>>
>>>>>>>> ?
>>>>>>>>
>>>>>>>> The second option with returning a handle I think is more flexible and could provide features such as “refresh”/“delete”, or generally speaking managing the view. In the future we could also think about adding hooks to automatically refresh the view etc. It is also more explicit: materialization returning a new table handle will not have the same implicit side effects as adding a simple line of code like `b.cache()` would have.
>>>>>>>>
>>>>>>>> It would also be more SQL like, making it more intuitive for users already familiar with SQL.
>>>>>>>>
>>>>>>>> Piotrek
>>>>>>>>
>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <becket....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Piotrek,
>>>>>>>>>
>>>>>>>>> For the cache() method itself, yes, it is equivalent to creating a BUILT-IN materialized view with a lifecycle. That functionality is missing today, though. Not sure if I understand your question. Do you mean we already have the functionality and just need some syntax sugar?
>>>>>>>>>
>>>>>>>>> What's more interesting in the proposal is: do we want to stop at creating the materialized view? Or do we want to extend that in the future to a more useful unified data store distributed with Flink? And do we want to have a mechanism to allow more flexible user job patterns with their own user defined services? These considerations are much more architectural.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>
>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <pi...@data-artisans.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Interesting idea. I’m trying to understand the problem. Isn’t the `cache()` call an equivalent of writing data to a sink and later reading from it? Where this sink has a limited life scope/lifetime? And the sink could be implemented as an in-memory or a file sink?
>>>>>>>>>>
>>>>>>>>>> If so, what’s the problem with creating a materialised view from a table “b” (from your document’s example) and reusing this materialised view later? Maybe we are lacking mechanisms to clean up materialised views (for example when the current session finishes)? Maybe we need some syntactic sugar on top of it?
>>>>>>>>>>
>>>>>>>>>> Piotrek
>>>>>>>>>>
>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <becket....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the suggestion, Jincheng.
>>>>>>>>>>>
>>>>>>>>>>> Yes, I think it makes sense to have a persist() with a lifecycle/defined scope. I just added a section in the future work for this.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <sunjincheng...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Jiangjie,
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for the explanation about the name of `cache()`, I understand why you designed it this way!
>>>>>>>>>>>>
>>>>>>>>>>>> Another idea is whether we can specify a lifecycle for data persistence? For example, persist(LifeCycle.SESSION), so that the user is not worried about data loss, and will clearly specify the time range for keeping the data. At the same time, if we want to expand, we could also share within a certain group of sessions, for example: LifeCycle.SESSION_GROUP(...). I am not sure, just an immature suggestion, for reference only!
>>>>>>>>>>>>
>>>>>>>>>>>> Bests,
>>>>>>>>>>>> Jincheng
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:33 PM, Becket Qin <becket....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Re: Jincheng,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the feedback.
>>>>>>>>>>>>> Regarding cache() vs. persist(), personally I find cache() to be more accurately describing the behavior, i.e. the Table is cached for the session, but will be deleted after the session is closed.
>>>>>>>>>>>>> persist() seems a little misleading as people might think the table will still be there even after the session is gone.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Great point about mixing the batch and stream processing in the same job. We should absolutely move towards that goal. I imagine that would be a huge change across the board, including sources, operators and optimizations, to name some. Likely we will need several separate in-depth discussions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <xingc...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle and the access domain are both orthogonal to the cache problem. Essentially, this may be the first time we plan to introduce another storage mechanism other than the state. Maybe it’s better to first draw a big picture and then concentrate on a specific part?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Becket, yes, actually I am more concerned with the underlying service. This seems to be quite a major change to the existing codebase. As you claimed, the service should be extendible to support other components, and we’d better discuss it in another thread.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> All in all, I am also eager to enjoy the more interactive Table API, given a general and flexible enough service mechanism.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Xingcan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <xiaow...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Relying on a callback for the temp table for clean up is not very reliable.
>>>>>>>>>>>>>>> There is no guarantee that it will be executed successfully. We may risk leaks when that happens. I think that it's safer to have an association between temp table and session id. So we can always clean up temp tables which are no longer associated with any active sessions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Xiaowei
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <sunjincheng...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Jiangjie & Shaoxuan,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for initiating this great proposal!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Interactive Programming is very useful and user friendly in the case of your examples.
>>>>>>>>>>>>>>>> Moreover, especially when a business has to be executed in several stages with dependencies, such as the pipeline of Flink ML, in order to utilize the intermediate calculation results we have to submit a job by env.execute().
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> About `cache()`, I think it is better to name it `persist()`, and let the Flink framework determine whether we internally cache in memory or persist to the storage system. Maybe save the data into a state backend (MemoryStateBackend or RocksDBStateBackend etc.).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW, from my point of view, in the future, support for streaming and batch mode switching in the same job will also benefit "Interactive Programming". I am looking forward to your JIRAs and FLIP!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Jincheng
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Nov 20, 2018 at 9:56 PM, Becket Qin <becket....@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As a few recent email threads have pointed out, it is a promising opportunity to enhance the Flink Table API in various aspects, including functionality and ease of use among others. One of the scenarios where we feel Flink could improve is interactive programming. To explain the issues and facilitate the discussion on the solution, we put together the following document with our proposal.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Feedback and comments are very welcome!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
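PS: to make point 3 above concrete, below is a minimal sketch of the handle-based API shape I have in mind. All of the names here (`CachedTable`, `refresh()`, `drop()`, and the simplified `Table` interface) are illustrative placeholders and not existing Flink classes or an agreed design; the same shape would apply if we end up calling the method `materialize()` and the handle `MaterializedTable`.

```java
// Illustrative sketch only: `Table`, `CachedTable` and the method names below are
// placeholders for the proposed API shape, not actual Flink interfaces.
interface Table {
    Table filter(String predicate);

    // Blocking call: runs a job, stores the result in session-scoped storage and
    // returns an explicit handle to the stored data. The original `Table` is left
    // untouched, so no other part of the program changes behaviour implicitly.
    CachedTable cache();
}

interface CachedTable extends Table {
    // Possible future extensions that only make sense once a handle exists:
    void refresh();   // re-run the job and replace the stored data
    void drop();      // release the session-scoped data eagerly
}

class Example {
    static void process(Table b) {
        CachedTable cachedB = b.cache();

        // Reads the pre-computed, session-scoped data.
        Table fromCache = cachedB.filter("userId = 42");

        // Still runs against the original plan of `b`, so the optimiser may push
        // the filter all the way down to the sources.
        Table fromOriginal = b.filter("userId = 42");
    }
}
```

The point of the sketch is that the user picks explicitly, per call site, whether to read from the cached data or from the original plan, instead of a single `b.cache()` line silently changing the behaviour of every other use of `b`.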