Re: [DISCUSS] FLIP-68: Extend Core Table System with Modular Plugins

Jark Wu Wed, 09 Oct 2019 23:11:34 -0700

Hi Xuefu,

If there is only one instance per type, then what's the "name" used for?
Could we remove it and only keep "type" or "kind" to identify modules?


Best,
Jark

On Thu, 10 Oct 2019 at 11:21, Xuefu Z <[email protected]> wrote:

> Jark has a good point. However, I think validation logic can put in place
> to restrict one instance per type. Maybe the doc needs to be specific on
> this.
>
> Thanks,
> Xuefu
>
> On Wed, Oct 9, 2019 at 7:41 PM Jark Wu <[email protected]> wrote:
>
> > Thanks Bowen for the updating.
> >
> > I have some different opinions on the change.
> > IIUC, in the previous design, the "name" is also the "id" or "type" to
> > identify which module to load. Which means we can only load one instance
> of
> > a module.
> > In the new design, the "name" is just an alias to the module instance,
> the
> > "kind" is used to identify modules. Which means we can load different
> > instances of a module.
> > However, what's the "name" or alias used for? Do we need to support
> loading
> > different instances of a module? From my point of view, it brings more
> > complexity and confusion.
> > For example, if we load a "hive121" which uses HiveModule with version
> > 1.2.1 and load a "hive234" which uses HiveModule with version 2.3.4, then
> > how to solve the class conflict problem?
> >
> > IMO, a module can only be load once in a session, so "name" maybe
> useless.
> > So my proposal is similar to the previous one, but only change "name" to
> > "kind".
> >
> >    SQL:
> >          LOAD MODULE "kind" [WITH (properties)];
> >          UNLOAD MODULE "kind";
> >     Table:
> >          tEnv.loadModule("kind" [, properties]);
> >          tEnv.unloadModule("kind");
> >
> > What do you think?
> >
> >
> > Best,
> > Jark
> >
> >
> >
> >
> >
> > On Wed, 9 Oct 2019 at 20:38, Bowen Li <[email protected]> wrote:
> >
> > > Thanks everyone for your review.
> > >
> > > After discussing with Timo and Dawid offline, as well as incorporating
> > > feedback from Xuefu and Jark on mailing list, I decided to make a few
> > > critical changes to the proposal.
> > >
> > > - renamed the keyword "type" to "kind". The community has plan to have
> > > "type" keyword in yaml/descriptor refer to data types exclusively in
> the
> > > near future. We should cater to that change in our design
> > > - allowed specifying names for modules to simplify and unify module
> > > loading/unloading syntax between programming and SQL. Here're the
> > proposed
> > > changes:
> > >     SQL:
> > >          LOAD MODULE "name" WITH ("kind"="xxx" [, (properties)])
> > >          UNLOAD MODULE "name";
> > >     Table:
> > >          tEnv.loadModule("name", new Xxx(properties));
> > >          tEnv.unloadModule("name");
> > >
> > > I have completely updated the google doc [1]. Please take another look,
> > and
> > > let me know if you have any other questions. Thanks!
> > >
> > > [1]
> > >
> > >
> >
> https://docs.google.com/document/d/17CPMpMbPDjvM4selUVEfh_tqUK_oV0TODAUA9dfHakc/edit#
> > >
> > >
> > > On Tue, Oct 8, 2019 at 6:26 AM Jark Wu <[email protected]> wrote:
> > >
> > > > Hi Bowen,
> > > >
> > > > Thanks for the proposal. I have two thoughts:
> > > >
> > > > 1) Regarding to "loadModule", how about
> > > > tableEnv.loadModule("xxx" [, propertiesMap]);
> > > > tableEnv.unloadModule(“xxx”);
> > > >
> > > > This makes the API similar to SQL. IMO, instance of Module is not
> > needed
> > > > and verbose as parameter.
> > > > And this makes it easier to load a simple module without any
> additional
> > > > properties, e.g. tEnv.loadModule("GEO"), tEnv.unloadModule("GEO")
> > > >
> > > > 2) In current design, the module interface only defines function
> > > metadata,
> > > > but no implementations.
> > > > I'm wondering how to call/map the implementations in runtime? Am I
> > > missing
> > > > something?
> > > >
> > > > Besides, I left some minor comments in the doc.
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > >
> > > > On Sat, 5 Oct 2019 at 08:42, Xuefu Z <[email protected]> wrote:
> > > >
> > > > > I agree with Timo that the new table APIs need to be consistent.
> I'd
> > go
> > > > > further that an name (or id) is needed for module definition in
> YAML
> > > > file.
> > > > > In the current design, name is skipped and type has binary
> meanings.
> > > > >
> > > > > Thanks,
> > > > > Xuefu
> > > > >
> > > > > On Fri, Oct 4, 2019 at 5:24 AM Timo Walther <[email protected]>
> > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > first, I was also questioning my proposal. But Bowen's proposal
> of
> > > > > > `tEnv.offloadToYaml(<file_path>)` would not work with the current
> > > > design
> > > > > > because we don't know how to serialize a catalog or module into
> > > > > > properties. Currently, there is no converter from instance to
> > > > > > properties. It is a one way conversion. We can add a
> `toProperties`
> > > > > > method to both Catalog and Module class in the future to solve
> > this.
> > > > > > Solving the table environment serializability can be future work.
> > > > > >
> > > > > > However, I find the current proposal for the TableEnvironment
> > methods
> > > > is
> > > > > > contradicting:
> > > > > >
> > > > > > tableEnv.loadModule(new Yyy());
> > > > > > tableEnv.unloadModule(“Xxx”);
> > > > > >
> > > > > > The loading is specified programmatically whereas the unloading
> > > > requires
> > > > > > a string that is not specified in the module itself. But is
> defined
> > > in
> > > > > > the factory according to the current design.
> > > > > >
> > > > > > SQL does it more consistently. There, the name `xxx` is used when
> > > > > > loading and unloading the module:
> > > > > >
> > > > > > LOAD MODULE 'xxx' [WITH ('prop'='myProp', ...)]
> > > > > > UNLOAD MODULE 'xxx’
> > > > > >
> > > > > > How about:
> > > > > >
> > > > > > tableEnv.loadModule("xxx", new Yyy());
> > > > > > tableEnv.unloadModule(“xxx”);
> > > > > >
> > > > > > This would be similar to the catalog interfaces. The name is not
> > part
> > > > of
> > > > > > the instance itself.
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Thanks,
> > > > > > Timo
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 01.10.19 21:17, Bowen Li wrote:
> > > > > > > If something like the yaml file is the way to go and achieve
> such
> > > > > > > motivation, we would cover that with current design.
> > > > > > >
> > > > > > > On Tue, Oct 1, 2019 at 12:05 Bowen Li <[email protected]>
> > wrote:
> > > > > > >
> > > > > > >> Hi Timo, Dawid,
> > > > > > >>
> > > > > > >> I've added the suggested SQL and related changes to
> > > TableEnvironment
> > > > > API
> > > > > > >> and other classes to the google doc. Also removed "USE MODULE"
> > and
> > > > its
> > > > > > >> APIs. Will update FLIP wiki once we have a consensus.
> > > > > > >>
> > > > > > >> W.r.t. descriptor approach, my gut feeling is similar to
> > Dawid's.
> > > > > > Besides,
> > > > > > >> I feel yaml file would be a better solution to persist
> > > serializable
> > > > > > state
> > > > > > >> of an environment as the file itself is in serializable format
> > > > > already.
> > > > > > >> Though yaml file only serves SQL CLI at this moment, we may be
> > > able
> > > > to
> > > > > > >> extend its reach to Table API and allow users to load/offload
> a
> > > > > > >> TableEnvironment from/to yaml files, as something like
> > > > > "TableEnvironment
> > > > > > >> tEnv = TableEnvironment.loadFromYaml(<file_path>)" and
> > > > > > >> "tEnv.offloadToYaml(<file_path>)" to restore and persist
> state,
> > > and
> > > > > try
> > > > > > to
> > > > > > >> make yaml file more expressive.
> > > > > > >>
> > > > > > >>
> > > > > > >> On Tue, Oct 1, 2019 at 6:47 AM Dawid Wysakowicz <
> > > > > [email protected]
> > > > > > >
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Hi Timo, Bowen,
> > > > > > >>>
> > > > > > >>> Unfortunately I did not have enough time to go through all
> the
> > > > > > >>> suggestions in details so I can not comment on the whole
> FLIP.
> > > > > > >>>
> > > > > > >>> I just wanted to give my opinion on the "descriptor approach
> in
> > > > > > >>> loadModule" part. I am not sure if we need it here. We might
> be
> > > > > > >>> overthinking this a bit. It definitely makes sense for
> objects
> > > like
> > > > > > >>> TableSource/TableSink etc. as they are logical definitions
> that
> > > > > nearly
> > > > > > >>> always have to be persisted in a Catalog. I'm not sure if we
> > > really
> > > > > > need
> > > > > > >>> the same for a whole session. If we need a resume session
> > > feature,
> > > > > the
> > > > > > >>> way to go would probably be to keep the session in memory on
> > the
> > > > > server
> > > > > > >>> side. I fear we will never be able to serialize the whole
> > session
> > > > > > >>> entirely (temporary objects, objects derived from DataStream
> > > etc.)
> > > > > > >>>
> > > > > > >>> I think it is ok to use instances for objects like Catalogs
> or
> > > > > Modules
> > > > > > >>> and have an overlay on top of that that can create instances
> > from
> > > > > > >>> properties.
> > > > > > >>>
> > > > > > >>> Best,
> > > > > > >>>
> > > > > > >>> Dawid
> > > > > > >>>
> > > > > > >>> On 01/10/2019 11:28, Timo Walther wrote:
> > > > > > >>>> Hi Bowen,
> > > > > > >>>>
> > > > > > >>>> thanks for your response.
> > > > > > >>>>
> > > > > > >>>> Re 2) I also don't have a better approach for this issue. It
> > is
> > > > > > >>>> similar to changing the general TableConfig between two
> > > > statements.
> > > > > It
> > > > > > >>>> would be good to add your explanation to the design
> document.
> > > > > > >>>>
> > > > > > >>>> Re 3) It would be interesting to know about which "core"
> > > functions
> > > > > we
> > > > > > >>>> are actually talking about. Also for the overriding built-in
> > > > > functions
> > > > > > >>>> that we discussed in the other FLIP. But I'm fine with
> leaving
> > > it
> > > > to
> > > > > > >>>> the user for now. How about we just introduce loadModule(),
> > > > > > >>>> unloadModule() methods instead of useModules()? This would
> > > ensure
> > > > > that
> > > > > > >>>> users don't forget to add the core module when adding an
> > > > additional
> > > > > > >>>> module and they need to explicitly call
> > "unloadModule('core')".
> > > > > > >>>>
> > > > > > >>>> Re 4) Every table environment feature should also be
> designed
> > > with
> > > > > SQL
> > > > > > >>>> statements in mind to verify the concept. SQL is also more
> > > popular
> > > > > > >>>> that Java/Scala API or YAML file. I would like to add it to
> > 1.10
> > > > for
> > > > > > >>>> marking the feature as complete.
> > > > > > >>>>
> > > > > > >>>> SHOW MODULES -> sounds good to me, we should add a
> > > listModules():
> > > > > > >>>> List<String> method to table environment
> > > > > > >>>>
> > > > > > >>>> LOAD MODULE 'hive' [WITH ('prop'='myProp', ...)] --> we
> should
> > > > add a
> > > > > > >>>> loadModule() method to table environment
> > > > > > >>>>
> > > > > > >>>> UNLOAD MODULE 'hive' --> we should add a unloadModule()
> method
> > > to
> > > > > > >>>> table environment
> > > > > > >>>>
> > > > > > >>>> I would not introduce `USE MODULES 'x' 'y' 'z'` for
> simplicity
> > > and
> > > > > > >>>> concise API. Users need to load the module anyway with
> > > properties.
> > > > > > >>>> They can also load them "in order" immediately. CREATE TABLE
> > can
> > > > > also
> > > > > > >>>> not create multiple tables but only one at a time in that
> > order.
> > > > > > >>>>
> > > > > > >>>> One thing that came to my mind, shall we use a descriptor
> > > approach
> > > > > for
> > > > > > >>>> loadModule()? The past has shown that passing instances
> causes
> > > > > > >>>> problems when persisting objects. That's why we also want to
> > get
> > > > rid
> > > > > > >>>> of registerTableSource. I could image that users might want
> to
> > > > > persist
> > > > > > >>>> a table environment's state for later use in the future.
> Even
> > > > though
> > > > > > >>>> this is future work, we should already keep such use cases
> in
> > > mind
> > > > > > >>>> when adding new API methods. What do you think?
> > > > > > >>>>
> > > > > > >>>> Regards,
> > > > > > >>>> Timo
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On 30.09.19 23:17, Bowen Li wrote:
> > > > > > >>>>> Hi Timo,
> > > > > > >>>>>
> > > > > > >>>>> Re 1) I agree. I renamed the title to "Extend Core Table
> > System
> > > > > with
> > > > > > >>>>> Pluggable Modules" and all internal references
> > > > > > >>>>>
> > > > > > >>>>> Re 2) First, I'll rename the API to useModules(). The
> design
> > > > > doesn't
> > > > > > >>>>> forbid
> > > > > > >>>>> users to call useModules() multi times. Objects in modules
> > are
> > > > > loaded
> > > > > > >>> on
> > > > > > >>>>> demand instead of eagerly, so there won't be inconsistency.
> > > Users
> > > > > > >>>>> have to
> > > > > > >>>>> be fully aware of the consequences of resetting modules as
> > that
> > > > > might
> > > > > > >>>>> cause
> > > > > > >>>>> that some objects can not be referenced anymore or
> resolution
> > > > order
> > > > > > >>>>> of some
> > > > > > >>>>> objects changes.
> > > > > > >>>>>
> > > > > > >>>>> Re 3) Yes, we'd leave that to users.
> > > > > > >>>>>
> > > > > > >>>>> Another approach can be to have a non-optional "Core"
> module
> > > for
> > > > > all
> > > > > > >>>>> objects that cannot be overrode like "CAST" and "AS"
> > functions,
> > > > and
> > > > > > >>>>> have an
> > > > > > >>>>> optional "ExtendedCore" module for other replaceable
> built-in
> > > > > > objects.
> > > > > > >>>>> "Core" should be positioned at the 1st in module list all
> the
> > > > time.
> > > > > > >>>>>
> > > > > > >>>>> I'm fine with either solution.
> > > > > > >>>>>
> > > > > > >>>>> Re 4) It may sound like a nice-to-have advanced feature for
> > > 1.10,
> > > > > but
> > > > > > >>> we
> > > > > > >>>>> can surely fully discuss it for the sake of feature
> > > completeness.
> > > > > > >>>>>
> > > > > > >>>>> Unlike other configs, the order of modules would matter in
> > > Flink,
> > > > > and
> > > > > > >>> it
> > > > > > >>>>> implies the LOAD/UNLOAD commands would not be equal in
> > > operation
> > > > > > >>>>> positions.
> > > > > > >>>>> IIUYC, LOAD MODULE 'x' would be interpreted as appending x
> to
> > > the
> > > > > end
> > > > > > >>> of
> > > > > > >>>>> module list, and UNLOAD MODULE 'x' would be interpreted as
> > > > > removing x
> > > > > > >>>>> from
> > > > > > >>>>> any position in the list?
> > > > > > >>>>>
> > > > > > >>>>> I'm thinking of the following list of commands:
> > > > > > >>>>>
> > > > > > >>>>> SHOW MODULES - list modules in order
> > > > > > >>>>> LOAD MODULE 'hive' [WITH ('prop'='myProp', ...)] - load and
> > > > append
> > > > > > the
> > > > > > >>>>> module to end of the module list
> > > > > > >>>>> UNLOAD MODULE 'hive' - remove the module from module list,
> > and
> > > > > other
> > > > > > >>>>> modules remain the same relative positions
> > > > > > >>>>> USE MODULES 'x' 'y' 'z' (wondering can parser take "'x' 'y'
> > > > 'z'"?),
> > > > > > >>>>> or USE
> > > > > > >>>>> MODULES 'x,y,z' - to reorder module list completely
> > > > > > >>>>>
> > > > > > >>>
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > Xuefu Zhang
> > > > >
> > > > > "In Honey We Trust!"
> > > > >
> > > >
> > >
> >
>
>
> --
> Xuefu Zhang
>
> "In Honey We Trust!"
>

Re: [DISCUSS] FLIP-68: Extend Core Table System with Modular Plugins

Reply via email to