Re: [DISCUSS] FLIP-92: JDBC catalog and Postgres catalog

Bowen Li Wed, 22 Jan 2020 11:07:15 -0800

Hi Flavio,

First, this is a generic question on how flink-jdbc is set up, not specific
to jdbc catalog, thus is better to be on its own thread.


But to just quickly answer your question, you need to see where the
incompatibility is. There may be incompatibility on 1) jdbc drivers and 2)
the databases. 1) is fairly stable and back-compatible. 2) normally has
things to do with your queries, not the driver.



On Tue, Jan 21, 2020 at 3:21 PM Flavio Pompermaier <[email protected]>
wrote:

> Hi all,
> I'm happy to see a lot of interest in easing the integration with JDBC data
> sources. Maybe this could be a rare situation (not in my experience
> however..) but what if I have to connect to the same type of source (e.g.
> Mysql) with 2 incompatible version...? How can I load the 2 (or more)
> connectors jars without causing conflicts?
>
> Il Mar 14 Gen 2020, 23:32 Bowen Li <[email protected]> ha scritto:
>
> > Hi devs,
> >
> > I've updated the wiki according to feedbacks. Please take another look.
> >
> > Thanks!
> >
> >
> > On Fri, Jan 10, 2020 at 2:24 PM Bowen Li <[email protected]> wrote:
> >
> > > Thanks everyone for the prompt feedback. Please see my response below.
> > >
> > > > In Postgress, the TIME/TIMESTAMP WITH TIME ZONE has the
> > > java.time.Instant semantic, and should be mapped to Flink's
> > TIME/TIMESTAMP
> > > WITH LOCAL TIME ZONE
> > >
> > > Zhenghua, you are right that pg's 'timestamp with timezone' should be
> > > translated into flink's 'timestamp with local timezone'. I don't find
> > 'time
> > > with (local) timezone' though, so we may not support that type from pg
> in
> > > Flink.
> > >
> > > > I suggest that the parameters can be completely consistent with the
> > > JDBCTableSource / JDBCTableSink. If you take a look to JDBC api:
> > > "DriverManager.getConnection".
> > > That allow "default db, username, pwd" things optional. They can
> included
> > > in URL. Of course JDBC api also allows establishing connections to
> > > different databases in a db instance. So I think we don't need provide
> a
> > > "base_url", we can just provide a real "url". To be consistent with
> JDBC
> > > api.
> > >
> > > Jingsong, what I'm saying is a builder can be added on demand later if
> > > there's enough user requesting it, and doesn't need to be a core part
> of
> > > the FLIP.
> > >
> > > Besides, unfortunately Postgres doesn't allow changing databases via
> > JDBC.
> > >
> > > JDBC provides different connecting options as you mentioned, but I'd
> like
> > > to keep our design and API simple and having to handle extra parsing
> > logic.
> > > And it doesn't shut the door for what you proposed as a future effort.
> > >
> > > > Since the PostgreSQL does not have catalog but schema under database,
> > > why not mapping the PG-database to Flink catalog and PG-schema to Flink
> > > database
> > >
> > > Danny, because 1) there are frequent use cases where users want to
> switch
> > > databases or referencing objects across databases in a pg instance 2)
> > > schema is an optional namespace layer in pg, it always has a default
> > value
> > > ("public") and can be invisible to users if they'd like to as shown in
> > the
> > > FLIP 3) as you mentioned it is specific to postgres, and I don't feel
> > it's
> > > necessary to map Postgres substantially different than others DBMSs
> with
> > > additional complexity
> > >
> > > >'base_url' configuration: We are following the configuration format
> > > guideline [1] which suggest to use dash (-) instead of underline (_).
> And
> > > I'm a little confused the meaning of "base_url" at the first glance,
> > > another idea is split it into several configurations: 'driver',
> > 'hostname',
> > > 'port'.
> > >
> > > Jark, I agreed we should use "base-url" in yaml config.
> > >
> > > I'm not sure about having hostname and port separately because you can
> > > specify multiple hosts with ports in jdbc, like
> > > "jdbc:dbms/host1:port1,host2:port2/", for connection failovers.
> > Separating
> > > them would make configurations harder.
> > >
> > > I will add clear doc and example to avoid any possible confusion.
> > >
> > > > 'default-database' is optional, then which database will be used or
> > what
> > > is the behavior when the default database is not selected.
> > >
> > > This should be DBMS specific. For postgres, it will be the <access id>
> > > database.
> > >
> > >
> > > On Thu, Jan 9, 2020 at 9:48 PM Zhenghua Gao <[email protected]> wrote:
> > >
> > >> Hi Bowen, Thanks for driving this.
> > >> I think it would be very convenience to use tables in external DBs
> with
> > >> JDBC Catalog.
> > >>
> > >> I have one concern about "Flink-Postgres Data Type Mapping" part:
> > >>
> > >> In Postgress, the TIME/TIMESTAMP WITH TIME ZONE has the
> > java.time.Instant
> > >> semantic,
> > >> and should be mapped to Flink's TIME/TIMESTAMP WITH LOCAL TIME ZONE
> > >>
> > >> *Best Regards,*
> > >> *Zhenghua Gao*
> > >>
> > >>
> > >> On Fri, Jan 10, 2020 at 11:09 AM Jingsong Li <[email protected]>
> > >> wrote:
> > >>
> > >> > Hi Bowen, thanks for reply and updating.
> > >> >
> > >> > > I don't see much value in providing a builder for jdbc catalogs,
> as
> > >> they
> > >> > only have 4 or 5 required params, no optional ones. I prefer users
> > just
> > >> > provide a base url without default db, usrname, pwd so we don't need
> > to
> > >> > parse url all around, as I mentioned jdbc catalog may need to
> > establish
> > >> > connections to different databases in a db instance,
> > >> >
> > >> > I suggest that the parameters can be completely consistent with the
> > >> > JDBCTableSource / JDBCTableSink.
> > >> > If you take a look to JDBC api: "DriverManager.getConnection".
> > >> > That allow "default db, username, pwd" things optional. They can
> > >> included
> > >> > in URL. Of course JDBC api also allows establishing connections to
> > >> > different databases in a db instance.
> > >> > So I think we don't need provide a "base_url", we can just provide a
> > >> real
> > >> > "url".
> > >> > To be consistent with JDBC api.
> > >> >
> > >> > Best,
> > >> > Jingsong Lee
> > >> >
> > >> > On Fri, Jan 10, 2020 at 10:34 AM Jark Wu <[email protected]> wrote:
> > >> >
> > >> > > Thanks Bowen for the reply,
> > >> > >
> > >> > > A user-facing JDBCCatalog and 'catalog.type' = 'jdbc'  sounds good
> > to
> > >> me.
> > >> > >
> > >> > > I have some other minor comments when I went through the updated
> > >> > > documentation:
> > >> > >
> > >> > > 1) 'base_url' configuration: We are following the configuration
> > format
> > >> > > guideline [1] which suggest to use dash (-) instead of underline
> > (_).
> > >> > >      And I'm a little confused the meaning of "base_url" at the
> > first
> > >> > > glance, another idea is split it into several configurations:
> > >> 'driver',
> > >> > > 'hostname', 'port'.
> > >> > >
> > >> > > 2) 'default-database' is optional, then which database will be
> used
> > or
> > >> > what
> > >> > > is the behavior when the default database is not selected.
> > >> > >
> > >> > > 3) a builder for jdbc catalogs: I agree with Jingsong to provide a
> > >> > builder.
> > >> > > Because there is optional configuration here (the default
> database),
> > >> > >    and providind Builder as the API will be easier for evolution,
> > I'm
> > >> not
> > >> > > sure we won't add/modify parameters in the future.
> > >> > >
> > >> > > [1]:
> > >> > >
> > >> > >
> > >> >
> > >>
> >
> https://flink.apache.org/contributing/code-style-and-quality-components.html#configuration-changes
> > >> > >
> > >> > > On Fri, 10 Jan 2020 at 04:52, Bowen Li <[email protected]>
> wrote:
> > >> > >
> > >> > > > Hi Jark and Jingsong,
> > >> > > >
> > >> > > > Thanks for your review. Please see my reply in line.
> > >> > > >
> > >> > > > > why introducing a `PostgresJDBCCatalog`, not a generic
> > >> `JDBCCatalog`
> > >> > > > (catalog.type = 'postgres' vs 'jdbc') ?
> > >> > > >
> > >> > > > Thanks for the reminding and I looked at JDBCDialect. A generic,
> > >> > > > user-facing JDBCCatalog with catalog.type = jdbc and find
> specific
> > >> db
> > >> > > > implementations (pg v.s. mysql v.s. ...) is more aligned with
> how
> > >> jdbc
> > >> > > > sink/source is handled, indeed. However, the catalogs would also
> > >> need
> > >> > to
> > >> > > > execute the query and parse query results in a db-dependent way.
> > >> E.g.
> > >> > > jdbc
> > >> > > > catalog needs to establish connections to different databases
> > >> within a
> > >> > db
> > >> > > > instance on demand. So just having JDBCDialect won't be enough.
> > >> > > >
> > >> > > > I think we can do the following:
> > >> > > >   - provide a user-facing JDBCCatalog, composing a db-specific
> > impl
> > >> > like
> > >> > > > PostgresJDBCCatalog and MySQLJDBCCatalog. Users still specify
> > >> "jdbc" as
> > >> > > > type in both Table API and SQL CLI, internally it will create a
> > >> > > db-specific
> > >> > > > impl depending on jdbc base url.
> > >> > > >   - some statements can reside in JDBCDialect. Query execution
> and
> > >> > result
> > >> > > > parsing logic would be located in db-specific impls.
> > >> > > >
> > >> > > > - We can provide a Builder for Catalog, In my opinion,
> > >> defaultDatabase,
> > >> > > > username, pwd can be included in JDBC DB url.
> > >> > > >
> > >> > > > I don't see much value in providing a builder for jdbc catalogs,
> > as
> > >> > they
> > >> > > > only have 4 or 5 required params, no optional ones. I prefer
> users
> > >> just
> > >> > > > provide a base url without default db, usrname, pwd so we don't
> > >> need to
> > >> > > > parse url all around, as I mentioned jdbc catalog may need to
> > >> establish
> > >> > > > connections to different databases in a db instance,
> > >> > > >
> > >> > > > - About timestamp and time, write down the specific Flink
> > precision
> > >> of
> > >> > > > Postgres?
> > >> > > >
> > >> > > > I've documented that. It's 0-6
> > >> > > >
> > >> > > > - I think there is a part missing in your document, that is how
> to
> > >> use
> > >> > > this
> > >> > > > catalog. If you can write a complete example, I think it will be
> > >> much
> > >> > > > clearer.
> > >> > > >
> > >> > > > I added some examples in both table api and SQL Cli. It will be
> no
> > >> > > > different from existing catalogs.
> > >> > > >
> > >> > > > - So a thing is what TableFactory will this catalog use? For
> > >> example,
> > >> > > > JDBCTableSourceSinkFactory has different parameters for source
> or
> > >> sink?
> > >> > > How
> > >> > > > do you think about it?
> > >> > > >
> > >> > > > This catalog will directly call JDBCTableSourceSinkFactory
> without
> > >> > going
> > >> > > > thru service discovery because we are sure it's a jdbc table. I
> > >> added
> > >> > it
> > >> > > to
> > >> > > > the doc.
> > >> > > >
> > >> > > > For the different params besides schema, as we discussed
> offline,
> > >> > > > unfortunately we can't do anything right now until Flink DDL/DML
> > are
> > >> > able
> > >> > > > to distinguish 3 types of params - external data's metada,
> > >> source/sink
> > >> > > > runtime params, and Flink semantics params. The latter two can't
> > be
> > >> > > > provided by catalogs. The problem is actually general to all
> > >> catalogs,
> > >> > > not
> > >> > > > just JDBCCatalog. I'm pushing for such an effort to solve it. At
> > >> this
> > >> > > > moment we can only use some default params for some cases, and
> the
> > >> > other
> > >> > > > cases cannot take advantage of the JDBC catalog and users still
> > >> have to
> > >> > > > write DDL manually.
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Bowen
> > >> > > >
> > >> > > > On Wed, Jan 8, 2020 at 7:46 PM Jingsong Li <
> > [email protected]>
> > >> > > wrote:
> > >> > > >
> > >> > > > > Thanks Bowen for driving this,
> > >> > > > >
> > >> > > > > +1 for this, The DDL schema definition is a headache for
> users,
> > >> and
> > >> > > > catalog
> > >> > > > > is a solution to this problem.
> > >> > > > >
> > >> > > > > I have some questions and suggestions:
> > >> > > > >
> > >> > > > > - We can provide a Builder for Catalog, In my opinion,
> > >> > defaultDatabase,
> > >> > > > > username, pwd can be included in JDBC DB url.
> > >> > > > >
> > >> > > > > - About timestamp and time, write down the specific Flink
> > >> precision
> > >> > of
> > >> > > > > Postgres?
> > >> > > > >
> > >> > > > > - I think there is a part missing in your document, that is
> how
> > to
> > >> > use
> > >> > > > this
> > >> > > > > catalog. If you can write a complete example, I think it will
> be
> > >> much
> > >> > > > > clearer.
> > >> > > > >
> > >> > > > > - So a thing is what TableFactory will this catalog use? For
> > >> example,
> > >> > > > > JDBCTableSourceSinkFactory has different parameters for source
> > or
> > >> > sink?
> > >> > > > How
> > >> > > > > do you think about it?
> > >> > > > >
> > >> > > > > Best,
> > >> > > > > Jingsong Lee
> > >> > > > >
> > >> > > > > On Thu, Jan 9, 2020 at 11:33 AM Jark Wu <[email protected]>
> > wrote:
> > >> > > > >
> > >> > > > > > Thanks Bowen for driving this.
> > >> > > > > >
> > >> > > > > > +1 to this feature.
> > >> > > > > >
> > >> > > > > > My concern is that why introducing a `PostgresJDBCCatalog`,
> > not
> > >> a
> > >> > > > generic
> > >> > > > > > `JDBCCatalog` (catalog.type = 'postgres' vs 'jdbc') ?
> > >> > > > > > From my understanding, JDBC catalog is similar to JDBC
> > >> source/sink.
> > >> > > For
> > >> > > > > > JDBC source/sink, we have a generic
> > >> > > > > > implementation for JDBC and delegate operations to
> > JDBCDialect.
> > >> > > > Different
> > >> > > > > > driver may have different implementation of
> > >> > > > > > JDBCDialect, e.g `quoteIdentifier()`.
> > >> > > > > >
> > >> > > > > > For JDBC catalog, I guess maybe we can do it in the same
> way,
> > >> i.e.
> > >> > a
> > >> > > > > > generic JDBCCatalog implementation and delegate
> > >> > > > > > operations to JDBCDialect, and we will have
> `listDataBase()`,
> > >> > > > > > `listTables()` interfaces in JDBCDialect. The benefit is
> that:
> > >> > > > > > 0) reuse the existing `JDBCDialect`, I guess JDBCCatalog
> also
> > >> need
> > >> > to
> > >> > > > > quote
> > >> > > > > > identifiers.
> > >> > > > > > 1) we can easily to support a new database catalog (e.g.
> > mysql)
> > >> by
> > >> > > > > > implementing new dialects (e.g. MySQLDialect).
> > >> > > > > > 2) this can keep the same behavior as JDBC source/sink, i.e.
> > >> > > > > > connector.type=jdbc, catalog.type=jdbc
> > >> > > > > >
> > >> > > > > > Best,
> > >> > > > > > Jark
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Thu, 9 Jan 2020 at 08:33, Bowen Li <[email protected]>
> > >> wrote:
> > >> > > > > >
> > >> > > > > > > Hi dev,
> > >> > > > > > >
> > >> > > > > > > I'd like to kick off a discussion on adding JDBC catalogs,
> > >> > > > specifically
> > >> > > > > > > Postgres catalog in Flink [1].
> > >> > > > > > >
> > >> > > > > > > Currently users have to manually create schemas in Flink
> > >> > > source/sink
> > >> > > > > > > mirroring tables in their relational databases in use
> cases
> > >> like
> > >> > > JDBC
> > >> > > > > > > read/write and consuming CDC. Many users have complaint
> > about
> > >> the
> > >> > > > > > > unnecessary, redundant, manual work. Any mismatch can lead
> > to
> > >> a
> > >> > > > failing
> > >> > > > > > > Flink job at runtime instead of compile time. All these
> have
> > >> been
> > >> > > > quite
> > >> > > > > > > unpleasant, resulting in a broken user experience.
> > >> > > > > > >
> > >> > > > > > > We want to provide a JDBC catalog interface and a Postgres
> > >> > > > > implementation
> > >> > > > > > > for Flink as a start to connect to all kinds of relational
> > >> > > databases,
> > >> > > > > > > enabling Flink SQL to 1) retrieve table schema
> automatically
> > >> > > without
> > >> > > > > > > requiring user writes duped DDL 2) check at compile time
> for
> > >> > schema
> > >> > > > > > errors.
> > >> > > > > > > It will greatly streamline user experiences when using
> Flink
> > >> to
> > >> > > deal
> > >> > > > > with
> > >> > > > > > > popular relational databases like Postgres, MySQL,
> MariaDB,
> > >> AWS
> > >> > > > Aurora,
> > >> > > > > > > etc.
> > >> > > > > > >
> > >> > > > > > > Note that the problem and solution are actually very
> general
> > >> to
> > >> > > Flink
> > >> > > > > > when
> > >> > > > > > > connecting to all kinds of external systems. We just focus
> > on
> > >> > > solving
> > >> > > > > > that
> > >> > > > > > > for relational databases in this FLIP.
> > >> > > > > > >
> > >> > > > > > > Thanks,
> > >> > > > > > > Bowen
> > >> > > > > > >
> > >> > > > > > > [1]
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-92%3A+JDBC+catalog+and+Postgres+catalog
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > --
> > >> > > > > Best, Jingsong Lee
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> > --
> > >> > Best, Jingsong Lee
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] FLIP-92: JDBC catalog and Postgres catalog

Reply via email to