Hi Val, yes that's correct. I'd be happy to make the change to have the database reference the schema if Nikolay agrees. (I'll first need to do a bit of research into how to obtain the list of all available schemata...)
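[Editor's note: as a starting point for that research, Ignite's SQL engine at the time was H2-based, so one candidate is querying the standard INFORMATION_SCHEMA.SCHEMATA view over the thin JDBC driver. This is only a sketch under that assumption — the JDBC URL is a placeholder and the availability of the view in a given Ignite version is not verified here.]

```scala
// Sketch: enumerate available schemata via Ignite's JDBC driver, assuming
// the H2-backed engine exposes the standard INFORMATION_SCHEMA views.
// The connection URL is a placeholder; this is not verified Ignite API.
import java.sql.DriverManager

object ListSchemata {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1")
    try {
      val rs = conn.createStatement()
        .executeQuery("SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA")
      while (rs.next()) println(rs.getString(1))
    } finally conn.close()
  }
}
```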
Thanks,
Stuart.

On Tue, Aug 21, 2018 at 9:43 PM, Valentin Kulichenko <
valentin.kuliche...@gmail.com> wrote:

> Stuart,
>
> Thanks for pointing this out; I was not aware that we use the Spark
> database concept this way. Actually, this confuses me a lot. As far as I
> understand, the catalog is created in the scope of a particular
> IgniteSparkSession, which in turn is assigned to a particular
> IgniteContext and therefore a single Ignite client. If that's the case, I
> don't think it should be aware of other Ignite clients that are connected
> to other clusters. This doesn't look like correct behavior to me, not to
> mention that with this approach having multiple databases would be a very
> rare case. I believe we should get rid of this logic and use the Ignite
> schema name as the database name in Spark's catalog.
>
> Nikolay, what do you think?
>
> -Val
>
> On Tue, Aug 21, 2018 at 8:17 AM Stuart Macdonald <stu...@stuwee.org>
> wrote:
>
>> Nikolay, Val,
>>
>> The JDBC Spark datasource [1] -- as far as I can tell -- has no
>> ExternalCatalog implementation; it just uses the database specified in
>> the JDBC URL. So I don't believe there is any way to call listTables()
>> or listDatabases() for the JDBC provider.
>>
>> The Hive ExternalCatalog [2] makes the distinction between database and
>> table using the actual database and table mechanisms built into the
>> catalog, which is fine because Hive has a clear distinction and
>> hierarchy of databases and tables.
>>
>> *However*, Ignite already uses the "database" concept in the Ignite
>> ExternalCatalog [3] to mean the name of an Ignite instance. So in Ignite
>> we have instances containing schemas containing tables, while Spark only
>> has the concepts of databases and tables, so it seems we must either
>> ignore one of the three Ignite concepts or combine two of them into
>> database or table. The current implementation in the pull request
>> combines the Ignite schema and table attributes into the Spark table
>> attribute.
>>
>> Stuart.
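[Editor's note: from the Spark user's side, Val's proposal would look roughly like the sketch below. This shows the *proposed* mapping of Ignite schemata to Spark databases, not current behavior; the config path and schema name are placeholders.]

```scala
// Sketch of the proposed mapping: Ignite schemata exposed as Spark databases.
// Hypothetical behavior -- config path and "MYSCHEMA" are placeholders.
import org.apache.spark.sql.ignite.IgniteSparkSession

val spark = IgniteSparkSession.builder()
  .appName("catalog-example")
  .master("local[*]")
  .igniteConfig("/path/to/ignite-config.xml") // placeholder path
  .getOrCreate()

// Under the proposal, this would list Ignite schemata as databases...
spark.catalog.listDatabases().show()
// ...and this would list the tables within one schema:
spark.catalog.listTables("MYSCHEMA").show()
```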
>>
>> [1]
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
>> [2]
>> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
>> [3]
>> https://github.com/apache/ignite/blob/master/modules/spark/src/main/scala/org/apache/spark/sql/ignite/IgniteExternalCatalog.scala
>>
>> On Tue, Aug 21, 2018 at 9:31 AM, Nikolay Izhikov <nizhi...@apache.org>
>> wrote:
>>
>> > Hello, Stuart.
>> >
>> > Can you do some research and find out how schema is handled in Data
>> > Frames for a regular RDBMS such as Oracle, MySQL, etc.?
>> >
>> > On Mon, 20/08/2018 at 15:37 -0700, Valentin Kulichenko wrote:
>> > > Stuart, Nikolay,
>> > >
>> > > I see that the 'Table' class (returned by the listTables method) has
>> > > a 'database' field. Can we use this one to report the schema name?
>> > >
>> > > In any case, I think we should look into how this is done in data
>> > > source implementations for other databases. Any relational database
>> > > has a notion of schema, and I'm sure Spark integrations take this
>> > > into account somehow.
>> > >
>> > > -Val
>> > >
>> > > On Mon, Aug 20, 2018 at 6:12 AM Nikolay Izhikov <nizhi...@apache.org>
>> > > wrote:
>> > > > Hello, Stuart.
>> > > >
>> > > > Personally, I think we should change the current table naming and
>> > > > return tables in the form `schema.table`.
>> > > >
>> > > > Valentin, could you share your opinion?
>> > > >
>> > > > On Mon, 20/08/2018 at 10:04 +0100, Stuart Macdonald wrote:
>> > > > > Igniters,
>> > > > >
>> > > > > While reviewing the changes for IGNITE-9228 [1,2], Nikolay and I
>> > > > > are discussing whether to introduce a change which may impact
>> > > > > backwards compatibility; Nikolay suggested we take the discussion
>> > > > > to this list.
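[Editor's note: on Nikolay's question -- in Spark's JDBC datasource, an RDBMS schema is simply baked into the `dbtable` option as a qualified name; there is no catalog listing involved. A minimal sketch, where the connection URL and table names are placeholders:]

```scala
// Sketch: how a regular RDBMS schema reaches Spark's JDBC datasource.
// The schema is part of the dbtable option; no listTables()/listDatabases().
// URL, credentials, and names below are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("jdbc-schema-example")
  .getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/testdb") // placeholder URL
  .option("dbtable", "mySchema.myTable")               // schema-qualified name
  .load()
```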
>> > > > >
>> > > > > Ignite implements a custom Spark catalog which provides an API
>> > > > > by which Spark users can list the tables available in Ignite
>> > > > > that can be queried via Spark SQL. Currently that table list
>> > > > > includes just the names of the tables, but IGNITE-9228
>> > > > > introduces a change which allows optional prefixing of schema
>> > > > > names to table names to disambiguate multiple tables with the
>> > > > > same name in different schemas. For the "list tables" API we
>> > > > > therefore have two options:
>> > > > >
>> > > > > 1. List the tables under both their table names and
>> > > > > schema-qualified table names (e.g. [ "myTable",
>> > > > > "mySchema.myTable" ]) even though they are the same underlying
>> > > > > table. This retains backwards compatibility with users who
>> > > > > expect "myTable" to appear in the catalog.
>> > > > > 2. List the tables using only their schema-qualified names.
>> > > > > This eliminates duplication of names in the catalog but will
>> > > > > potentially break compatibility with users who expect the plain
>> > > > > table name in the catalog.
>> > > > >
>> > > > > With either option we will allow Spark SQL SELECT statements to
>> > > > > use either table names or schema-qualified table names; this
>> > > > > change would purely impact the API which is used to list
>> > > > > available tables.
>> > > > >
>> > > > > Any opinions would be welcome.
>> > > > >
>> > > > > Thanks,
>> > > > > Stuart.
>> > > > >
>> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9228
>> > > > > [2] https://github.com/apache/ignite/pull/4551
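[Editor's note: the two listing options above can be sketched as pure name-building logic. The types and function names here are invented purely for illustration; they are not Ignite API.]

```scala
// Invented helper type and functions to illustrate the two listing options.
case class IgniteTable(schema: String, name: String)

// Option 1: plain names plus schema-qualified names (backwards compatible,
// but the same underlying table appears twice).
def listOption1(tables: Seq[IgniteTable]): Seq[String] =
  tables.flatMap(t => Seq(t.name, s"${t.schema}.${t.name}")).distinct

// Option 2: schema-qualified names only (no duplication, but breaks callers
// that expect the bare table name in the catalog).
def listOption2(tables: Seq[IgniteTable]): Seq[String] =
  tables.map(t => s"${t.schema}.${t.name}")

val ts = Seq(IgniteTable("mySchema", "myTable"),
             IgniteTable("otherSchema", "myTable"))
println(listOption1(ts)) // List(myTable, mySchema.myTable, otherSchema.myTable)
println(listOption2(ts)) // List(mySchema.myTable, otherSchema.myTable)
```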