Hi Nikolay,

No, it is not possible to get this info from the public API, nor do we plan
to expose it. See IGNITE-4509 and commit *fbf0e353* to get a better
understanding of how this was implemented.
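For context: what is internal here is the partition set that the SQL engine
derives from a query. The plain key-to-partition mapping, on the other hand,
is public through the Affinity API. A minimal sketch in Scala (the cache name
"person" is a made-up example, not an existing cache):

    import org.apache.ignite.Ignition

    object AffinityLookup extends App {
      val ignite   = Ignition.start()                // start an Ignite node
      val affinity = ignite.affinity[Int]("person")  // public Affinity API; "person" is a made-up cache name

      val part = affinity.partition(42)              // partition owning key 42
      val node = affinity.mapPartitionToNode(part)   // primary node for that partition

      println(s"key 42 -> partition $part on node ${node.id()}")
    }
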
Vladimir.

On Wed, Nov 29, 2017 at 2:01 PM, Николай Ижиков <nizhikov....@gmail.com> wrote:

> Hello, Vladimir.
>
> > partition pruning is already implemented in Ignite, so there is no need
> > to do this on your own.
>
> Spark works with partitioned data sets.
> A custom Data Source (Ignite) is required to provide data partition
> information to Spark.
>
> Can I get information about pruned partitions through some public API?
> Is there a plan or ticket to implement such an API?
>
> 2017-11-29 10:34 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>
> > Nikolay,
> >
> > Regarding p.3: partition pruning is already implemented in Ignite, so
> > there is no need to do this on your own.
> >
> > On Wed, Nov 29, 2017 at 3:23 AM, Valentin Kulichenko <
> > valentin.kuliche...@gmail.com> wrote:
> >
> > > Nikolay,
> > >
> > > A custom strategy allows us to fully process the AST generated by
> > > Spark and convert it to Ignite SQL, so there will be no execution on
> > > the Spark side at all. This is what we are trying to achieve here.
> > > Basically, one will be able to use the DataFrame API to execute
> > > queries directly on Ignite. Does that make sense to you?
> > >
> > > I would recommend taking a look at the MemSQL implementation, which
> > > does similar stuff: https://github.com/memsql/memsql-spark-connector
> > >
> > > Note that this approach will work only if all relations included in
> > > the AST are Ignite tables. Otherwise, the strategy should return null
> > > so that Spark falls back to its regular mode. Ignite will be used as
> > > a regular data source in this case, and it's probably possible to
> > > implement some optimizations here as well. However, I have never
> > > investigated this, and it seems like a separate discussion.
> > >
> > > -Val
> > >
> > > On Tue, Nov 28, 2017 at 9:54 AM, Николай Ижиков <
> > > nizhikov....@gmail.com> wrote:
> > >
> > > > Hello, guys.
> > > >
> > > > I have implemented basic support of the Spark Data Frame API [1],
> > > > [2] for Ignite.
> > > > Spark provides an API for a custom strategy to optimize queries
> > > > from Spark to the underlying data source (Ignite).
> > > >
> > > > The goals of optimization (obvious, just to be on the same page):
> > > > Minimize data transfer between Spark and Ignite.
> > > > Speed up query execution.
> > > >
> > > > I see 3 ways to optimize queries:
> > > >
> > > > 1. *Join Reduce* If one makes a query that joins two or more Ignite
> > > > tables, we have to pass all the join info to Ignite and transfer
> > > > only the result of the join back to Spark.
> > > > To implement it, we have to extend the current implementation with
> > > > a new RelationProvider that can generate all kinds of joins for two
> > > > or more tables. We should also add some tests.
> > > > The question is: how should the join result be partitioned?
> > > >
> > > > 2. *Order by* If one makes a query to an Ignite table with an ORDER
> > > > BY clause, we can execute the sorting on the Ignite side.
> > > > But it seems that Spark currently has no way to be told that
> > > > partitions are already sorted.
> > > >
> > > > 3. *Key filter* If one makes a query with `WHERE key = XXX` or
> > > > `WHERE key IN (X, Y, Z)`, we can reduce the number of partitions
> > > > and query only the partitions that store the given key values.
> > > > Is this kind of optimization already built into Ignite, or should I
> > > > implement it myself?
> > > >
> > > > Maybe there is another way to make queries run faster?
> > > >
> > > > [1] https://spark.apache.org/docs/latest/sql-programming-guide.html
> > > > [2] https://github.com/apache/ignite/pull/2742
>
>
> --
> Nikolay Izhikov
> nizhikov....@gmail.com
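
P.S. On the custom strategy Val describes above: here is a minimal sketch
(Scala) of the shape it could take. Only the Strategy contract and the
registration through spark.experimental.extraStrategies are real Spark API;
the IgniteStrategy name and the toIgniteSql stub are hypothetical
placeholders, not existing Ignite code.

    import org.apache.spark.sql.{SparkSession, Strategy}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.execution.SparkPlan

    // Hypothetical sketch, not existing Ignite code.
    object IgniteStrategy extends Strategy {

      override def apply(plan: LogicalPlan): Seq[SparkPlan] =
        toIgniteSql(plan) match {
          case Some(sql) =>
            // A real implementation would return a physical node here that
            // runs `sql` on Ignite and exposes the result to Spark.
            Nil // placeholder: plug in an Ignite-backed SparkPlan
          case None =>
            // Not all relations in the AST are Ignite tables: returning an
            // empty Seq makes Spark fall back to its regular planning.
            Nil
        }

      // Hypothetical stub: Some(sql) only when the whole AST (relations,
      // filters, joins) can be rendered as a single Ignite SQL query.
      private def toIgniteSql(plan: LogicalPlan): Option[String] = None
    }

    object RegisterStrategy extends App {
      val spark = SparkSession.builder().master("local").getOrCreate()

      // Prepend the strategy so the planner consults it before the built-ins.
      spark.experimental.extraStrategies =
        IgniteStrategy +: spark.experimental.extraStrategies
    }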
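
And on Nikolay's point 3 (key filter): combining the public Affinity API with
SqlFieldsQuery#setPartitions (assuming your Ignite version supports it) lets
one query only the partitions that own the requested keys. A sketch, again
with a made-up "person" cache and Person table:

    import org.apache.ignite.Ignition
    import org.apache.ignite.cache.query.SqlFieldsQuery
    import scala.collection.JavaConverters._

    object KeyFilterPruning extends App {
      val ignite   = Ignition.start()
      val cache    = ignite.cache[Int, AnyRef]("person")  // "person" is a made-up cache name
      val affinity = ignite.affinity[Int]("person")

      // Keys extracted from a `WHERE key IN (1, 5, 42)` filter.
      val keys = Seq(1, 5, 42)

      // Map each key to its owning partition; query only those partitions.
      val parts = keys.map(k => affinity.partition(k)).distinct.sorted

      val qry = new SqlFieldsQuery(
          "SELECT name FROM Person WHERE _key IN (?, ?, ?)")
        .setArgs(keys.map(Int.box): _*)
        .setPartitions(parts: _*)

      cache.query(qry).getAll.asScala.foreach(row => println(row.get(0)))
    }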