Optimization of SQL queries from Spark Data Frame to Ignite

Николай Ижиков Tue, 28 Nov 2017 09:55:01 -0800

Hello, guys.

I have implemented basic support for the Spark Data Frame API [1], [2] in
Ignite. Spark provides an API for a custom strategy to optimize queries from
Spark to the underlying data source (Ignite).


The goals of the optimization (obvious, but just to be on the same page):
Minimize data transfer between Spark and Ignite.
Speed up query execution.

I see 3 ways to optimize queries:

        1. *Join Reduce* If a query joins two or more Ignite tables, we have
to pass all the join info to Ignite and transfer only the result of the join
to Spark.
        To implement this we have to extend the current implementation with a
new RelationProvider that can generate all kinds of joins for two or more
tables.
        We should also add some tests.
        The question is: how should the join result be partitioned?


        2. *Order by* If a query to an Ignite table has an ORDER BY clause,
we can execute the sorting on the Ignite side.
        But it seems that Spark currently doesn't have any way to tell that
partitions are already sorted.


        3. *Key filter* If a query has `WHERE key = XXX` or `WHERE key IN
(X, Y, Z)`, we can reduce the number of partitions and query only the
partitions that store those key values.
        Is this kind of optimization already built into Ignite, or should I
implement it myself?
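
To make the key filter idea from item 3 concrete, here is a minimal sketch in
Python. It is not code from the PR. `toy_affinity`, `PARTS` and
`prune_partitions` are hypothetical names, and the modulo mapping is only a
stand-in for the real key-to-partition mapping (which, as far as I know,
Ignite exposes through its Affinity API).

```python
from typing import Callable, Iterable, List, Optional

PARTS = 8  # assumed partition count, for illustration only


def toy_affinity(key: int) -> int:
    """Hypothetical stand-in for the cache affinity function: key -> partition id."""
    return key % PARTS


def prune_partitions(
    all_partitions: Iterable[int],
    key_filter: Optional[Iterable[int]],
    partition_of: Callable[[int], int],
) -> List[int]:
    """Return only the partitions that can hold keys from the filter.

    key_filter holds the values from `WHERE key = X` or `WHERE key IN (...)`.
    None means the query has no key predicate, so every partition is scanned.
    """
    if key_filter is None:
        return list(all_partitions)
    wanted = {partition_of(k) for k in key_filter}
    return [p for p in all_partitions if p in wanted]


# WHERE key IN (3, 11, 4): keys 3 and 11 share a partition in this toy mapping.
selected = prune_partitions(range(PARTS), [3, 11, 4], toy_affinity)
print(selected)
```

With this toy mapping only 2 of the 8 partitions would be queried, so Spark
would create 2 tasks instead of 8 for such a query.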

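The data transfer argument behind item 1 can be shown with a small runnable
sketch. Note the assumptions: `sqlite3` is used purely as a stand-in for the
Ignite SQL engine, the `person` and `city` tables are made up, and
`join_on_client` / `join_pushed_down` are hypothetical helpers, not part of
Spark or Ignite.

```python
import sqlite3

# In-memory database playing the role of the data source (stand-in for Ignite).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city_id INTEGER);
    INSERT INTO city VALUES (1, 'Moscow'), (2, 'Berlin'), (3, 'Paris');
    INSERT INTO person VALUES (1, 'Ann', 1), (2, 'Bob', 1), (3, 'Eve', 4);
""")


def join_on_client(conn):
    """No pushdown: fetch both tables in full and join them on the client."""
    persons = conn.execute(
        "SELECT id, name, city_id FROM person ORDER BY id").fetchall()
    cities = conn.execute("SELECT id, name FROM city").fetchall()
    city_names = dict(cities)
    joined = [(name, city_names[city_id])
              for _, name, city_id in persons if city_id in city_names]
    return joined, len(persons) + len(cities)  # rows transferred


def join_pushed_down(conn):
    """Pushdown: send the whole join to the source, fetch only the result."""
    rows = conn.execute(
        "SELECT p.name, c.name FROM person p "
        "JOIN city c ON p.city_id = c.id ORDER BY p.id").fetchall()
    return rows, len(rows)  # rows transferred


client_rows, client_transferred = join_on_client(conn)
pushed_rows, pushed_transferred = join_pushed_down(conn)
print(client_transferred, pushed_transferred)
```

Both variants produce the same join result, but the pushed-down one moves
only the joined rows across the wire.
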
Maybe there are other ways to make queries run faster?

[1] https://spark.apache.org/docs/latest/sql-programming-guide.html
[2] https://github.com/apache/ignite/pull/2742
