Hello, guys.
I have implemented basic support of Spark Data Frame API [1], [2] for Ignite.
Spark provides API for a custom strategy to optimize queries from spark to
underlying data source(Ignite).
The goal of optimization(obvious, just to be on the same page):
Minimize data transfer between Spark and Ignite.
Speedup query execution.
I see 3 ways to optimize queries:
1. *Join Reduce* If one make some query that join two or more Ignite
tables, we have to pass all join info to Ignite and transfer to Spark only
result of table join.
To implement it we have to extend current implementation with new
RelationProvider that can generate all kind of joins for two or more tables.
We should add some tests, also.
The question is - how join result should be partitioned?
2. *Order by* If one make some query to Ignite table with order by
clause we can execute sorting on Ignite side.
But it seems that currently Spark doesn’t have any way to tell that
partitions already sorted.
3. *Key filter* If one make query with `WHERE key = XXX` or `WHERE key
IN (X, Y, Z)`, we can reduce number of partitions.
And query only partitions that store certain key values.
Is this kind of optimization already built in Ignite or I should
implement it by myself?
May be, there is any other way to make queries run faster?
[1] https://spark.apache.org/docs/latest/sql-programming-guide.html
[2] https://github.com/apache/ignite/pull/2742