So, AFAIK the Spark adapter inside Calcite is in an unusable state right now. It is still built against Spark 1.x, and the last time I tried it I couldn't get it to run. It probably needs either to be removed or completely rewritten. But I can certainly offer some guidance on working with Spark and Calcite.

As we were discussing on the other thread, I've been doing research at my company on optimizing Spark queries with Calcite. It may or may not be open sourced sometime in the near future; I don't know yet. There are really two ways to go about optimizing Spark queries with Calcite.

The first option is the approach the current code in Calcite takes: run Calcite on RDDs. The code you see in Calcite appears to predate Spark SQL, or at least to have been developed as an alternative to it. It lets you run Calcite SQL queries on Spark by converting optimized Calcite plans into Spark RDD operations, using RDD methods for relational expressions and Calcite's Enumerables for row expressions.
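For reference, this is roughly how that adapter is meant to be invoked: through Calcite's JDBC driver with the "spark" connection property set (a real setup would also point a "model" property at a schema definition). A minimal sketch, and given the state of the adapter I wouldn't expect it to actually run today:

    import java.sql.DriverManager

    // "spark=true" asks Calcite to execute its optimized plans as Spark
    // RDD operations instead of the default Enumerable interpreter.
    val connection = DriverManager.getConnection("jdbc:calcite:spark=true")
    val statement = connection.createStatement()
    // An inline VALUES query keeps the example self-contained; a real
    // schema would come from a model file.
    val resultSet = statement.executeQuery(
      "select * from (values (1, 'a'), (2, 'b')) as t (x, y)")
    while (resultSet.next()) {
      println(s"${resultSet.getInt(1)} ${resultSet.getString(2)}")
    }
    connection.close()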
Alternatively, what we wanted to do when we started our project was integrate Calcite directly into Spark SQL. Spark SQL/DataFrames/Datasets are widely used APIs, and we wanted to see whether we could apply Calcite's significantly better optimization techniques to Spark's plans without breaking those APIs. That's the second way to go about it. What we did is essentially implement a custom Optimizer (a Spark interface) that converts Spark logical plans to Calcite logical plans, uses Calcite to optimize them, and then converts the result back to Spark. In effect, this is a complete replacement of the optimization phase of Catalyst (Spark's optimizer). Converting from Spark plans to Calcite plans and back is admittedly a major challenge, though; it has taken months to perfect for more complex expressions like aggregations and grouping sets.

So the two options really are: replace Spark SQL with Calcite, or integrate Calcite into Spark SQL. The former is a fairly straightforward use case for Calcite. The latter requires a deep understanding of both Calcite's and Spark's relational algebra, and writing algorithms to convert between the two. But I can say that it has been very successful. We've been able to improve Spark's performance quite significantly on all different types of data, including flat files, and have seen 1-2 orders of magnitude improvements in Spark's performance against databases like Postgres, Redshift, Mongo, etc. in TPC-DS benchmarks.
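To give a flavor of the second approach, here is a minimal sketch. It hooks a Calcite round trip into Catalyst via Spark 2.x's spark.experimental.extraOptimizations, which only appends rules to Catalyst's own optimizer batches; fully replacing the Optimizer, as we did, means supplying your own session internals, for which there is no stable public hook. The two converter objects are stubs, and writing them is where the months of effort go:

    import org.apache.calcite.plan.hep.{HepPlanner, HepProgramBuilder}
    import org.apache.calcite.rel.RelNode
    import org.apache.calcite.rel.rules.FilterMergeRule
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Stubs for the hard part: translating Catalyst plans/expressions
    // to Calcite RelNodes/RexNodes and back.
    object SparkToCalcite { def convert(plan: LogicalPlan): RelNode = ??? }
    object CalciteToSpark { def convert(rel: RelNode): LogicalPlan = ??? }

    // A Catalyst rule that round-trips the logical plan through a
    // Calcite planner. A HepPlanner with a single rewrite rule stands in
    // here for whatever rule set / cost-based planning you want Calcite
    // to perform.
    object CalciteOptimize extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = {
        val rel = SparkToCalcite.convert(plan)
        val program = new HepProgramBuilder()
          .addRuleInstance(FilterMergeRule.INSTANCE)
          .build()
        val planner = new HepPlanner(program)
        planner.setRoot(rel)
        CalciteToSpark.convert(planner.findBestExp())
      }
    }

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // Runs CalciteOptimize after Catalyst's built-in optimizer batches.
    spark.experimental.extraOptimizations = Seq(CalciteOptimize)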
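And to show what a converter has to produce on the Calcite side, here is a toy example using Calcite's RelBuilder to build the kind of LogicalAggregate that a Spark GROUP BY maps to (the column names are made up, and an inline VALUES node keeps it self-contained):

    import org.apache.calcite.plan.RelOptUtil
    import org.apache.calcite.tools.{Frameworks, RelBuilder}

    val config = Frameworks.newConfigBuilder()
      .defaultSchema(Frameworks.createRootSchema(true))
      .build()
    val b = RelBuilder.create(config)

    // Roughly what "SELECT deptno, SUM(sal) ... GROUP BY deptno" becomes:
    // a LogicalAggregate on top of its input.
    val rel = b
      .values(Array("deptno", "sal"),
        Int.box(10), Int.box(1000),
        Int.box(20), Int.box(2000))
      .aggregate(b.groupKey("deptno"),
        b.sum(false, "total_sal", b.field("sal")))
      .build()

    println(RelOptUtil.toString(rel))

Grouping sets are a good example of why the conversion is hard: Calcite models them as a single Aggregate carrying multiple group sets, whereas Spark plans them with an Expand node, so the mapping is not one-to-one.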

> On Jan 18, 2017, at 12:25 PM, Riccardo Tommasini <[email protected]> wrote:
>
> Hello,
> I'm trying to understand how to use the spark adapter.
>
> Does anyone have any example?
>
> Thanks in advance
>
> Riccardo Tommasini
> Master Degree Computer Science
> PhD Student at Politecnico di Milano (Italy)
> streamreasoning.org<http://streamreasoning.org/>
>
> Submitted from an iPhone, I apologise for typos.