So, AFAIK the Spark adapter inside Calcite is in an unusable state right now. It is still built against Spark 1.x, and the last time I tried it I couldn't get it to run. It probably needs either to be removed or completely rewritten. But I can certainly offer some guidance on working with Spark and Calcite.

As we were discussing on the other thread, I've been doing research at my company on optimizing Spark queries with Calcite. It may or may not be open sourced sometime in the near future; I don't know yet. There are really two ways to go about optimizing Spark queries with Calcite.

The first option is the approach the current code in Calcite takes: run Calcite on RDDs. The code you see in Calcite appears to predate Spark SQL, or at least to have been developed as an alternative to it. It lets you run Calcite SQL queries on Spark by converting optimized Calcite plans into Spark RDD operations, using RDD methods for relational expressions and Calcite's Enumerables for row expressions.
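For reference, this is roughly how that adapter is meant to be invoked: through Calcite's JDBC driver with the "spark" connection property set (a real setup would also point a "model" property at a schema definition). A minimal sketch, and given the state of the adapter I wouldn't expect it to actually run today:

    import java.sql.DriverManager

    // "spark=true" asks Calcite to execute its optimized plans as Spark
    // RDD operations instead of the default Enumerable interpreter.
    val connection = DriverManager.getConnection("jdbc:calcite:spark=true")
    val statement = connection.createStatement()
    // An inline VALUES query keeps the example self-contained; a real
    // schema would come from a model file.
    val resultSet = statement.executeQuery(
      "select * from (values (1, 'a'), (2, 'b')) as t (x, y)")
    while (resultSet.next()) {
      println(s"${resultSet.getInt(1)} ${resultSet.getString(2)}")
    }
    connection.close()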
Alternatively, what we wanted to do when we started our project was integrate Calcite directly into Spark SQL. Spark SQL/DataFrames/Datasets are widely used APIs, and we wanted to see whether we could apply Calcite's significantly better optimization techniques to Spark's plans without breaking those APIs. That's the second way to go about it. What we did is essentially implement a custom Optimizer (a Spark interface) that converts Spark logical plans to Calcite logical plans, uses Calcite to optimize them, and then converts the result back to Spark. In effect, this is a complete replacement of the optimization phase of Catalyst (Spark's optimizer). Converting from Spark plans to Calcite plans and back is admittedly a major challenge, though; it has taken months to perfect for more complex expressions like aggregations and grouping sets.

So the two options really are: replace Spark SQL with Calcite, or integrate Calcite into Spark SQL. The former is a fairly straightforward use case for Calcite. The latter requires a deep understanding of both Calcite's and Spark's relational algebra, and writing algorithms to convert between the two. But I can say that it has been very successful. We've been able to improve Spark's performance quite significantly on all different types of data, including flat files, and have seen 1-2 orders of magnitude improvements in Spark's performance against databases like Postgres, Redshift, Mongo, etc. in TPC-DS benchmarks.
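To give a flavor of the second approach, here is a minimal sketch. It hooks a Calcite round trip into Catalyst via Spark 2.x's spark.experimental.extraOptimizations, which only appends rules to Catalyst's own optimizer batches; fully replacing the Optimizer, as we did, means supplying your own session internals, for which there is no stable public hook. The two converter objects are stubs, and writing them is where the months of effort go:

    import org.apache.calcite.plan.hep.{HepPlanner, HepProgramBuilder}
    import org.apache.calcite.rel.RelNode
    import org.apache.calcite.rel.rules.FilterMergeRule
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Stubs for the hard part: translating Catalyst plans/expressions
    // to Calcite RelNodes/RexNodes and back.
    object SparkToCalcite { def convert(plan: LogicalPlan): RelNode = ??? }
    object CalciteToSpark { def convert(rel: RelNode): LogicalPlan = ??? }

    // A Catalyst rule that round-trips the logical plan through a
    // Calcite planner. A HepPlanner with a single rewrite rule stands in
    // here for whatever rule set / cost-based planning you want Calcite
    // to perform.
    object CalciteOptimize extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = {
        val rel = SparkToCalcite.convert(plan)
        val program = new HepProgramBuilder()
          .addRuleInstance(FilterMergeRule.INSTANCE)
          .build()
        val planner = new HepPlanner(program)
        planner.setRoot(rel)
        CalciteToSpark.convert(planner.findBestExp())
      }
    }

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    // Runs CalciteOptimize after Catalyst's built-in optimizer batches.
    spark.experimental.extraOptimizations = Seq(CalciteOptimize)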
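And to show what a converter has to produce on the Calcite side, here is a toy example using Calcite's RelBuilder to build the kind of LogicalAggregate that a Spark GROUP BY maps to (the column names are made up, and an inline VALUES node keeps it self-contained):

    import org.apache.calcite.plan.RelOptUtil
    import org.apache.calcite.tools.{Frameworks, RelBuilder}

    val config = Frameworks.newConfigBuilder()
      .defaultSchema(Frameworks.createRootSchema(true))
      .build()
    val b = RelBuilder.create(config)

    // Roughly what "SELECT deptno, SUM(sal) ... GROUP BY deptno" becomes:
    // a LogicalAggregate on top of its input.
    val rel = b
      .values(Array("deptno", "sal"),
        Int.box(10), Int.box(1000),
        Int.box(20), Int.box(2000))
      .aggregate(b.groupKey("deptno"),
        b.sum(false, "total_sal", b.field("sal")))
      .build()

    println(RelOptUtil.toString(rel))

Grouping sets are a good example of why the conversion is hard: Calcite models them as a single Aggregate carrying multiple group sets, whereas Spark plans them with an Expand node, so the mapping is not one-to-one.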

> On Jan 18, 2017, at 12:25 PM, Riccardo Tommasini <[email protected]> wrote:
>
> Hello,
> I'm trying to understand how to use the spark adapter.
>
> Does anyone have any example?
>
> Thanks in advance
>
> Riccardo Tommasini
> Master Degree Computer Science
> PhD Student at Politecnico di Milano (Italy)
> streamreasoning.org<http://streamreasoning.org/>
>
> Submitted from an iPhone, I apologise for typos.