Very nice, Eli. I've logged https://issues.apache.org/jira/browse/CALCITE-1598 to track. I trust you're aware of the existing Piglet module (though that's the different end of the Pig).
On Fri, Jan 20, 2017 at 5:41 PM, Eli Levine <[email protected]> wrote:

> I second Julian's statement on the value of Calcite supporting adapters
> to distributed compute engines, such as Spark.
>
> FWIW, I've been tinkering with a Pig adapter for Calcite [1]. Early days
> still, but the adapter already supports generating Pig-specific RelNode
> trees and converting them to Pig Latin scripts. Would appreciate any
> feedback on the approach. The best way to see the supported behavior is
> by looking at the test cases in PigAdapterTest.java [2]. The code still
> needs to be cleaned up to conform to Calcite's checkstyle and Java
> version requirements.
>
> CC'ing Daniel Dai, who has shown interest in Pig adapter work in the past.
>
> [1] https://github.com/apache/calcite/compare/master...elilevine:master
> [2] https://github.com/elilevine/calcite/blob/master/pig/src/test/java/org/apache/calcite/test/PigAdapterTest.java
>
> Thanks,
>
> Eli
>
> On Fri, Jan 20, 2017 at 4:41 PM, Julian Hyde <[email protected]> wrote:
>
>> I agree with Jacques.
>>
>> Jordan, if your company decides to open-source your work, I would be
>> willing - nay, delighted - to replace Calcite's existing Spark adapter.
>> As you say, the current adapter is unusable. There is considerable
>> appetite for a real distributed compute engine for Calcite (not
>> counting Drill and Hive, because they embed Calcite, not the other way
>> around), and that would convert into people using, hardening and
>> extending your code.
>>
>> Julian
>>
>> On Fri, Jan 20, 2017 at 4:24 PM, Jacques Nadeau <[email protected]> wrote:
>>
>>> Jordan, super interesting work you've shared. It would be very cool to
>>> get this incorporated back into Spark mainline. That would continue to
>>> broaden Calcite's reach :)
>>>
>>> On Fri, Jan 20, 2017 at 1:36 PM, [email protected]
>>> <[email protected]> wrote:
>>>
>>>> So, AFAIK the Spark adapter that's inside Calcite is in an unusable
>>>> state right now.
>>>> It's still using Spark 1.x, and the last time I tried it I couldn't
>>>> get it to run. It probably needs to either be removed or completely
>>>> rewritten. But I can certainly offer some guidance on working with
>>>> Spark and Calcite.
>>>>
>>>> As we were discussing on the other thread, I've been doing research
>>>> on optimizing Spark queries with Calcite at my company. It may or may
>>>> not be open-sourced some time in the near future; I don't know yet.
>>>>
>>>> So, there are really a couple of ways to go about optimizing Spark
>>>> queries using Calcite. The first option is the approach the current
>>>> code in Calcite takes: use Calcite on RDDs. The code that you see in
>>>> Calcite seems likely to have been developed before Spark SQL existed,
>>>> or at least as an alternative to Spark SQL. It allows you to run
>>>> Calcite SQL queries on Spark by converting optimized Calcite plans
>>>> into Spark RDD operations, using RDD methods for relational
>>>> expressions and Calcite's Enumerables for row expressions.
>>>>
>>>> Alternatively, what we wanted to do when we started our project was
>>>> integrate Calcite directly into Spark SQL. Spark
>>>> SQL/DataFrames/Datasets are widely used APIs, and we wanted to see
>>>> whether we could apply Calcite's significantly better optimization
>>>> techniques to Spark's plans without breaking the API. So that's the
>>>> second way to go about it. What we did was essentially implement a
>>>> custom Optimizer (a Spark interface) that converted Spark logical
>>>> plans to Calcite logical plans, used Calcite to optimize the plan,
>>>> and then converted from Calcite back to Spark. Essentially, this is a
>>>> complete replacement of the optimization phase of Catalyst (Spark's
>>>> optimizer).
>>>> But converting from Spark plans to Calcite plans and back is
>>>> admittedly a major challenge that has taken months to perfect for
>>>> more complex expressions like aggregations/grouping sets.
>>>>
>>>> So the two options are really: replace Spark SQL with Calcite, or
>>>> integrate Calcite into Spark SQL. The former is a fairly
>>>> straightforward use case for Calcite. The latter requires a deep
>>>> understanding of both Calcite's and Spark's relational algebra, and
>>>> writing algorithms to convert between the two. But I can say that it
>>>> has been very successful. We've been able to improve Spark's
>>>> performance quite significantly on all different types of data -
>>>> including flat files - and have seen 1-2 orders of magnitude
>>>> improvements in Spark's performance against databases like Postgres,
>>>> Redshift, Mongo, etc. in TPC-DS benchmarks.
>>>>
>>>>> On Jan 18, 2017, at 12:25 PM, Riccardo Tommasini
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Hello,
>>>>> I'm trying to understand how to use the Spark adapter.
>>>>>
>>>>> Does anyone have any example?
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Riccardo Tommasini
>>>>> Master Degree Computer Science
>>>>> PhD Student at Politecnico di Milano (Italy)
>>>>> streamreasoning.org<http://streamreasoning.org/>
>>>>>
>>>>> Submitted from an iPhone, I apologise for typos.
