Very nice, Eli. I've logged https://issues.apache.org/jira/browse/CALCITE-1598 to track. I trust you're aware of the existing Piglet module (though that's the different end of the Pig).
On Fri, Jan 20, 2017 at 5:41 PM, Eli Levine <[email protected]> wrote:

> I second Julian's statement on the value of Calcite supporting adapters
> to distributed compute engines, such as Spark.
>
> FWIW, I've been tinkering with a Pig adapter for Calcite [1]. Early days
> still, but the adapter already supports generating Pig-specific RelNode
> trees and converting them to Pig Latin scripts. Would appreciate any
> feedback on the approach. The best way to see the supported behavior is
> by looking at the test cases in PigAdapterTest.java [2]. The code still
> needs to be cleaned up to conform to Calcite's checkstyle and Java
> version requirements.
>
> CC'ing Daniel Dai, who has shown interest in Pig adapter work in the past.
>
> [1] https://github.com/apache/calcite/compare/master...elilevine:master
> [2] https://github.com/elilevine/calcite/blob/master/pig/src/test/java/org/apache/calcite/test/PigAdapterTest.java
>
> Thanks,
>
> Eli
>
> On Fri, Jan 20, 2017 at 4:41 PM, Julian Hyde <[email protected]> wrote:
>
>> I agree with Jacques.
>>
>> Jordan, if your company decides to open-source your work, I would be
>> willing - nay, delighted - to replace Calcite's existing Spark adapter.
>> As you say, the current adapter is unusable. There is considerable
>> appetite for a real distributed compute engine for Calcite (not
>> counting Drill and Hive, because they embed Calcite, not the other way
>> around), and that would convert into people using, hardening and
>> extending your code.
>>
>> Julian
>>
>> On Fri, Jan 20, 2017 at 4:24 PM, Jacques Nadeau <[email protected]> wrote:
>>
>>> Jordan, super interesting work you've shared. It would be very cool to
>>> get this incorporated back into Spark mainline. That would continue to
>>> broaden Calcite's reach :)
>>>
>>> On Fri, Jan 20, 2017 at 1:36 PM, [email protected]
>>> <[email protected]> wrote:
>>>
>>>> So, AFAIK the Spark adapter that's inside Calcite is in an unusable
>>>> state right now.
>>>> It's still using Spark 1.x, and the last time I tried it I couldn't
>>>> get it to run. It probably needs to either be removed or completely
>>>> rewritten. But I can certainly offer some guidance on working with
>>>> Spark and Calcite.
>>>>
>>>> As we were discussing on the other thread, I've been doing research
>>>> on optimizing Spark queries with Calcite at my company. It may or may
>>>> not be open-sourced some time in the near future; I don't know yet.
>>>>
>>>> So, there are really a couple of ways to go about optimizing Spark
>>>> queries using Calcite. The first option is the approach the current
>>>> code in Calcite takes: use Calcite on RDDs. The code that you see in
>>>> Calcite seems likely to have been developed before Spark SQL existed,
>>>> or at least as an alternative to Spark SQL. It allows you to run
>>>> Calcite SQL queries on Spark by converting optimized Calcite plans
>>>> into Spark RDD operations, using RDD methods for relational
>>>> expressions and Calcite's Enumerables for row expressions.
>>>>
>>>> Alternatively, what we wanted to do when we started our project was
>>>> integrate Calcite directly into Spark SQL. Spark
>>>> SQL/DataFrames/Datasets are widely used APIs, and we wanted to see
>>>> whether we could apply Calcite's significantly better optimization
>>>> techniques to Spark's plans without breaking the API. So that's the
>>>> second way to go about it. What we did was essentially implement a
>>>> custom Optimizer (a Spark interface) that converted Spark logical
>>>> plans to Calcite logical plans, used Calcite to optimize the plan,
>>>> and then converted from Calcite back to Spark. Essentially, this is a
>>>> complete replacement of the optimization phase of Catalyst (Spark's
>>>> optimizer).
>>>> But converting from Spark plans to Calcite plans and back is
>>>> admittedly a major challenge that has taken months to perfect for
>>>> more complex expressions like aggregations/grouping sets.
>>>>
>>>> So the two options are really: replace Spark SQL with Calcite, or
>>>> integrate Calcite into Spark SQL. The former is a fairly
>>>> straightforward use case for Calcite. The latter requires a deep
>>>> understanding of both Calcite's and Spark's relational algebra, and
>>>> writing algorithms to convert between the two. But I can say that it
>>>> has been very successful. We've been able to improve Spark's
>>>> performance quite significantly on all different types of data -
>>>> including flat files - and have seen 1-2 orders of magnitude
>>>> improvements in Spark's performance against databases like Postgres,
>>>> Redshift, Mongo, etc. in TPC-DS benchmarks.
>>>>
>>>>> On Jan 18, 2017, at 12:25 PM, Riccardo Tommasini
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Hello,
>>>>> I'm trying to understand how to use the Spark adapter.
>>>>>
>>>>> Does anyone have any example?
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Riccardo Tommasini
>>>>> Master Degree Computer Science
>>>>> PhD Student at Politecnico di Milano (Italy)
>>>>> streamreasoning.org<http://streamreasoning.org/>
>>>>>
>>>>> Submitted from an iPhone, I apologise for typos.
