Thanks for the update, Jordan. I’m still hopeful that it will happen. There is a lot of demand for Calcite-on-Spark (and apparently people willing to contribute) and this code donation would focus those efforts.

Julian

> On May 31, 2017, at 3:42 PM, Jordan Halterman <[email protected]> wrote:
>
> I left DataScience a few months ago to go lead distributed systems work at
> the Open Networking Laboratory. Originally, the week I left they said they
> intended to open source it, but that never happened. I no longer have control
> over the project, but I’m still in touch with my old partner in crime, Jason
> Slepicka. DataScience has been hesitant to open source it for financial
> reasons that I disagree with, but Jason recently mentioned they’re warming to
> the idea again. Hopefully that will happen soon. Sitting on it for so long
> isn’t helping, but it’s out of my hands for now.
>
> Jordan
>
>> On May 31, 2017, at 10:44 AM, Khai Tran <[email protected]> wrote:
>>
>> Hi Jordan,
>> Just want to check if you guys have any plan to contribute back the work of
>> converting back and forth between Calcite and Spark/Catalyst plans?
>>
>> Thanks,
>> Khai
>>
>> On Thu, Feb 16, 2017 at 3:42 PM, [email protected] wrote:
>> Calcite differs from Catalyst in many ways. First of all, Catalyst is
>> essentially a heuristic optimizer, while Calcite optimizers often combine
>> heuristics and cost-based optimization. Catalyst pushes down predicates and
>> projections to most data sources, while Calcite can often push down full
>> queries. It's certainly also capable of pushing down filters for struct
>> fields. Some of these types of features like SPARK-19609 may have to be
>> implemented as custom rules. But we've successfully replaced Spark's
>> Catalyst optimizer with Calcite and have recorded up to two orders of
>> magnitude improvements in performance running TPC-DS queries against many
>> databases.
>>
>> Whether there's value in using Calcite in Spark depends on your use case.
>> Drill and other systems are certainly sufficient to take better advantage of
>> the features of underlying databases. It's not easy to build the conversions
>> between Catalyst plans and Calcite plans - it took us months - but doing so
>> allowed us to continue using Spark's popular programmatic APIs while
>> significantly improving its performance when querying relational databases,
>> Mongo, etc.
>>
>>> On Feb 16, 2017, at 3:28 PM, Nick Dimiduk <[email protected]> wrote:
>>>
>>> Heya,
>>>
>>> I've been using Spark recently and have stumbled across a couple surprising
>>> bugs/feature gaps. It got me curious about how Calcite would handle the
>>> same scenarios. Basically, I'm wondering if Calcite would handle these
>>> scenarios directly or if it would defer to the underlying runtime. I.E.,
>>> would I be better off for this task with Calcite via Hive or Drill vs.
>>> Catalyst via Spark.
>>>
>>> Here are the tickets for reference.
>>>
>>> SPARK-19615 Provide Dataset union convenience for divergent schema
>>> SPARK-19609 Broadcast joins should pushdown join constraints as Filter to
>>>             the larger relation
>>> SPARK-19638 Filter pushdown not working for struct fields
>>>
>>> Thanks in advance!
>>> Nick
