Re: Calcite vs Catalyst

Julian Hyde Wed, 31 May 2017 15:48:07 -0700

Thanks for the update, Jordan. I’m still hopeful that it will happen. There is 
a lot of demand for Calcite-on-Spark (and apparently people willing to 
contribute) and this code donation would focus those efforts.


Julian


> On May 31, 2017, at 3:42 PM, Jordan Halterman <[email protected]> 
> wrote:
> 
> I left DataScience a few months ago to go lead distributed systems work at 
> the Open Networking Laboratory. Originally, the week I left they said they 
> intended to open source it, but that never happened. I no longer have control 
> over the project, but I’m still in touch with my old partner in crime, Jason 
> Slepicka. DataScience has been hesitant to open source it for financial 
> reasons that I disagree with, but Jason recently mentioned they’re warming to 
> the idea again. Hopefully that will happen soon. Sitting on it for so long 
> isn’t helping, but it’s out of my hands for now.
> 
> Jordan
>> On May 31, 2017, at 10:44 AM, Khai Tran <[email protected]> wrote:
>> 
>> Hi Jordan,
>> Just want to check if you guys have any plan to contribute back the work of 
>> converting back and forth between Calcite and Spark/Catalyst plans?
>> 
>> Thanks,
>> Khai
>> 
>> On Thu, Feb 16, 2017 at 3:42 PM, [email protected] wrote:
>> Calcite differs from Catalyst in many ways. First of all, Catalyst is 
>> essentially a heuristic optimizer, while Calcite optimizers often combine 
>> heuristics and cost-based optimization. Catalyst pushes down predicates and 
>> projections to most data sources, while Calcite can often push down full 
>> queries. Calcite is certainly also capable of pushing down filters for struct 
>> fields. Some of these types of features like SPARK-19609 may have to be 
>> implemented as custom rules. But we've successfully replaced Spark's 
>> Catalyst optimizer with Calcite and have recorded up to two orders of 
>> magnitude improvements in performance running TPC-DS queries against many 
>> databases.
>> 
>> Whether there's value in using Calcite in Spark depends on your use case. 
>> Drill and other systems are certainly able to take better advantage of 
>> the features of underlying databases. It's not easy to build the conversions 
>> between Catalyst plans and Calcite plans - it took us months - but doing so 
>> allowed us to continue using Spark's popular programmatic APIs while 
>> significantly improving its performance when querying relational databases, 
>> Mongo, etc.
>> 
>>> On Feb 16, 2017, at 3:28 PM, Nick Dimiduk <[email protected]> wrote:
>>> 
>>> Heya,
>>> 
>>> I've been using Spark recently and have stumbled across a couple surprising
>>> bugs/feature gaps. It got me curious about how Calcite would handle the
>>> same scenarios. Basically, I'm wondering if Calcite would handle these
>>> scenarios directly or if it would defer to the underlying runtime. I.e.,
>>> would I be better off for this task with Calcite via Hive or Drill vs.
>>> Catalyst via Spark.
>>> 
>>> Here are the tickets for reference.
>>> 
>>> SPARK-19615 Provide Dataset union convenience for divergent schema
>>> SPARK-19609 Broadcast joins should pushdown join constraints as Filter to
>>> the larger relation
>>> SPARK-19638 Filter pushdown not working for struct fields
>>> 
>>> Thanks in advance!
>>> Nick
>> 
>
