Re: Calcite vs Catalyst

[email protected] Thu, 16 Feb 2017 15:42:56 -0800

Calcite differs from Catalyst in many ways. First of all, Catalyst is 
essentially a heuristic optimizer, while Calcite optimizers often combine 
heuristics and cost-based optimization. Catalyst pushes down predicates and 
projections to most data sources, while Calcite can often push down full 
queries. Calcite is certainly also capable of pushing down filters for struct 
fields. Some features of this kind, such as SPARK-19609, may have to be 
implemented as custom rules. But we've successfully replaced Spark's Catalyst 
optimizer with Calcite and have recorded performance improvements of up to two 
orders of magnitude running TPC-DS queries against many databases.
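
To make the heuristic versus cost-based distinction concrete, here is a toy 
sketch in plain Python. It uses neither Calcite's nor Catalyst's actual APIs: 
the plan classes, the cost model, the rule names, and the table sizes are all 
invented for illustration. The heuristic path always applies one rewrite 
(filter pushdown), while the cost-based path enumerates a few equivalent plans 
and keeps the cheapest under an assumed cost function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str
    rows: int


@dataclass(frozen=True)
class Filter:
    child: object
    table: str          # table whose column the predicate references
    selectivity: float  # fraction of input rows kept


@dataclass(frozen=True)
class Join:
    left: object        # probe side
    right: object       # build side (charged double in the toy cost model)
    selectivity: float


def rows(plan):
    """Estimated output row count under the toy model."""
    if isinstance(plan, Scan):
        return plan.rows
    if isinstance(plan, Filter):
        return rows(plan.child) * plan.selectivity
    return rows(plan.left) * rows(plan.right) * plan.selectivity


def cost(plan):
    """Toy cost: rows touched by each operator; join build rows count twice."""
    if isinstance(plan, Scan):
        return plan.rows
    if isinstance(plan, Filter):
        return cost(plan.child) + rows(plan.child)
    return (cost(plan.left) + cost(plan.right)
            + rows(plan.left) + 2 * rows(plan.right))


def push_filter(plan):
    """Heuristic rule: always move a Filter over a Join onto the scan it references."""
    if isinstance(plan, Filter) and isinstance(plan.child, Join):
        join = plan.child
        if isinstance(join.left, Scan) and join.left.table == plan.table:
            pushed = Filter(join.left, plan.table, plan.selectivity)
            return Join(pushed, join.right, join.selectivity)
        if isinstance(join.right, Scan) and join.right.table == plan.table:
            pushed = Filter(join.right, plan.table, plan.selectivity)
            return Join(join.left, pushed, join.selectivity)
    return plan


def swap_join(plan):
    """Equivalent alternative: exchange the probe and build sides of the join."""
    if isinstance(plan, Join):
        return Join(plan.right, plan.left, plan.selectivity)
    if isinstance(plan, Filter):
        return Filter(swap_join(plan.child), plan.table, plan.selectivity)
    return plan


def cost_based(plan):
    """Enumerate a few equivalent plans and keep the cheapest one."""
    pushed = push_filter(plan)
    candidates = [plan, pushed, swap_join(plan), swap_join(pushed)]
    return min(candidates, key=cost)


# Hypothetical tables and statistics, chosen only to make the difference visible.
customers = Scan("customers", 10_000)
orders = Scan("orders", 1_000_000)
original = Filter(Join(customers, orders, 1 / 10_000), "customers", 0.01)

heuristic_plan = push_filter(original)
best_plan = cost_based(original)
print(cost(original), cost(heuristic_plan), cost(best_plan))
```

In this toy setup the heuristic rule lowers the cost by pushing the filter 
down, but only the cost-based search also moves the small filtered input to 
the build side of the join.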

Whether there's value in using Calcite in Spark depends on your use case. If 
all you need is to take better advantage of the features of the underlying 
databases, Drill and other systems are certainly sufficient. It's not easy to 
build the conversions 
between Catalyst plans and Calcite plans - it took us months - but doing so 
allowed us to continue using Spark's popular programmatic APIs while 
significantly improving its performance when querying relational databases, 
Mongo, etc.

> On Feb 16, 2017, at 3:28 PM, Nick Dimiduk <[email protected]> wrote:
> 
> Heya,
> 
> I've been using Spark recently and have stumbled across a couple surprising
> bugs/feature gaps. It got me curious about how Calcite would handle the
> same scenarios. Basically, I'm wondering if Calcite would handle these
> scenarios directly or if it would defer to the underlying runtime. I.E.,
> would I be better off for this task with Calcite via Hive or Drill vs.
> Catalyst via Spark.
> 
> Here are the tickets for reference.
> 
> SPARK-19615 Provide Dataset union convenience for divergent schema
> SPARK-19609 Broadcast joins should pushdown join constraints as Filter to
> the larger relation
> SPARK-19638 Filter pushdown not working for struct fields
> 
> Thanks in advance!
> Nick
