Calcite differs from Catalyst in many ways. First of all, Catalyst is essentially a heuristic optimizer, while Calcite optimizers often combine heuristics and cost-based optimization. Catalyst pushes down predicates and projections to most data sources, while Calcite can often push down full queries. It's certainly also capable of pushing down filters for struct fields. Some of these types of features like SPARK-19609 may have to be implemented as custom rules. But we've successfully replaced Spark's Catalyst optimizer with Calcite and have recorded up to two orders of magnitude improvements in performance running TPC-DS queries against many databases.
Whether there's value in using Calcite in Spark depends on your use case. Drill and other systems are certainly sufficient to take better advantage of the features of underlying databases. It's not easy to build the conversions between Catalyst plans and Calcite plans - it took us months - but doing so allowed us to continue using Spark's popular programmatic APIs while significantly improving its performance when querying relational databases, Mongo, etc. > On Feb 16, 2017, at 3:28 PM, Nick Dimiduk <[email protected]> wrote: > > Heya, > > I've been using Spark recently and have stumbled across a couple surprising > bugs/feature gaps. It got me curious about how Calcite would handle the > same scenarios. Basically, I'm wondering if Calcite would handle these > scenarios directly or if it would defer to the underlying runtime. I.E., > would I be better off for this task with Calcite via Hive or Drill vs. > Catalyst via Spark. > > Here are the tickets for reference. > > SPARK-19615 Provide Dataset union convenience for divergent schema > SPARK-19609 Broadcast joins should pushdown join constraints as Filter to > the larger relation > SPARK-19638 Filter pushdown not working for struct fields > > Thanks in advance! > Nick
