Hi All,

Current Spark CBO implements a cost-based multi-way join reordering
algorithm based on the System-R paper [Access Path-SIGMOD'79]
<http://x86.cs.duke.edu/courses/spring03/cps216/papers/selinger-etal-1979.pdf>.
When building m-way joins, it uses a bottom-up approach: it puts all items
(basic joined nodes) into level 0, then builds all two-way joins at level 1
from plans at level 0 (single items), then builds all 3-way joins at level
2, etc. The algorithm considers all combinations, including left-deep
trees, bushy trees, and right-deep trees. It also prunes Cartesian-product
candidates.
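
To make that level-by-level enumeration concrete, here is a minimal,
self-contained Scala sketch of the bottom-up dynamic programming scheme.
All names in it (Plan, joinable, joinCost) are illustrative stubs, not the
actual Spark classes:

    // Minimal sketch of System-R style bottom-up join enumeration, keeping one
    // best plan per item subset. All names are illustrative stubs.
    object BottomUpJoinReorder {
      type ItemSet = Set[Int]                        // indices of base relations
      case class Plan(items: ItemSet, cost: Double)

      // Whether two sub-plans share at least one join condition; returning
      // false here is what prunes Cartesian-product candidates (stubbed).
      def joinable(left: Plan, right: Plan): Boolean = true

      // Cost of joining two sub-plans (stubbed; the real CBO uses statistics).
      def joinCost(left: Plan, right: Plan): Double = left.cost + right.cost + 1.0

      def reorder(items: Seq[Plan]): Plan = {
        val n = items.size
        // foundPlans(k) maps each item set of size k + 1 to its best plan.
        val foundPlans = Array.fill(n)(collection.mutable.Map.empty[ItemSet, Plan])
        items.foreach(p => foundPlans(0)(p.items) = p)   // level 0: single items

        for (level <- 1 until n) {
          // Build plans covering (level + 1) items from every split into two
          // smaller levels; this yields left-deep, right-deep and bushy trees.
          for {
            leftLevel <- 0 until level
            rightLevel = level - leftLevel - 1
            left  <- foundPlans(leftLevel).values
            right <- foundPlans(rightLevel).values
            if (left.items & right.items).isEmpty && joinable(left, right)
          } {
            val joined = Plan(left.items ++ right.items, joinCost(left, right))
            val known = foundPlans(level).get(joined.items)
            if (known.forall(_.cost > joined.cost)) foundPlans(level)(joined.items) = joined
          }
        }
        foundPlans(n - 1).values.minBy(_.cost)           // best plan covering all items
      }
    }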

However, we have found a number of *limitations* in the current CBO
implementation:
1. The current CBO is a rule in the logical phase; it outputs only one
logical plan to the physical phase, but we cannot guarantee that the plan
that is best in the logical phase is still the best after physical
optimization.

2. The current bottom-up approach keeps only one best plan per level, so we
cannot guarantee that the overall best plan can be assembled from the best
plan of each level.

3. The current cost formula is cost = weight * cardinality + (1.0 - weight)
* size, where the first term roughly corresponds to the CPU cost and the
second term roughly corresponds to the I/O cost. The formula is
oversimplified: it treats all join implementations the same way and does
not take shuffle and sort costs into consideration, even though Shuffle
Exchange is one of the heaviest physical operators in Spark SQL (see the
sketch after this list).

4. Equivalent join conditions are not supported. For example, (A join B
join C on a=b and b=c) could be reordered to (A join C join B on a=c and
c=b), which may be more efficient. But the current implementation never
derives the condition "a=c", so it treats "A join C" as a Cartesian product
and excludes it.
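
To illustrate point 3, below is a simplified, self-contained Scala sketch
of that two-term cost model (not the actual Spark source; PlanStats and the
numbers are made up, and `weight` is the relative weight given to
cardinality). Two plans with the same estimated cardinality and size get
the same cost even if one of them needs an extra Shuffle Exchange and sort
at the physical level:

    // Simplified sketch of the current two-term cost model. PlanStats and the
    // numbers are made up; `weight` is the relative weight given to cardinality.
    object CostModelSketch {
      case class PlanStats(cardinality: BigDecimal, size: BigDecimal)

      def cost(p: PlanStats, weight: Double): BigDecimal =
        p.cardinality * BigDecimal(weight) + p.size * BigDecimal(1.0 - weight)

      def main(args: Array[String]): Unit = {
        // Two candidates with identical cardinality and size get identical costs,
        // even if one of them needs an extra Shuffle Exchange and sort at the
        // physical level (e.g. sort-merge join vs. broadcast hash join).
        val withShuffle    = PlanStats(BigDecimal(1000000), BigDecimal(128L * 1024 * 1024))
        val withoutShuffle = PlanStats(BigDecimal(1000000), BigDecimal(128L * 1024 * 1024))
        assert(cost(withShuffle, weight = 0.7) == cost(withoutShuffle, weight = 0.7))
      }
    }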

The bottom-up approach originated in the System-R optimizer (1979). It
quickly became a standard, and many modern relational database optimizers
are “System-R style”, for example Oracle, PostgreSQL, MySQL, and DB2.

As time went by, new optimizer styles were invented: Volcano (1993) and
Cascades (1995). They are not as well known as System-R, but they are still
widely used in practice: Microsoft SQL Server, Greenplum Orca, Apache
Calcite. They implement a top-down transformational search algorithm and
provide an extensible optimization framework.

A top-down optimization framework can help us solve the above limitations,
since it has a more complete search space and combines the logical and
physical phases for more accurate cost estimation. As for the efficiency of
exploring all alternative plans, Cascades also provides pruning to keep the
search space manageable.
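
To illustrate the idea, here is a compact, self-contained Scala toy of a
top-down, memoized search with a branch-and-bound cost limit, in the spirit
of Cascades. Every name in it (Group, Plan, joinCost, ...) is invented for
illustration and the cost functions are stubs:

    // Toy sketch of top-down, memoized join-order search with a branch-and-bound
    // cost limit, in the spirit of Cascades. Cost functions are stubs.
    object TopDownSearch {
      type Group = Set[Int]                                 // a set of base relations
      case class Plan(group: Group, cost: Double)

      val memo = collection.mutable.Map.empty[Group, Plan]  // best known plan per group

      def scanCost(relation: Int): Double = 1.0                        // stubbed leaf cost
      def joinCost(l: Plan, r: Plan): Double = l.cost + r.cost + 1.0   // stubbed join cost

      // Returns the best plan for `group` with cost below `limit`, if any. The
      // limit is the branch-and-bound cut-off: a branch whose sub-plans already
      // cost more than the best complete plan found so far is abandoned early.
      def bestPlan(group: Group, limit: Double = Double.MaxValue): Option[Plan] =
        memo.get(group).filter(_.cost < limit).orElse {
          var best: Option[Plan] =
            if (group.size == 1) Some(Plan(group, scanCost(group.head))) else None

          // Enumerate proper, non-empty splits of the group (logical alternatives).
          // A real Cascades engine would also enumerate physical implementations
          // (broadcast, sort-merge, ...) of each alternative at this point.
          for (left <- group.subsets() if left.nonEmpty && left != group) {
            val bound = best.map(_.cost).getOrElse(limit)
            for {
              l <- bestPlan(left, bound)
              r <- bestPlan(group.diff(left), bound - l.cost)
              joined = Plan(group, joinCost(l, r))
              if best.forall(_.cost > joined.cost)
            } best = Some(joined)
          }

          val result = best.filter(_.cost < limit)
          result.foreach(p => memo(group) = p)
          result
        }
    }

For example, TopDownSearch.bestPlan(Set(0, 1, 2, 3)) returns the cheapest
plan covering all four relations while memoizing the best plan of every
sub-group it visits, and the limit parameter lets a branch be abandoned as
soon as it is provably worse than the best complete plan found so far,
which the level-by-level bottom-up enumeration cannot do.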

What about implementing *a new Cascades-style CBO for Spark SQL*?
It could be a new rule in the current "Planner" which reads the logical
plan produced by the heuristic rules and outputs the physical plan with the
least cost after applying join reordering and physical implementation
rules.
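
As a very rough sketch of what the entry point could look like (the trait
and method name below are hypothetical, not existing Spark APIs; only
LogicalPlan and SparkPlan are real Spark classes):

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.execution.SparkPlan

    // Hypothetical entry point for the proposed rule: unlike today's CBO rule,
    // which returns a single reordered LogicalPlan, it would search logical and
    // physical alternatives together and return the cheapest SparkPlan directly.
    trait CascadesStyleCBO {
      def plan(optimizedLogicalPlan: LogicalPlan): SparkPlan
    }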

Xiaoju Wu
Phone:+86 17717640807
