Re: Calcite's CBO

Lekshmi Thu, 07 Feb 2019 01:56:18 -0800

Dear Michael Mior,
   Thank you so much for the reply. But still, I'm confused with following


When I connected PostgreSQL with Calcite and execute TPCH queries, provided
data residing in the Postgres database, I got the cheapest plan like below
for TPCH query 1.

EnumerableProject(l_returnflag=[$0], l_linestatus=[$1], SUM_QTY=[CASE(=($3,
0), null, $2)], SUM_BASE_PRICE=[CASE(=($5, 0), null, $4)],
SUM_DISC_PRICE=[CASE(=($7, 0), null, $6)], SUM_CHARGE=[CASE(=($9, 0), null,
$8)], AVG_QTY=[/(CASE(=($3, 0), null, $2), $3)], AVG_PRICE=[/(CASE(=($5,
0), null, $4), $5)], AVG_DISC=[/(CASE(=($11, 0), null, $10), $11)],
COUNT_ORDER=[$12]): rowcount = 1.0, cumulative cost = {216.75 rows,
1870.344248356904 cpu, 0.0 io}, id = 316

  EnumerableLimit(fetch=[1]): rowcount = 1.0, cumulative cost = {215.75
rows, 1860.344248356904 cpu, 0.0 io}, id = 315

    EnumerableSort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]):
rowcount = 10.0, cumulative cost = {214.75 rows, 1859.344248356904 cpu, 0.0
io}, id = 314

      JdbcToEnumerableConverter: rowcount = 10.0, cumulative cost = {204.75
rows, 662.0 cpu, 0.0 io}, id = 313

        JdbcAggregate(group=[{0, 1}], SUM_QTY=[$SUM0($2)],
agg#1=[COUNT($2)], SUM_BASE_PRICE=[$SUM0($3)], agg#3=[COUNT($3)],
SUM_DISC_PRICE=[$SUM0($4)], agg#5=[COUNT($4)], SUM_CHARGE=[$SUM0($5)],
agg#7=[COUNT($5)], agg#8=[$SUM0($6)], agg#9=[COUNT($6)],
COUNT_ORDER=[COUNT()]): rowcount = 10.0, cumulative cost = {203.75 rows,
661.0 cpu, 0.0 io}, id = 312

          JdbcProject(l_returnflag=[$8], l_linestatus=[$9],
l_quantity=[$4], l_extendedprice=[$5], $f4=[*($5, -(1, $6))], $f5=[*(*($5,
-(1, $6)), +(1, $7))], l_discount=[$6]): rowcount = 100.0, cumulative cost
= {180.0 rows, 661.0 cpu, 0.0 io}, id = 311

            JdbcTableScan(table=[[TPCH, lineitem]]): rowcount = 100.0,
cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 0


But I didn't understand why I/O cost is always zero? And why row count is
not an integer? Is row count is not actually a count? What is an I/O cost?
Is it count of I/O read for required input or time taken for reading
required input? because i/o is also not an integer above.







Thanks and Regards

Lekshmi B.G
Email: [email protected]




On Thu, Feb 7, 2019 at 1:37 AM Michael Mior <[email protected]> wrote:

> You're correct that it's not a single value. That said, in practice,
> not much is actually done with. CPU or I/O cost. You'll see in
> TableScan that computeSelfCost returns a cost which really only makes
> use of the number of rows in the table. Individual adapters may
> override this, but few do.
> --
> Michael Mior
> [email protected]
>
> Le mer. 6 févr. 2019 à 05:22, Lekshmi <[email protected]> a écrit :
> >
> > Hi,
> >  In [0], they suggest, "the optimizer implementer can choose the cost to
> be
> > a number or record". Which does one Apache Calcite use? I found, it as a
> > record, when I run queries in debug mode.  Is that correct? Then can you
> > please define (rows, CPU, I/O) in Calcite? Also, when, we are connecting
> > calcite across multiple systems, then 'i/o' should be greater than '0.0',
> > right? Because it needs to read the results from the push-down
> operations?
> > Can you please explain these in a bit more detail? I appreciate your
> > support and suggestions.
> >
> >
> > Thanks and Regards
> >
> > Lekshmi B.G
> > Email: [email protected]
> >
> >
> >
> >
> > On Tue, Feb 5, 2019 at 11:30 PM Michael Mior <[email protected]> wrote:
> >
> > > Calcite's optimizer is based off the Volcano optimizer[0]. In that
> > > paper you'll find an outline of the algorithm which is basically
> > > equivalent to what Calcite uses. Adding multiple systems doesn't
> > > complicate things very much. The main addition used by Calcite is what
> > > we call a "convention" trait that allows the optimizer to deal with
> > > expressions across multiple systems. More details are available in a
> > > recently published paper on Calcite [1].
> > >
> > > One important caveat to note is that the cost model used is not likely
> > > to reflect the actual cost of query execution in many cases. It's
> > > generally "good enough" in that the ordering of plans by cost will be
> > > approximately correct. So although the optimal plan will be selected
> > > according to the cost model, the plan which is actually the best in
> > > practice may not be selected. That said, I would expect Calcite will
> > > pick a plan which is generally quite close to the optimal,  but we
> > > have no guarantee of this.
> > >
> > > [0]
> > >
> https://pdfs.semanticscholar.org/a817/a3e74d1663d9eb35b4baf3161ab16f57df85.pdf
> > > [1] https://arxiv.org/pdf/1802.10233.pdf
> > >
> > > --
> > > Michael Mior
> > > [email protected]
> > >
> > > Le mar. 5 févr. 2019 à 15:52, Lekshmi <[email protected]> a écrit
> :
> > > >
> > > > Hi,
> > > >    I would like to know about the Calcite CBO in detail, including
> how it
> > > > deals with global optimization when multiple processing systems are
> > > > associated with it. Any documentation, pointers are much appreciated.
> > > >
> > > >
> > > > Thanks and Regards
> > > >
> > > > Lekshmi B.G
> > > > Email: [email protected]
> > >
>

Re: Calcite's CBO

Reply via email to