The I/O cost is always zero because Calcite itself doesn't try to estimate I/O cost; as I mentioned in my previous reply, this can be overridden by adapters. The row count is an estimate that starts off as the estimated number of rows in the table and is then reduced by the estimated filter factors of the query. So if the estimated number of rows is 100 and filters are applied that remove an estimated 95% and 30% of rows, then the estimated number of rows produced by the query will be 100 × 0.05 × 0.70 = 3.5, which is why row counts are generally not integers.
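To make that arithmetic concrete, here is a minimal standalone sketch (this is not Calcite code; the 95% and 30% filter factors are just the hypothetical figures from the example above) of how filter factors compound into a fractional row-count estimate:

```java
public class RowCountEstimate {

    // Each filter keeps a fraction of its input rows; the estimated
    // output row count is the table's row count multiplied by the
    // retained fraction of each filter, so it is rarely an integer.
    static double estimateRows(double tableRows, double... removedFractions) {
        double rows = tableRows;
        for (double removed : removedFractions) {
            rows *= (1.0 - removed);  // a filter removing 95% keeps 5%
        }
        return rows;
    }

    public static void main(String[] args) {
        // 100 rows, filters removing an estimated 95% and 30%:
        // 100 * 0.05 * 0.70 = 3.5 estimated output rows
        System.out.println(estimateRows(100, 0.95, 0.30));  // prints 3.5
    }
}
```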
--
Michael Mior
[email protected]

On Thu, Feb 7, 2019 at 4:55 AM, Lekshmi <[email protected]> wrote:
>
> Dear Michael Mior,
> Thank you so much for the reply. But I'm still confused by the following.
>
> When I connected PostgreSQL to Calcite and executed the TPC-H queries
> against data residing in the Postgres database, I got the cheapest plan
> below for TPC-H query 1:
>
> EnumerableProject(l_returnflag=[$0], l_linestatus=[$1], SUM_QTY=[CASE(=($3, 0), null, $2)], SUM_BASE_PRICE=[CASE(=($5, 0), null, $4)], SUM_DISC_PRICE=[CASE(=($7, 0), null, $6)], SUM_CHARGE=[CASE(=($9, 0), null, $8)], AVG_QTY=[/(CASE(=($3, 0), null, $2), $3)], AVG_PRICE=[/(CASE(=($5, 0), null, $4), $5)], AVG_DISC=[/(CASE(=($11, 0), null, $10), $11)], COUNT_ORDER=[$12]): rowcount = 1.0, cumulative cost = {216.75 rows, 1870.344248356904 cpu, 0.0 io}, id = 316
>   EnumerableLimit(fetch=[1]): rowcount = 1.0, cumulative cost = {215.75 rows, 1860.344248356904 cpu, 0.0 io}, id = 315
>     EnumerableSort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]): rowcount = 10.0, cumulative cost = {214.75 rows, 1859.344248356904 cpu, 0.0 io}, id = 314
>       JdbcToEnumerableConverter: rowcount = 10.0, cumulative cost = {204.75 rows, 662.0 cpu, 0.0 io}, id = 313
>         JdbcAggregate(group=[{0, 1}], SUM_QTY=[$SUM0($2)], agg#1=[COUNT($2)], SUM_BASE_PRICE=[$SUM0($3)], agg#3=[COUNT($3)], SUM_DISC_PRICE=[$SUM0($4)], agg#5=[COUNT($4)], SUM_CHARGE=[$SUM0($5)], agg#7=[COUNT($5)], agg#8=[$SUM0($6)], agg#9=[COUNT($6)], COUNT_ORDER=[COUNT()]): rowcount = 10.0, cumulative cost = {203.75 rows, 661.0 cpu, 0.0 io}, id = 312
>           JdbcProject(l_returnflag=[$8], l_linestatus=[$9], l_quantity=[$4], l_extendedprice=[$5], $f4=[*($5, -(1, $6))], $f5=[*(*($5, -(1, $6)), +(1, $7))], l_discount=[$6]): rowcount = 100.0, cumulative cost = {180.0 rows, 661.0 cpu, 0.0 io}, id = 311
>             JdbcTableScan(table=[[TPCH, lineitem]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 0
>
> But I didn't understand why the I/O cost is always zero, and why the row
> count is not an integer. Is the row count not actually a count? And what is
> the I/O cost: a count of I/O reads for the required input, or the time
> taken to read that input? The I/O value above is also not an integer.
>
> Thanks and Regards
>
> Lekshmi B.G
> Email: [email protected]
>
> On Thu, Feb 7, 2019 at 1:37 AM Michael Mior <[email protected]> wrote:
> > You're correct that it's not a single value. That said, in practice,
> > not much is actually done with CPU or I/O cost. You'll see in
> > TableScan that computeSelfCost returns a cost which really only makes
> > use of the number of rows in the table. Individual adapters may
> > override this, but few do.
> > --
> > Michael Mior
> > [email protected]
> >
> > On Wed, Feb 6, 2019 at 5:22 AM, Lekshmi <[email protected]> wrote:
> > > Hi,
> > > In [0], they suggest that "the optimizer implementer can choose the
> > > cost to be a number or record". Which one does Apache Calcite use? I
> > > found it to be a record when I ran queries in debug mode. Is that
> > > correct? If so, can you please define (rows, CPU, I/O) in Calcite?
> > > Also, when we connect Calcite across multiple systems, shouldn't
> > > 'i/o' be greater than '0.0', because it needs to read the results
> > > from the push-down operations? Can you please explain these in a bit
> > > more detail? I appreciate your support and suggestions.
> > >
> > > Thanks and Regards
> > >
> > > Lekshmi B.G
> > > Email: [email protected]
> > >
> > > On Tue, Feb 5, 2019 at 11:30 PM Michael Mior <[email protected]> wrote:
> > > > Calcite's optimizer is based on the Volcano optimizer [0]. In that
> > > > paper you'll find an outline of the algorithm, which is basically
> > > > equivalent to what Calcite uses. Adding multiple systems doesn't
> > > > complicate things very much.
> > > > The main addition used by Calcite is what we call a "convention"
> > > > trait that allows the optimizer to deal with expressions across
> > > > multiple systems. More details are available in a recently
> > > > published paper on Calcite [1].
> > > >
> > > > One important caveat to note is that the cost model used is not
> > > > likely to reflect the actual cost of query execution in many cases.
> > > > It's generally "good enough" in that the ordering of plans by cost
> > > > will be approximately correct. So although the optimal plan will be
> > > > selected according to the cost model, the plan which is actually
> > > > the best in practice may not be selected. That said, I would expect
> > > > Calcite to pick a plan which is generally quite close to optimal,
> > > > but we have no guarantee of this.
> > > >
> > > > [0] https://pdfs.semanticscholar.org/a817/a3e74d1663d9eb35b4baf3161ab16f57df85.pdf
> > > > [1] https://arxiv.org/pdf/1802.10233.pdf
> > > >
> > > > --
> > > > Michael Mior
> > > > [email protected]
> > > >
> > > > On Tue, Feb 5, 2019 at 3:52 PM, Lekshmi <[email protected]> wrote:
> > > > > Hi,
> > > > > I would like to know about the Calcite CBO in detail, including
> > > > > how it deals with global optimization when multiple processing
> > > > > systems are associated with it. Any documentation or pointers
> > > > > are much appreciated.
> > > > >
> > > > > Thanks and Regards
> > > > >
> > > > > Lekshmi B.G
> > > > > Email: [email protected]
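To connect the numbers discussed in this thread, here is a minimal standalone sketch (these are not the actual Calcite classes, just a model of them) of the (rows, cpu, io) cost record. It mirrors the default table-scan self cost visible in the plan output quoted above, where JdbcTableScan reports {100.0 rows, 101.0 cpu, 0.0 io}, i.e. rowCount rows, rowCount + 1 cpu, and 0 io; cumulative cost then appears to be an operator's own cost plus that of its inputs:

```java
public class CostSketch {

    // Models the shape of Calcite's cost record: (rowCount, cpu, io).
    static final class Cost {
        final double rows, cpu, io;
        Cost(double rows, double cpu, double io) {
            this.rows = rows; this.cpu = cpu; this.io = io;
        }
        // Cumulative cost = an operator's self cost plus its inputs' costs.
        Cost plus(Cost other) {
            return new Cost(rows + other.rows, cpu + other.cpu, io + other.io);
        }
        @Override public String toString() {
            return "{" + rows + " rows, " + cpu + " cpu, " + io + " io}";
        }
    }

    // Default-style self cost for a scan: rowCount rows, rowCount + 1 cpu,
    // and 0 io -- which is why io stays at 0.0 unless an adapter overrides it.
    static Cost scanSelfCost(double tableRowCount) {
        return new Cost(tableRowCount, tableRowCount + 1, 0);
    }

    public static void main(String[] args) {
        Cost scan = scanSelfCost(100);
        System.out.println(scan);  // prints {100.0 rows, 101.0 cpu, 0.0 io}

        // Adding a hypothetical self cost of {80 rows, 560 cpu, 0 io} for the
        // project reproduces the JdbcProject line's cumulative cost above.
        System.out.println(new Cost(80, 560, 0).plus(scan));
        // prints {180.0 rows, 661.0 cpu, 0.0 io}
    }
}
```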
