Hi John,

I am on Hive-2.0.0. I forked of the Hive master branch  2 weeks back
(commit id: 763cb02b5eafb0ecd3fd0eb512636a1b092df671).

I actually have "analyze" before I execute the query. I left it out for
brevity. Please find the entire query (sent to hive in a file) below.
Without the analyze commands, I find that CBO optimization is ignored as
expected. Perhaps I am missing some configuration.

I print out the calcite optimized plans, using the "RelOptUtil.toString()"
helper on "calciteOptimizedPlan" at the end of "apply" function in
CalcitePlannerAction.


- Raajay



Query
=====

-- Set the hive configuration

-- clear out the existings tables
DROP TABLE tableA;
DROP TABLE tableB;
DROP TABLE tableC;
DROP TABLE output_tab;

-- create the tables and load the data
create external table tableA (a_day int, a_product string, a_alternate
string, a_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
AS TEXTFILE;
LOAD DATA LOCAL INPATH
'${hiveconf:CODE_DIR}/data/test/tableA_sc1000_nd7.txt' OVERWRITE INTO TABLE
tableA;

create external table tableB (b_day int, b_product string, b_alternate
string, b_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
AS TEXTFILE;
LOAD DATA LOCAL INPATH
'${hiveconf:CODE_DIR}/data/test/tableB_sc1000_nd7.txt' OVERWRITE INTO TABLE
tableB;

create external table tableC (c_day int, c_product string, c_alternate
string, c_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED
AS TEXTFILE;
LOAD DATA LOCAL INPATH
'${hiveconf:CODE_DIR}/data/test/tableC_sc1000_nd7.txt' OVERWRITE INTO TABLE
tableC;

-- compute statistics to be used by calcite CBO
analyze table tableA compute statistics for columns;
analyze table tableB compute statistics for columns;
analyze table tableC compute statistics for columns;

-- create output tables
create table output_tab (a_day int, a_product string, a_alternate string,
total_sales int);

-- the query
insert overwrite table output_tab
select
a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as total_sales
from
tableA a join tableB b
on a.a_day = b.b_day and a.a_product = b.b_product
join tableC c
on b.b_day = c.c_day and b.b_alternate = c.c_alternate;


On Mon, Aug 24, 2015 at 7:25 PM, John Pullokkaran <
jpullokka...@hortonworks.com> wrote:

> In addition to col stats you also need table stats.
>
> From: John Pullokkaran <jpullokka...@hortonworks.com>
> Reply-To: "u...@hive.apache.org" <u...@hive.apache.org>
> Date: Monday, August 24, 2015 at 5:23 PM
> To: "u...@hive.apache.org" <u...@hive.apache.org>
> Cc: "dev@hive.apache.org" <dev@hive.apache.org>
> Subject: Re: CBO - get cost of the plan
>
> Raajay,
>
>     You don’t have col stats hence it assumes 1 for row count.
> What version of Hive are you on?
>
> Thanks
> John
>
> From: Raajay <raaja...@gmail.com>
> Reply-To: "u...@hive.apache.org" <u...@hive.apache.org>
> Date: Monday, August 24, 2015 at 5:19 PM
> To: "u...@hive.apache.org" <u...@hive.apache.org>
> Cc: "dev@hive.apache.org" <dev@hive.apache.org>
> Subject: CBO - get cost of the plan
>
> Hello,
>
> I am interested to get the cost of the query plans as calculated by the
> CBO. How can I get that information ? For example, consider a query with a
> three way join of the following form:
>
> Query
> =====
>
> insert overwrite table output_tab
> select
> a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as total_sales
> from
> tableA a join tableB b
> on a.a_day = b.b_day and a.a_product = b.b_product
> join tableC c
> on b.b_day = c.c_day and b.b_alternate = c.c_alternate;
>
>
> The number of rows for tableA, tableB, and tableC are of the order of
> 10000. I believe, that by "analyzing columns" of all the tables Hive will
> have statistics regarding the number of rows, distinct values, etc.
> However, when I try to print out the operator tree as determined by the
> CalcitePlanner, I get the following output.
>
> Print out of the Operator Tree
> ======================
>
> HiveProject(a_day=[$4], a_product=[$5], b_alternate=[$2],
> total_sales=[+(+($6, $3), $9)]): rowcount =* 1.0*, cumulative cost = {4.0
> rows, 0.0 cpu, 0.0 io}, id = 150
>   HiveJoin(condition=[AND(=($0, $7), =($2, $8))], joinType=[inner],
> algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0,
> cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id = 148
>     HiveJoin(condition=[AND(=($4, $0), =($5, $1))], joinType=[inner],
> algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0,
> cumulative cost = {2.0 rows, 0.0 cpu, 0.0 io}, id = 143
>       HiveProject(b_day=[$0], b_product=[$1], b_alternate=[$2],
> b_sales=[$3]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0
> io}, id = 138
>         HiveTableScan(table=[[default.tableb]]): rowcount = 1.0,
> cumulative cost = {0}, id = 44
>       HiveProject(a_day=[$0], a_product=[$1], a_sales=[$3]): rowcount =
> 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 141
>         HiveTableScan(table=[[default.tablea]]): rowcount = 1.0,
> cumulative cost = {0}, id = 42
>     HiveProject(c_day=[$0], c_alternate=[$2], c_sales=[$3]): rowcount =
> 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146
>       HiveTableScan(table=[[default.tablec]]): rowcount = 1.0, cumulative
> cost = {0}, id = 47
>
>
> The number of rows as displayed here is 1.0, which is clearly not the
> correct value.
>
> - Raajay.
>
>
>

Reply via email to