Hi John, I am on Hive-2.0.0. I forked of the Hive master branch 2 weeks back (commit id: 763cb02b5eafb0ecd3fd0eb512636a1b092df671).
I actually have "analyze" before I execute the query. I left it out for brevity. Please find the entire query (sent to hive in a file) below. Without the analyze commands, I find that CBO optimization is ignored as expected. Perhaps I am missing some configuration. I print out the calcite optimized plans, using the "RelOptUtil.toString()" helper on "calciteOptimizedPlan" at the end of "apply" function in CalcitePlannerAction. - Raajay Query ===== -- Set the hive configuration -- clear out the existings tables DROP TABLE tableA; DROP TABLE tableB; DROP TABLE tableC; DROP TABLE output_tab; -- create the tables and load the data create external table tableA (a_day int, a_product string, a_alternate string, a_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; LOAD DATA LOCAL INPATH '${hiveconf:CODE_DIR}/data/test/tableA_sc1000_nd7.txt' OVERWRITE INTO TABLE tableA; create external table tableB (b_day int, b_product string, b_alternate string, b_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; LOAD DATA LOCAL INPATH '${hiveconf:CODE_DIR}/data/test/tableB_sc1000_nd7.txt' OVERWRITE INTO TABLE tableB; create external table tableC (c_day int, c_product string, c_alternate string, c_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; LOAD DATA LOCAL INPATH '${hiveconf:CODE_DIR}/data/test/tableC_sc1000_nd7.txt' OVERWRITE INTO TABLE tableC; -- compute statistics to be used by calcite CBO analyze table tableA compute statistics for columns; analyze table tableB compute statistics for columns; analyze table tableC compute statistics for columns; -- create output tables create table output_tab (a_day int, a_product string, a_alternate string, total_sales int); -- the query insert overwrite table output_tab select a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as total_sales from tableA a join tableB b on a.a_day = b.b_day and a.a_product = b.b_product join tableC c on b.b_day = c.c_day and b.b_alternate = c.c_alternate; On Mon, Aug 24, 2015 at 7:25 PM, John Pullokkaran < jpullokka...@hortonworks.com> wrote: > In addition to col stats you also need table stats. > > From: John Pullokkaran <jpullokka...@hortonworks.com> > Reply-To: "u...@hive.apache.org" <u...@hive.apache.org> > Date: Monday, August 24, 2015 at 5:23 PM > To: "u...@hive.apache.org" <u...@hive.apache.org> > Cc: "dev@hive.apache.org" <dev@hive.apache.org> > Subject: Re: CBO - get cost of the plan > > Raajay, > > You don’t have col stats hence it assumes 1 for row count. > What version of Hive are you on? > > Thanks > John > > From: Raajay <raaja...@gmail.com> > Reply-To: "u...@hive.apache.org" <u...@hive.apache.org> > Date: Monday, August 24, 2015 at 5:19 PM > To: "u...@hive.apache.org" <u...@hive.apache.org> > Cc: "dev@hive.apache.org" <dev@hive.apache.org> > Subject: CBO - get cost of the plan > > Hello, > > I am interested to get the cost of the query plans as calculated by the > CBO. How can I get that information ? For example, consider a query with a > three way join of the following form: > > Query > ===== > > insert overwrite table output_tab > select > a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as total_sales > from > tableA a join tableB b > on a.a_day = b.b_day and a.a_product = b.b_product > join tableC c > on b.b_day = c.c_day and b.b_alternate = c.c_alternate; > > > The number of rows for tableA, tableB, and tableC are of the order of > 10000. I believe, that by "analyzing columns" of all the tables Hive will > have statistics regarding the number of rows, distinct values, etc. > However, when I try to print out the operator tree as determined by the > CalcitePlanner, I get the following output. > > Print out of the Operator Tree > ====================== > > HiveProject(a_day=[$4], a_product=[$5], b_alternate=[$2], > total_sales=[+(+($6, $3), $9)]): rowcount =* 1.0*, cumulative cost = {4.0 > rows, 0.0 cpu, 0.0 io}, id = 150 > HiveJoin(condition=[AND(=($0, $7), =($2, $8))], joinType=[inner], > algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, > cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id = 148 > HiveJoin(condition=[AND(=($4, $0), =($5, $1))], joinType=[inner], > algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, > cumulative cost = {2.0 rows, 0.0 cpu, 0.0 io}, id = 143 > HiveProject(b_day=[$0], b_product=[$1], b_alternate=[$2], > b_sales=[$3]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 > io}, id = 138 > HiveTableScan(table=[[default.tableb]]): rowcount = 1.0, > cumulative cost = {0}, id = 44 > HiveProject(a_day=[$0], a_product=[$1], a_sales=[$3]): rowcount = > 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 141 > HiveTableScan(table=[[default.tablea]]): rowcount = 1.0, > cumulative cost = {0}, id = 42 > HiveProject(c_day=[$0], c_alternate=[$2], c_sales=[$3]): rowcount = > 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146 > HiveTableScan(table=[[default.tablec]]): rowcount = 1.0, cumulative > cost = {0}, id = 47 > > > The number of rows as displayed here is 1.0, which is clearly not the > correct value. > > - Raajay. > > >