Hello, I am interested to get the cost of the query plans as calculated by the CBO. How can I get that information ? For example, consider a query with a three way join of the following form:
Query ===== insert overwrite table output_tab select a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as total_sales from tableA a join tableB b on a.a_day = b.b_day and a.a_product = b.b_product join tableC c on b.b_day = c.c_day and b.b_alternate = c.c_alternate; The number of rows for tableA, tableB, and tableC are of the order of 10000. I believe, that by "analyzing columns" of all the tables Hive will have statistics regarding the number of rows, distinct values, etc. However, when I try to print out the operator tree as determined by the CalcitePlanner, I get the following output. Print out of the Operator Tree ====================== HiveProject(a_day=[$4], a_product=[$5], b_alternate=[$2], total_sales=[+(+($6, $3), $9)]): rowcount =* 1.0*, cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id = 150 HiveJoin(condition=[AND(=($0, $7), =($2, $8))], joinType=[inner], algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id = 148 HiveJoin(condition=[AND(=($4, $0), =($5, $1))], joinType=[inner], algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, cumulative cost = {2.0 rows, 0.0 cpu, 0.0 io}, id = 143 HiveProject(b_day=[$0], b_product=[$1], b_alternate=[$2], b_sales=[$3]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 138 HiveTableScan(table=[[default.tableb]]): rowcount = 1.0, cumulative cost = {0}, id = 44 HiveProject(a_day=[$0], a_product=[$1], a_sales=[$3]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 141 HiveTableScan(table=[[default.tablea]]): rowcount = 1.0, cumulative cost = {0}, id = 42 HiveProject(c_day=[$0], c_alternate=[$2], c_sales=[$3]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146 HiveTableScan(table=[[default.tablec]]): rowcount = 1.0, cumulative cost = {0}, id = 47 The number of rows as displayed here is 1.0, which is clearly not the correct value. - Raajay.