Dear all,

I am testing the efficiency of multi-way joins in Pig. To force the join
to execute, I run a count (the equivalent of COUNT(*)) over its result.
Since COUNT in Pig requires a GROUP operation first, I tried to make the
counting step cheaper by rewriting it. The join query is:

Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;
Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number),
                    catalog_returns BY (cr_item_sk, cr_order_number);
b = FOREACH Bad_OrderRes GENERATE
        cr_returned_date_sk, cr_returned_time_sk, cr_item_sk,
        cr_refunded_customer_sk, cs_sold_date_sk, cs_sold_time_sk,
        cs_item_sk, cs_order_number, inv_date_sk, inv_item_sk,
        inv_warehouse_sk, inv_quantity_on_hand;

*Original:*
b_group = GROUP b ALL;
b_count = FOREACH b_group GENERATE COUNT(b);
Dump b_count;

*New:*
ones = FOREACH b GENERATE 1 AS one:int;
counter_group = GROUP ones ALL;
log_count = FOREACH counter_group GENERATE COUNT(ones);
dump log_count;

Is this a good way to measure join efficiency while reducing the influence
of the count operation?

Thanks
Mingda
