Dear all,

I am testing the efficiency of multi-way joins in Pig. To make sure the joins are actually executed, I run a count(*) over the join result. Since a count in Pig requires a GROUP operation first, I tried to reduce the cost of that step by converting the following query:

Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;
Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk, cr_order_number);
b = FOREACH Bad_OrderRes GENERATE
    cr_returned_date_sk, cr_returned_time_sk, cr_item_sk, cr_refunded_customer_sk,
    cs_sold_date_sk, cs_sold_time_sk, cs_item_sk, cs_order_number,
    inv_date_sk, inv_item_sk, inv_warehouse_sk, inv_quantity_on_hand;

*Original:*

b_group = GROUP b ALL;
b_count = FOREACH b_group GENERATE COUNT(b);
DUMP b_count;

to:

*New:*

ones = FOREACH b GENERATE 1 AS one:int;
counter_group = GROUP ones ALL;
log_count = FOREACH counter_group GENERATE COUNT(ones);
DUMP log_count;

Is this a good way to test join efficiency while reducing the influence of the count operation?

Thanks,
Mingda