-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24876/
-----------------------------------------------------------
Review request for hive.


Bugs: HIVE-7664
    https://issues.apache.org/jira/browse/HIVE-7664


Repository: hive-git


Description
-------

In a group-by-heavy vectorized Reducer vertex, 25% of CPU time is spent in VectorizedBatchUtil.addRowToBatchFrom(). Looking at the code, addRowToBatchFrom() does not appear to be optimized for vectorized processing: it is called for every row, and for each row it calls getPrimitiveCategory() on every column in the batch to figure out that column's type, with the column types kept in a HashMap. For VectorGroupByOperator the column types do not change between batches, so they should not be looked up again for every row. I recommend storing the column types in the StructObjectInspector so that other components can leverage the same optimization. addRowToBatchFrom() also executes a case statement for every row and every column to do the type casting; I recommend encapsulating that type logic in templatized methods. (A rough sketch of both ideas is appended at the end of this request.)

{code}
Stack Trace                                              Sample Count  Percentage(%)
VectorizedBatchUtil.addRowToBatchFrom                    86            26.543
AbstractPrimitiveObjectInspector.getPrimitiveCategory()  34            10.494
LazyBinaryStructObjectInspector.getStructFieldData       25             7.716
StandardStructObjectInspector.getStructFieldData          4             1.235
{code}

The query used:

{code}
select ss_sold_date_sk
from store_sales
where ss_sold_date between '1998-01-01' and '1998-06-01'
group by ss_item_sk, ss_customer_sk, ss_sold_date_sk
having sum(ss_list_price) > 50000000000000;
{code}


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/ReduceRecordProcessor.java 2acd842
  ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorizedBatchUtil.java 16454e7

Diff: https://reviews.apache.org/r/24876/diff/


Testing
-------


Thanks,

Navis Ryu
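
To make the two recommendations above concrete, here is a minimal sketch, not the attached patch: CachedRowToBatchAdder is a hypothetical class, it assumes every column is primitive, it covers only three PrimitiveCategory values, it omits NULL handling, and it uses Java 8 lambdas for brevity; the actual change would more likely attach the cached types to the StructObjectInspector as described above.

{code}
import java.util.List;

import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.DoubleObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.io.Text;

/**
 * Hypothetical sketch (not the attached patch): resolve each column's type
 * once at setup instead of per row, and bind one type-specific assigner per
 * column so the per-row loop has no case statement and no HashMap lookup.
 */
public class CachedRowToBatchAdder {

  /** One casting strategy per column; chosen once, reused for every row. */
  private interface ColumnAssigner {
    void assign(VectorizedRowBatch batch, int rowInBatch, Object fieldData);
  }

  private final StructObjectInspector rowOI;
  private final ColumnAssigner[] assigners;

  public CachedRowToBatchAdder(StructObjectInspector rowOI) {
    this.rowOI = rowOI;
    List<? extends StructField> fields = rowOI.getAllStructFieldRefs();
    assigners = new ColumnAssigner[fields.size()];
    for (int i = 0; i < fields.size(); i++) {
      // Sketch assumes primitive columns only.
      PrimitiveObjectInspector oi =
          (PrimitiveObjectInspector) fields.get(i).getFieldObjectInspector();
      assigners[i] = createAssigner(i, oi); // the type switch runs here, once
    }
  }

  /** Hot path: no getPrimitiveCategory() and no case statement per row. */
  public void addRow(Object row, VectorizedRowBatch batch, int rowInBatch) {
    List<? extends StructField> fields = rowOI.getAllStructFieldRefs();
    for (int i = 0; i < assigners.length; i++) {
      Object fieldData = rowOI.getStructFieldData(row, fields.get(i));
      // NULL handling (ColumnVector.noNulls / isNull) omitted for brevity.
      assigners[i].assign(batch, rowInBatch, fieldData);
    }
  }

  // Covers three types for illustration; real code would handle every
  // PrimitiveCategory that addRowToBatchFrom() supports.
  private static ColumnAssigner createAssigner(final int col,
      PrimitiveObjectInspector oi) {
    switch (oi.getPrimitiveCategory()) {
      case LONG: {
        final LongObjectInspector loi = (LongObjectInspector) oi;
        return (batch, row, data) ->
            ((LongColumnVector) batch.cols[col]).vector[row] = loi.get(data);
      }
      case DOUBLE: {
        final DoubleObjectInspector doi = (DoubleObjectInspector) oi;
        return (batch, row, data) ->
            ((DoubleColumnVector) batch.cols[col]).vector[row] = doi.get(data);
      }
      case STRING: {
        final StringObjectInspector soi = (StringObjectInspector) oi;
        return (batch, row, data) -> {
          Text t = soi.getPrimitiveWritableObject(data);
          ((BytesColumnVector) batch.cols[col]).setVal(row, t.getBytes(), 0,
              t.getLength());
        };
      }
      default:
        throw new UnsupportedOperationException(
            "Sketch handles LONG/DOUBLE/STRING only: " + oi.getPrimitiveCategory());
    }
  }
}
{code}

With this shape, getPrimitiveCategory() and the casting switch run once per column at setup, and the per-row work is reduced to an array index plus a call through a per-column assigner.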