[ 
https://issues.apache.org/jira/browse/PIG-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4652:
------------------------------------
    Fix Version/s:     (was: 0.17.0)
                   0.18.0

> [Pig on Tez] Key Comparison is slower than mapreduce
> ----------------------------------------------------
>
>                 Key: PIG-4652
>                 URL: https://issues.apache.org/jira/browse/PIG-4652
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>             Fix For: 0.18.0
>
>
> Tez uses PigTupleSortComparator on both the map and reduce sides and in 
> POShuffleTezLoad.  MapReduce uses PigTupleWritableComparator on the map 
> and reduce sides to compare tuples, which is a byte-only comparison and 
> very fast.  It then uses PigGrouping<DataType>WritableComparator as the 
> grouping comparator to group those keys correctly. 
>   A similar approach (PigTupleWritableComparator for output and input and 
> PigTupleSortComparator in POShuffleTezLoad) is not possible in Tez without 
> new APIs in Tez to get the raw bytes of the keys. When we compare multiple 
> inputs for the min key in POShuffleTezLoad, their raw bytes need to be 
> compared to maintain the same order as the map side. In MapReduce, there 
> was only a single input and the MapReduce framework sorted everything 
> together. In Tez, the join inputs are sorted separately and the application 
> only gets the deserialized key, not its raw bytes. Tez KeyValuesReader needs 
> an API that also returns the bytes of the current key, which 
> POShuffleTezLoad can then use for min key comparison.
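
The ordering problem described above can be illustrated with a small, self-contained sketch. It is not Pig or Tez code: the class and method names are hypothetical, and 4-byte big-endian ints stand in for serialized tuple keys. Two inputs are each sorted by raw bytes and then merged by repeatedly picking the min key, once with the same raw comparator and once with a value-based comparator.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Pig or Tez.
class MinKeyMergeSketch {
    // Stand-in for key serialization: an int as 4 big-endian bytes.
    static byte[] ser(int v) { return ByteBuffer.allocate(4).putInt(v).array(); }
    static int de(byte[] b) { return ByteBuffer.wrap(b).getInt(); }

    // Byte-only comparison: unsigned, lexicographic, no deserialization.
    static final Comparator<byte[]> RAW = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff, y = b[i] & 0xff;
            if (x != y) return x - y;
        }
        return a.length - b.length;
    };

    // Value-based comparison: deserializes both keys first.
    static final Comparator<byte[]> LOGICAL =
        (a, b) -> Integer.compare(de(a), de(b));

    // Merge already-sorted inputs by repeatedly taking the min head key.
    static List<Integer> merge(List<List<byte[]>> inputs, Comparator<byte[]> cmp) {
        int[] pos = new int[inputs.size()];
        List<Integer> out = new ArrayList<>();
        while (true) {
            int min = -1;
            for (int i = 0; i < inputs.size(); i++) {
                if (pos[i] >= inputs.get(i).size()) continue; // input exhausted
                if (min < 0 || cmp.compare(inputs.get(i).get(pos[i]),
                                           inputs.get(min).get(pos[min])) < 0) {
                    min = i;
                }
            }
            if (min < 0) return out;
            out.add(de(inputs.get(min).get(pos[min]++)));
        }
    }

    // Two join inputs, each sorted separately by raw bytes.
    // Negative ints have a leading 0xFF byte, so they sort last.
    static List<List<byte[]>> sampleInputs() {
        List<byte[]> a = new ArrayList<>(Arrays.asList(ser(-3), ser(1)));
        List<byte[]> b = new ArrayList<>(Arrays.asList(ser(-3), ser(2)));
        a.sort(RAW);
        b.sort(RAW);
        return Arrays.asList(a, b);
    }

    public static void main(String[] args) {
        System.out.println(merge(sampleInputs(), RAW));     // [1, 2, -3, -3]
        System.out.println(merge(sampleInputs(), LOGICAL)); // [1, -3, 2, -3]
    }
}
```

With the raw comparator the two -3 keys come out adjacent, so a join sees them together. With the value-based comparator they are separated by key 2, because the merge order no longer matches the order in which each input was sorted.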



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
