[ https://issues.apache.org/jira/browse/PIG-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohini Palaniswamy updated PIG-4652:
------------------------------------
    Fix Version/s:     (was: 0.17.0)
                   0.18.0

> [Pig on Tez] Key Comparison is slower than mapreduce
> ----------------------------------------------------
>
>                 Key: PIG-4652
>                 URL: https://issues.apache.org/jira/browse/PIG-4652
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Rohini Palaniswamy
>             Fix For: 0.18.0
>
>
> Tez is using PigTupleSortComparator on both the map and reduce side and in
> POShuffleTezLoad. MapReduce is using PigTupleWritableComparator on the map
> and reduce side for comparing tuples, which is a byte-only comparison and
> very fast. It then uses PigGrouping<DataType>WritableComparator as the
> grouping comparator to correctly group those keys.
> It is not possible to use a similar method in Tez (PigTupleWritableComparator
> for output and input and PigTupleSortComparator in POShuffleTezLoad) without
> the addition of APIs in Tez to get the raw bytes of the keys. When we compare
> multiple inputs for the min key in POShuffleTezLoad, their raw bytes need to
> be compared to maintain the same order as the map side. In MapReduce, there
> was only a single input and the MapReduce framework sorted the keys together.
> In Tez, the join inputs are sorted separately and the application only gets
> the deserialized key. Tez KeyValuesReader needs APIs to get the bytes of the
> current key as well, which can be used in POShuffleTezLoad for min key
> comparison.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
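
The ordering problem described in the issue can be illustrated with a small, self-contained sketch. Nothing here is Pig or Tez code: the class MinKeyOrderSketch, the RAW and TYPED comparators, and the 4-byte big-endian int key encoding are all assumptions made for illustration. Two inputs are each sorted by raw bytes (as a byte-only comparator would do on the map side) and then merged by repeatedly picking the min key. If the min key is picked with a typed comparator instead of the raw bytes, equal keys may no longer arrive next to each other.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class MinKeyOrderSketch {
    // Hypothetical stand-in for a byte-only comparator:
    // unsigned lexicographic comparison of the serialized key.
    static final Comparator<byte[]> RAW = (x, y) -> {
        int n = Math.min(x.length, y.length);
        for (int k = 0; k < n; k++) {
            int c = Integer.compare(x[k] & 0xFF, y[k] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(x.length, y.length);
    };

    // Hypothetical stand-in for a typed comparator:
    // deserializes the key and compares the signed int values.
    static final Comparator<byte[]> TYPED =
        Comparator.comparingInt(b -> ByteBuffer.wrap(b).getInt());

    // Assumed key encoding: 4-byte big-endian int.
    static byte[] ser(int v) { return ByteBuffer.allocate(4).putInt(v).array(); }
    static int de(byte[] b) { return ByteBuffer.wrap(b).getInt(); }

    // One input, sorted by raw bytes as on the map side.
    static List<byte[]> sortedRaw(int... keys) {
        List<byte[]> out = new ArrayList<>();
        for (int k : keys) out.add(ser(k));
        out.sort(RAW);
        return out;
    }

    // Merge two separately sorted inputs by repeatedly taking the min key,
    // where "min" is decided by the given comparator.
    static List<Integer> merge(List<byte[]> a, List<byte[]> b, Comparator<byte[]> minKey) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            boolean takeA = j >= b.size()
                || (i < a.size() && minKey.compare(a.get(i), b.get(j)) <= 0);
            out.add(de(takeA ? a.get(i++) : b.get(j++)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<byte[]> left = sortedRaw(-1, 2);   // raw-byte order: 2, -1
        List<byte[]> right = sortedRaw(-1, 1);  // raw-byte order: 1, -1
        System.out.println(merge(left, right, RAW));   // [1, 2, -1, -1]
        System.out.println(merge(left, right, TYPED)); // [1, -1, 2, -1]
    }
}
```

With the raw comparator the two -1 keys come out adjacent, so they can be grouped. With the typed comparator they are split by the key 2, which is the kind of mismatch the issue says raw key bytes from the reader would avoid.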