[ https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747635#comment-13747635 ]
Edward Capriolo commented on HIVE-3562: --------------------------------------- {quote}1. Is it possible to use one DataStructure (TreeSet probably) for two implementations of ReduceSinkHash. If so, that will simplify this code quite a bit.{quote} I think we are better off using the MinMaxPriorityQueue http://highscalability.com/blog/2013/5/22/strategy-stop-using-linked-lists.html Structures that use Node's are really inefficient in java. The pointers bloat the structure, and the non-local memory access and causes a lot of cache misses. I benched a similar scenario sorting large tree sets, and even though the big O is smaller (believe it or not) to sort a list then maintain a treeSet. MinMaxPriorityQueue has a middle-ground implementation which real world works much faster then TreeSet. {code} import java.util.ArrayList; import java.util.Collection; import java.util.Collections; import java.util.LinkedList; import java.util.List; import java.util.Random; import java.util.TreeSet; public class TreeSetTest { static Random r = new Random(); public static void main(String [] args){ doIt(500000,new TreeSet<Double>()); doIt(500000,new TreeSet<Double>()); doIt(500000,new TreeSet<Double>()); doIt(500000,new TreeSet<Double>()); { List<Double> arrayList = new ArrayList<Double>(); doIt(500000,arrayList); sortIt(arrayList); } { List<Double> arrayList = new ArrayList<Double>(); doIt(500000,arrayList); sortIt(arrayList); } { List<Double> arrayList = new LinkedList<Double>(); doIt(500000,arrayList); sortIt(arrayList); } { List<Double> arrayList = new ArrayList<Double>(); doIt(500000,arrayList); sortIt(arrayList); } { List<Double> arrayList = new ArrayList<Double>(); doIt(500000,arrayList); sortIt(arrayList); } doIt(500000,new TreeSet<Double>()); doIt(500000,new TreeSet<Double>()); { List<Double> arrayList = new LinkedList<Double>(); doIt(500000,arrayList); sortIt(arrayList); } } public static void sortIt(List l){ long start = System.currentTimeMillis(); Collections.sort(l); long end = System.currentTimeMillis(); System.out.println(l.getClass().getName()+" sort it "+ (end-start)); } public static void doIt(int n, Collection l){ long start = System.currentTimeMillis(); for (int i =0;i<n;i++){ randomDouble(l); } long end = System.currentTimeMillis(); System.out.println(l.getClass().getName()+" add it "+ (end-start)); } private static void randomDouble(Collection l){ int rangeMin=0; int rangeMax=20000000; double randomValue = rangeMin + (rangeMax - rangeMin) * r.nextDouble(); l.add(randomValue); } } {code} > Some limit can be pushed down to map stage > ------------------------------------------ > > Key: HIVE-3562 > URL: https://issues.apache.org/jira/browse/HIVE-3562 > Project: Hive > Issue Type: Bug > Reporter: Navis > Assignee: Navis > Priority: Trivial > Attachments: HIVE-3562.D5967.1.patch, HIVE-3562.D5967.2.patch, > HIVE-3562.D5967.3.patch, HIVE-3562.D5967.4.patch, HIVE-3562.D5967.5.patch, > HIVE-3562.D5967.6.patch > > > Queries with limit clause (with reasonable number), for example > {noformat} > select * from src order by key limit 10; > {noformat} > makes operator tree, > TS-SEL-RS-EXT-LIMIT-FS > But LIMIT can be partially calculated in RS, reducing size of shuffling. > TS-SEL-RS(TOP-N)-EXT-LIMIT-FS -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira