[ https://issues.apache.org/jira/browse/HIVE-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617407#comment-14617407 ]
Kevin Wilfong commented on HIVE-11188:
--------------------------------------

We did something like this in DWRF:
https://github.com/facebook/hive-dwrf/commit/c9205c3894cb04453a790de28270d8118a87101d
https://github.com/facebook/hive-dwrf/commit/1e920729a0b3a6887194a54a070e2471fac947d2

> Make ORCFile's String Dictionary more efficient
> -----------------------------------------------
>
>                 Key: HIVE-11188
>                 URL: https://issues.apache.org/jira/browse/HIVE-11188
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 1.2.0, 1.1.0
>            Reporter: Zheng Shao
>
> Currently, ORCFile's String Dictionary uses StringRedBlackTree for
> adding/finding/sorting duplicate strings. When there are a large number of
> unique strings (say over 16K) and a large number of rows (say 1M), the
> binary search will take O(1M * log(16K)) time, which can be very long.
>
> Alternatively, ORCFile's String Dictionary can use a HashMap for
> adding/finding duplicate strings, and use quicksort at the end to produce a
> sorted order. In the same case above, the total time spent will be
> O(1M + 16K * log(16K)), which is much faster.
>
> When the number of unique strings is close to the number of rows (say, both
> around 1M), ORC will automatically disable the dictionary encoding. The old
> approach will take O(1M * log(1M)), while our new approach will take O(1M),
> since we can skip the final quicksort if the dictionary encoding is
> disabled.
>
> So in either case, the new approach should be a win.
>
> Here is a PMP output based on ~600 traces (so 126 means 126/600 ~= 21% of
> total time). It's a query like "INSERT OVERWRITE TABLE target SELECT * FROM
> src" using hive-1.1.0-cdh-5.4.1. The target TABLE is STORED AS ORC, and the
> src TABLE is STORED AS RCFILE.
>     126 org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.compareValue(StringRedBlackTree.java:67)
>      35 java.util.zip.Deflater.deflateBytes(Native Method)
>      26 org.apache.hadoop.hive.ql.io.orc.SerializationUtils.findClosestNumBits(SerializationUtils.java:218)
>      24 org.apache.hadoop.hive.serde2.lazy.LazyNonPrimitive.isNull(LazyNonPrimitive.java:63)
>      22 org.apache.hadoop.hive.serde2.lazy.LazyMap.parse(LazyMap.java:204)
>      22 org.apache.hadoop.hive.serde2.lazy.LazyLong.parseLong(LazyLong.java:116)
>      21 org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:111)
>      19 org.apache.hadoop.hive.serde2.lazy.LazyPrimitive.hashCode(LazyPrimitive.java:57)
>      18 org.apache.hadoop.hive.ql.io.orc.RedBlackTree.getRight(RedBlackTree.java:99)
>      16 org.apache.hadoop.hive.ql.io.RCFile$Reader.getCurrentRow(RCFile.java:1932)
>      15 org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
>      15 org.apache.hadoop.hive.ql.io.orc.WriterImpl$IntegerTreeWriter.write(WriterImpl.java:929)
>      12 org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:1607)
>      12 org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
>      11 org.apache.hadoop.hive.ql.io.orc.RedBlackTree.getLeft(RedBlackTree.java:92)
>      11 org.apache.hadoop.hive.ql.io.orc.DynamicIntArray.add(DynamicIntArray.java:105)
>      10 org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
> ...

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
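The HashMap-plus-deferred-sort idea described in the ticket can be sketched as follows. This is a hypothetical illustration, not the DWRF or Hive code linked above; the class and method names are invented. Each add() is an O(1) expected hash lookup instead of an O(log K) red-black tree search, and the single O(K log K) sort happens only once at flush time (and is skipped entirely if dictionary encoding gets disabled):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

// Hypothetical sketch of a hash-based string dictionary.
class HashDictionary {
    private final HashMap<String, Integer> ids = new HashMap<>();
    private final List<String> byInsertOrder = new ArrayList<>();

    // O(1) expected per row, vs O(log K) per row for a red-black tree.
    // Returns the dictionary id assigned in insertion order.
    int add(String value) {
        Integer id = ids.get(value);
        if (id == null) {
            id = byInsertOrder.size();
            ids.put(value, id);
            byInsertOrder.add(value);
        }
        return id;
    }

    int size() {
        return ids.size();
    }

    // At stripe flush: sort the K unique strings once (O(K log K)) and
    // return a mapping from insertion-order id to sorted-order id, so
    // the row ids recorded during add() can be rewritten. If dictionary
    // encoding is disabled, this step is simply never called.
    int[] sortAndRemap() {
        List<String> sorted = new ArrayList<>(byInsertOrder);
        sorted.sort(null); // natural String ordering
        HashMap<String, Integer> sortedIds = new HashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            sortedIds.put(sorted.get(i), i);
        }
        int[] remap = new int[byInsertOrder.size()];
        for (int i = 0; i < remap.length; i++) {
            remap[i] = sortedIds.get(byInsertOrder.get(i));
        }
        return remap;
    }
}
```

With N rows and K unique strings this gives the O(N + K log K) total the ticket describes, since the per-row cost no longer depends on K.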