[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714795#comment-13714795 ]
Owen O'Malley commented on HIVE-4123: ------------------------------------- More comments: * I don't see why bitpack reader/writer are more than static methods that read/write to the underlying stream. So I would have expected a method like writeInts(long[] data, int offset, int length, int numBits, OutputStream stream) and the corresponding one for reading. * Utils.bytesToLongBE should take an input stream rather than a byte[]. * In IntegerCompressionReader: ** I'd write a method to translate the int into an opcode rather than use ordinal. ** It is probably worth remembering that you are in a repeat, so that you don't need to copy the value N times in short repeat. ** It may be easier to loop through the base values and then run through the patches. You might even do three loops: unpack the main values, unpack the patches, add the base to each value. ** For patched based only the base is zigzag encoded. The rest of the values are always positive. ** For delta only the base and base delta are zigzag encoded. * In IntegerCompressionWriter: ** You should give more comments about the patched base encoding. ** Instead of sorting for the percentiles, you could keep a count of how many values use each number of bits. ** Replace the commented out printlns with LOG.debug surrounded by LOG.ifDebugEnabled ** flush should use if/then/else to prevent writing the data twice ** the constructor should probably call clear rather than risk having the default values be different ** in write, just copy the data with system.arraycopy instead of cloning the array ** write should track whether the values are monotonically increasing or decreasing so that we know if delta applies ** there is a lot of duplication of effort in determine encoding ** if the sequence is both increasing and decreasing, it is constant and we should either use short literal or delta depending on the length ** delta encoding should return before doing the percentile work ** * How much unit test coverage do you have of the new code? * Have you run the encoder/decoder round trip over the github data to test it? > The RLE encoding for ORC can be improved > ---------------------------------------- > > Key: HIVE-4123 > URL: https://issues.apache.org/jira/browse/HIVE-4123 > Project: Hive > Issue Type: New Feature > Components: File Formats > Reporter: Owen O'Malley > Assignee: Prasanth J > Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, > ORC-Compression-Ratio-Comparison.xlsx > > > The run length encoding of integers can be improved: > * tighter bit packing > * allow delta encoding > * allow longer runs -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira