[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570888#comment-13570888 ]
Kevin Wilfong commented on HIVE-3874:
-------------------------------------

@Owen: Here are a couple more issues I ran into. Again, I can file JIRAs for these later, once the code is checked in.

Incorrect deserialization of doubles (leads to a lot of NaNs)
https://reviews.facebook.net/D8379

Strings are written incorrectly when they span two chunks of a DynamicByteArray.
E.g. if the original string is 'abcdefghi', the string written in the ORC file may be 'abcdefabc'.
https://reviews.facebook.net/D8385

> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>
>                 Key: HIVE-3874
>                 URL: https://issues.apache.org/jira/browse/HIVE-3874
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: hive.3874.2.patch, OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to
> enable push-down filters to skip entire row groups.
> * the type of the rows isn't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
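
To illustrate the chunk-spanning string issue mentioned in the comment above, here is a minimal Java sketch of writing a byte range out of a chunked buffer. It is not Hive's DynamicByteArray and does not reproduce the actual fix in D8385; the class ChunkedBytes, its methods, and the chunk size are all hypothetical names chosen for illustration. The point it shows is that when a range crosses a chunk boundary, the copy loop has to advance to the next chunk and restart at that chunk's first byte. A loop that gets this bookkeeping wrong can emit the wrong bytes for the tail of the string, which is the kind of symptom described ('abcdefabc' instead of 'abcdefghi').

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical chunked byte buffer for illustration; NOT Hive's DynamicByteArray. */
final class ChunkedBytes {
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private int length = 0;

    ChunkedBytes(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    /** Appends bytes, allocating new chunks as needed; returns the start offset. */
    int append(byte[] src) {
        int start = length;
        for (byte b : src) {
            int chunkIndex = length / chunkSize;
            if (chunkIndex == chunks.size()) {
                chunks.add(new byte[chunkSize]);
            }
            chunks.get(chunkIndex)[length % chunkSize] = b;
            length++;
        }
        return start;
    }

    /** Writes bytes [offset, offset + len) to out, crossing chunk boundaries. */
    void write(ByteArrayOutputStream out, int offset, int len) {
        if (offset < 0 || len < 0 || offset + len > length) {
            throw new IndexOutOfBoundsException("range outside buffer");
        }
        int chunkIndex = offset / chunkSize;
        int posInChunk = offset % chunkSize;
        int remaining = len;
        while (remaining > 0) {
            // Copy only what is left in the current chunk.
            int n = Math.min(remaining, chunkSize - posInChunk);
            out.write(chunks.get(chunkIndex), posInChunk, n);
            remaining -= n;
            chunkIndex++;    // move on to the next chunk ...
            posInChunk = 0;  // ... and start from its first byte
        }
    }

    /** Convenience helper: reads a range back as a UTF-8 string. */
    String readString(int offset, int len) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(out, offset, len);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}

final class ChunkedBytesDemo {
    public static void main(String[] args) {
        ChunkedBytes buf = new ChunkedBytes(8);  // assumed chunk size
        buf.append("xx".getBytes(StandardCharsets.UTF_8));
        // 'abcdef' lands in the first chunk, 'ghi' in the second.
        int start = buf.append("abcdefghi".getBytes(StandardCharsets.UTF_8));
        System.out.println(buf.readString(start, 9));  // prints "abcdefghi"
    }
}
```

A round-trip check like the one in main, with strings deliberately placed across a chunk boundary, is a cheap regression test for this class of bug.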