[ https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267781#comment-15267781 ]
Sergey Shelukhin commented on HIVE-9660:
----------------------------------------

Hmm. I see, the main difference is that one could track the finished RGs and record the length at the end based on stream position, instead of tracking all the length changes attributed to the RG while it's active... This would change the set-of-active-RGs to a set-of-just-finished-RGs (of which there can still be several per CB, or RL block), and move the tracking logic around to different places.

The dictionary stuff will still have to be there, because the direct and dictionary flushes each write streams that are separated into RGs out of sync with the main writer (data+length for direct, data for dictionary).

I am not sure if it's worth it at this point... I could change the existing patch to do that, or do it in a separate JIRA later. If you want to do it from scratch, that also works ;)

> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>
>                 Key: HIVE-9660
>                 URL: https://issues.apache.org/jira/browse/HIVE-9660
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch,
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch,
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch,
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch,
> HIVE-9660.10.patch, HIVE-9660.11.patch, HIVE-9660.patch, HIVE-9660.patch
>
>
> Right now the end offset is estimated, which in some cases results in tons of
> extra data being read.
> We can add a separate array to RowIndex (positions_v2?) that stores number of
> compressed buffers for each RG, or end offset, or something, to remove this
> estimation magic
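
For illustration only, here is a minimal sketch of the "just-finished RGs" idea from the comment: RGs that finish while a compression buffer is still open are only counted, and their end offset is recorded from the stream position once that buffer is flushed. The class and method names (RowGroupEndTracker, finishRowGroup, flushCompressedBuffer) are hypothetical and are not taken from the patch or from the ORC writer. The sketch covers a single stream and ignores the out-of-sync direct/dictionary flush case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Hive/ORC code: records the end offset of the
// compressed data for each row group (RG) of a single stream.
final class RowGroupEndTracker {
  // End offset per RG, in the order the RGs finished.
  private final List<Long> endOffsets = new ArrayList<>();
  // RGs that finished since the last compressed buffer (CB) was flushed.
  private int justFinished = 0;
  // Total compressed bytes written to the stream so far.
  private long streamPosition = 0;

  // Called at an RG boundary; the RG's data may still sit in an open CB.
  void finishRowGroup() {
    justFinished++;
  }

  // Called when a CB of the given compressed size is written out.
  // Every RG that finished inside this CB ends at the new stream position.
  void flushCompressedBuffer(long compressedBytes) {
    streamPosition += compressedBytes;
    for (int i = 0; i < justFinished; i++) {
      endOffsets.add(streamPosition);
    }
    justFinished = 0;
  }

  // Number of RGs whose end offset is already known.
  int recordedRowGroups() {
    return endOffsets.size();
  }

  // Recorded end offset of the compressed data for the given RG.
  long endOffset(int rg) {
    return endOffsets.get(rg);
  }
}
```

With something like this, a reader could stop at the recorded end offset for an RG instead of estimating it, at the cost of one extra value per RG in the index. Several RGs finishing inside the same CB simply share one end offset.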