[jira] [Commented] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

Sergey Shelukhin (JIRA) Mon, 02 May 2016 17:10:28 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267781#comment-15267781
 ]


Sergey Shelukhin commented on HIVE-9660:
----------------------------------------

Hmm, I see. The main difference is that one could track the finished RGs and 
record the length at the end based on stream position, instead of tracking all 
the length changes attributed to the RG while it's active. This would change 
the set-of-active-RGs to a set-of-just-finished-RGs (of which there can still be 
several per CB, or RL block), and move the tracking logic around to different 
places. The dictionary handling would still have to be there, because the 
direct and dictionary flushes each write streams that are split into RGs out of 
sync with the main writer (data+length for direct, data for dictionary). I am 
not sure it's worth it at this point. I could change the existing patch to 
do that, or do it in a separate JIRA later. If you want to do it from scratch, 
that also works ;)
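
To make the "set-of-just-finished-RGs" idea concrete, here is a rough sketch, not the patch itself. It assumes a single stream, models a CB as a fixed-size buffer compressed with zlib, and ignores ORC's real chunk headers, RL blocks, and the dictionary/direct case mentioned above. All names (RgEndOffsetTracker, finish_row_group, end_offsets) are made up for illustration and are not from the ORC writer.

```python
import zlib


class RgEndOffsetTracker:
    """Toy single-stream writer (hypothetical, not ORC's actual writer)."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.buffer = bytearray()      # uncompressed bytes of the open CB
        self.compressed = bytearray()  # compressed stream written so far
        self.just_finished = []        # RGs whose last bytes sit in the open CB
        self.end_offsets = {}          # RG id -> end offset in compressed stream

    def write(self, data):
        self.buffer.extend(data)
        while len(self.buffer) >= self.block_size:
            self._flush_block(self.block_size)

    def finish_row_group(self, rg_id):
        if self.buffer:
            # The RG's tail is in a CB that is not written yet; its end
            # offset is only known once that CB is flushed.
            self.just_finished.append(rg_id)
        else:
            # Nothing buffered: the RG ends at the current stream position.
            self.end_offsets[rg_id] = len(self.compressed)

    def close(self):
        if self.buffer:
            self._flush_block(len(self.buffer))

    def _flush_block(self, size):
        chunk = bytes(self.buffer[:size])
        del self.buffer[:size]
        self.compressed.extend(zlib.compress(chunk))
        # Every RG that finished inside this CB ends where the CB ends.
        for rg_id in self.just_finished:
            self.end_offsets[rg_id] = len(self.compressed)
        self.just_finished.clear()


writer = RgEndOffsetTracker(block_size=8)
writer.write(b"aaa")
writer.finish_row_group(0)
writer.write(b"bbb")
writer.finish_row_group(1)   # RG 0 and RG 1 both end inside the first CB
writer.write(b"cccc")        # fills the CB; both RGs get the same end offset
writer.finish_row_group(2)
writer.close()
```

Nothing is tracked per RG while it is active; the only state is the list of RGs waiting for the open CB to be written, which is why several RGs can share one end offset.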

> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>
>                 Key: HIVE-9660
>                 URL: https://issues.apache.org/jira/browse/HIVE-9660
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, 
> HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, 
> HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, 
> HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch, 
> HIVE-9660.10.patch, HIVE-9660.11.patch, HIVE-9660.patch, HIVE-9660.patch
>
>
> Right now the end offset is estimated, which in some cases results in tons of 
> extra data being read.
> We can add a separate array to RowIndex (positions_v2?) that stores the number of 
> compressed buffers for each RG, or the end offset, or something similar, to remove 
> this estimation magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
