[ https://issues.apache.org/jira/browse/HIVE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240701#comment-14240701 ]
Jihong Liu commented on HIVE-8966:
----------------------------------

Alan,

Your idea is very good, but there is an issue: we should only run this "compactable" test on the most recent delta, not on all deltas. Here is an example of why. Assume there are two deltas:

1. delta_00011_00020 - this delta has an open transaction batch
2. delta_00021_00030 - this delta has no open transaction batch; all are closed

The first delta has an open transaction batch, the second does not, and the second is the most recent delta. This case is possible, especially when multiple threads write to the same partition. If we ignore the first delta, the compaction will succeed and create a base such as base_00030. The cleaner will then delete both deltas, since their transaction ids are less than or equal to the base transaction id. Thus the data in delta 1 (delta_00011_00020, which was never compacted into the base) will be lost.

This is why we should test only the most recent delta; all the other deltas automatically stay in the list. In this case the compaction will fail, since the "flush_length" file is there, and it will succeed only once all transaction batches are closed. Although this is not perfect, at least no data is lost. Since the delta files and transaction ids involved in a compaction are not saved anywhere, this is probably the only solution for now.

In my removeNotCompactableDeltas() method, we first sort the deltas and then check only the last one. But the name "removeNotCompactableDeltas" is not good and easily causes confusion; it would be clearer to name it removeLastDeltaIfNotCompactable() (a rough sketch of that idea follows the quoted issue below).

Thanks

> Delta files created by hive hcatalog streaming cannot be compacted
> ------------------------------------------------------------------
>
>                 Key: HIVE-8966
>                 URL: https://issues.apache.org/jira/browse/HIVE-8966
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>    Affects Versions: 0.14.0
>         Environment: hive
>            Reporter: Jihong Liu
>            Assignee: Alan Gates
>            Priority: Critical
>             Fix For: 0.14.1
>
>         Attachments: HIVE-8966.2.patch, HIVE-8966.patch
>
>
> Hive hcatalog streaming also creates a file named bucket_n_flush_length in
> each delta directory, where "n" is the bucket number. compactor.CompactorMR
> thinks this file also needs to be compacted, but of course it cannot be, so
> compactor.CompactorMR will not continue with the compaction.
> In a test, after the bucket_n_flush_length file was removed, the "alter table
> partition compact" statement finished successfully. If that file is not
> deleted, nothing is compacted.
> This is probably a very severe bug. Both 0.13 and 0.14 have this issue.
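For illustration, here is a minimal sketch of what a removeLastDeltaIfNotCompactable() could look like. This is only a hypothetical sketch, not the actual HIVE-8966 patch: the class name DeltaFilter is made up, it assumes delta directories named delta_<minTxn>_<maxTxn>, it detects an open transaction batch via the presence of a bucket_N_flush_length side file, and it uses plain java.io.File in place of Hive's internal AcidUtils/Hadoop Path types.

{code:java}
import java.io.File;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch only -- not the HIVE-8966 patch. DeltaFilter and
// removeLastDeltaIfNotCompactable() are illustrative names.
public class DeltaFilter {

  // Parses the max transaction id from a name like "delta_00021_00030".
  private static long maxTxnOf(File delta) {
    String[] parts = delta.getName().split("_");
    return Long.parseLong(parts[2]);
  }

  // An open transaction batch is detected by the presence of a
  // bucket_N_flush_length side file left by hcatalog streaming.
  private static boolean hasOpenBatch(File delta) {
    File[] children = delta.listFiles();
    if (children == null) {
      return false;
    }
    for (File f : children) {
      if (f.getName().endsWith("_flush_length")) {
        return true;
      }
    }
    return false;
  }

  // Sorts the deltas by max transaction id and drops only the most recent
  // one if it still has an open batch. Earlier open deltas are deliberately
  // kept in the list, so the compaction fails on their flush_length files
  // instead of producing a base that would let the cleaner delete
  // uncompacted data. The list passed in must be mutable.
  public static List<File> removeLastDeltaIfNotCompactable(List<File> deltas) {
    deltas.sort(Comparator.comparingLong(DeltaFilter::maxTxnOf));
    if (!deltas.isEmpty() && hasOpenBatch(deltas.get(deltas.size() - 1))) {
      deltas.remove(deltas.size() - 1);
    }
    return deltas;
  }
}
{code}

In the two-delta example above, this sketch would keep delta_00011_00020 in the list, so the compaction fails safely on its flush_length file; it would drop delta_00021_00030 only if that most recent delta were the one still open.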