[jira] [Commented] (HIVE-13348) Add Event Nullification support for Replication

Sushanth Sowmyan (JIRA) Mon, 25 Apr 2016 13:11:37 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-13348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256953#comment-15256953
 ]


Sushanth Sowmyan commented on HIVE-13348:
-----------------------------------------

Sorry, to clarify: the idea is not to nullify the events in the main eventlog 
itself. We will still maintain those, and they are currently under the purview 
of the metastore. The idea concerns what happens when a program calls 
HCatClient.getReplicationTasks, which exposes an Iterator<ReplicationTask>. 
Currently there is a 1:1 map from Event to ReplicationTask, and we should 
ideally have a many-to-one map.

Thus, this filtering would be downstream of the actual collection of events; it 
would be in-stream for the processing of replication events.
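
A minimal sketch of what such in-stream filtering might look like, covering 
only the CREATE-DROP and APPEND-APPEND cases from the issue description. The 
EventNullifier and Event types, the three event type strings, and the choice 
to reset APPEND tracking on CREATE and DROP are all assumptions made for 
illustration; none of this is existing Hive or HCatalog API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these types exist in Hive or HCatalog.
final class EventNullifier {

    /** Simplified stand-in for a metastore notification event. */
    static final class Event {
        final long id;
        final String type;   // "CREATE", "APPEND" or "DROP" in this sketch
        final String object; // table or partition the event refers to

        Event(long id, String type, String object) {
            this.id = id;
            this.type = type;
            this.object = object;
        }
    }

    /** Returns the events that survive nullification, in their original order. */
    static List<Event> nullify(List<Event> batch) {
        Set<Long> dead = new HashSet<>();
        // Events seen so far for objects that were created inside this batch.
        Map<String, List<Event>> sinceCreate = new HashMap<>();
        // Most recent APPEND per object that has not been superseded yet.
        Map<String, Event> lastAppend = new HashMap<>();

        for (Event e : batch) {
            if (e.type.equals("CREATE")) {
                List<Event> span = new ArrayList<>();
                span.add(e);
                sinceCreate.put(e.object, span);
                lastAppend.remove(e.object);
            } else if (e.type.equals("DROP")) {
                lastAppend.remove(e.object);
                List<Event> span = sinceCreate.remove(e.object);
                if (span != null) {
                    // CREATE-DROP: the object never needs to reach the destination.
                    for (Event x : span) dead.add(x.id);
                    dead.add(e.id);
                }
            } else {
                if (e.type.equals("APPEND")) {
                    // APPEND-APPEND: a later APPEND supplants the earlier one.
                    Event prev = lastAppend.put(e.object, e);
                    if (prev != null) dead.add(prev.id);
                }
                List<Event> span = sinceCreate.get(e.object);
                if (span != null) span.add(e);
            }
        }

        List<Event> kept = new ArrayList<>();
        for (Event e : batch) {
            if (!dead.contains(e.id)) kept.add(e);
        }
        return kept;
    }
}
```

For a batch of CREATE t1 (#34), APPEND t1 (#35), APPEND t2 (#36), APPEND t2 
(#40) and DROP t1 (#47), only event #40 survives. A DROP of an object that 
was not created within the same batch is kept, since the destination still 
needs to see it.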

Or are you suggesting that even for replication, we should allow the capability 
to send along noop-replication-tasks as marker tasks for those events which 
were nullified, so we can have an audit on the destination? That could be done 
too, and would be performant as well.
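
To make the marker-task option concrete, here is a rough sketch in which each 
surviving event becomes one real task and each run of consecutive nullified 
events collapses into a single noop task that records their ids for auditing. 
The MarkerTasks and Task types are hypothetical, and collapsing runs into one 
marker is just one possible design, not something ReplicationTask does today.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; Task is not the real HCatalog ReplicationTask.
final class MarkerTasks {

    static final class Task {
        final boolean noop;        // true for an audit-only marker task
        final List<Long> eventIds; // source events this task stands for

        Task(boolean noop, List<Long> eventIds) {
            this.noop = noop;
            this.eventIds = eventIds;
        }
    }

    /** Maps an ordered event id stream to tasks, given the set of nullified ids. */
    static List<Task> toTasks(List<Long> eventIds, Set<Long> nullified) {
        List<Task> tasks = new ArrayList<>();
        List<Long> run = new ArrayList<>(); // current run of nullified events

        for (Long id : eventIds) {
            if (nullified.contains(id)) {
                run.add(id);
                continue;
            }
            if (!run.isEmpty()) {
                // Flush the pending nullified run as a single noop marker.
                tasks.add(new Task(true, run));
                run = new ArrayList<>();
            }
            List<Long> single = new ArrayList<>();
            single.add(id);
            tasks.add(new Task(false, single));
        }
        if (!run.isEmpty()) {
            tasks.add(new Task(true, run));
        }
        return tasks;
    }
}
```

With event ids 34, 35, 36, 40, 47 and nullified ids 34, 35, 36, 47, this 
yields three tasks: a noop marker for 34-36, a real task for 40, and a noop 
marker for 47. The destination can log the markers without doing any 
export-distcp-import work for them.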

> Add Event Nullification support for Replication
> -----------------------------------------------
>
>                 Key: HIVE-13348
>                 URL: https://issues.apache.org/jira/browse/HIVE-13348
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Import/Export
>            Reporter: Sushanth Sowmyan
>              Labels: gsoc2016
>
> Replication, as implemented by HIVE-7973, works as follows:
> a) For every single modification to the Hive metastore, an event gets 
> triggered that logs a notification object.
> b) Replication tools such as Falcon can consume these notification objects as 
> a HCatReplicationTaskIterator from 
> HCatClient.getReplicationTasks(lastEventId, maxEvents, dbName, tableName).
> c) For each event, we generate statements and distcp requirements for Falcon 
> to export, distcp and import to do the replication (along with requisite 
> changes to export and import that would allow state management).
> The big thing missing from this picture is that while it works, it is pretty 
> dumb about how it works in that it will exhaustively process every single 
> event generated, and will try to do the export-distcp-import cycle for all 
> modifications, irrespective of whether or not that will actually get used at 
> import time.
> We need to build some sort of filtering logic which can process a batch of 
> events to identify events that will result in effective no-ops, and to 
> nullify those events from the stream before passing them on. The goal is to 
> minimize the number of events that tools like Falcon would actually have 
> to process.
> Examples of cases where event nullification would take place:
> a) CREATE-DROP cases: If an object is created in event#34 and is 
> eventually dropped in event#47, then there is no point in replicating 
> it. We simply null out both of these events, as well as any other event 
> that references this object between event#34 and event#47.
> b) APPEND-APPEND: Some objects are replicated wholesale, which means every 
> APPEND that occurs would cause a full export of the object in question. At 
> this point, the prior APPENDs would all be supplanted by the last APPEND. 
> Thus, we could nullify all the prior such events.
> Additional such cases can be inferred by analysis of the Export-Import relay 
> protocol definition at 
> https://issues.apache.org/jira/secure/attachment/12725999/EXIMReplicationReplayProtocol.pdf
>  or by reasoning out various event processing orders possible.
> Replication, as implemented by HIVE-7973, is merely a first step for 
> functional support. This work is needed for replication to be efficient at 
> all, and thus, usable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
