[ https://issues.apache.org/jira/browse/HIVE-13348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256953#comment-15256953 ]
Sushanth Sowmyan commented on HIVE-13348:
-----------------------------------------

Sorry, to clarify: the idea is not to nullify the events in the main event log itself. We will still maintain those, and they are currently under the purview of the metastore. The idea is that when a program calls HCatClient.getReplicationTasks, which exposes an Iterator<ReplicationTask>, there is currently a 1:1 map from Event to ReplicationTask, and we should ideally have a many-to-one map. Thus, this filtering would be downstream of the actual collection of events; it would be in-stream for the processing of replication events.

Or are you suggesting that even for replication, we should allow the capability to send along noop-replication-tasks as marker tasks for those events which were nullified, so we can have an audit on the destination? That could be done too, and would be performant as well.

> Add Event Nullification support for Replication
> -----------------------------------------------
>
>                 Key: HIVE-13348
>                 URL: https://issues.apache.org/jira/browse/HIVE-13348
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Import/Export
>            Reporter: Sushanth Sowmyan
>              Labels: gsoc2016
>
> Replication, as implemented by HIVE-7973, works as follows:
> a) For every single modification to the Hive metastore, an event gets
> triggered that logs a notification object.
> b) Replication tools such as Falcon can consume these notification objects as
> a HCatReplicationTaskIterator from
> HCatClient.getReplicationTasks(lastEventId, maxEvents, dbName, tableName).
> c) For each event, we generate statements and distcp requirements for Falcon
> to export, distcp and import to do the replication (along with requisite
> changes to export and import that would allow state management).
>
> The big thing missing from this picture is that while it works, it is pretty
> dumb about how it works: it will exhaustively process every single event
> generated, and will try to do the export-distcp-import cycle for all
> modifications, irrespective of whether or not the result will actually get
> used at import time.
>
> We need to build some sort of filtering logic which can process a batch of
> events to identify events that will result in effective no-ops, and to
> nullify those events from the stream before passing them on. The goal is to
> minimize the number of events that tools like Falcon would actually have
> to process.
>
> Examples of cases where event nullification would take place:
> a) CREATE-DROP cases: If an object is created in event#34 and will
> eventually get dropped in event#47, then there is no point in replicating
> it. We simply null out both these events, and also any other event
> that references this object between event#34 and event#47.
> b) APPEND-APPEND: Some objects are replicated wholesale, which means every
> APPEND that occurs would cause a full export of the object in question. At
> this point, the prior APPENDs would all be supplanted by the last APPEND.
> Thus, we could nullify all the prior such events.
>
> Additional such cases can be inferred by analysis of the Export-Import relay
> protocol definition at
> https://issues.apache.org/jira/secure/attachment/12725999/EXIMReplicationReplayProtocol.pdf
> or by reasoning out various event processing orders possible.
>
> Replication, as implemented by HIVE-7973, is merely a first step for
> functional support. This work is needed for replication to be efficient at
> all, and thus, usable.
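
The two nullification cases described above (CREATE-DROP and APPEND-APPEND) can be illustrated with a small standalone sketch. This is not Hive or HCatalog code: the `Event` type, the event kinds, and the `nullify` function are hypothetical names chosen for illustration. It also assumes that every object is replicated wholesale and that an event "references" an object only when it carries exactly the same object name; a real implementation would need to follow the replay protocol more closely.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a metastore notification event."""
    event_id: int
    kind: str  # e.g. "CREATE", "DROP", "APPEND", "ALTER"
    obj: str   # name of the object the event refers to


def nullify(events: List[Event]) -> List[Event]:
    """Return the events of one batch that survive nullification, in order."""
    dead: Set[int] = set()  # ids of events judged to be effective no-ops

    # Case a) CREATE-DROP: if an object is created and later dropped within
    # the batch, drop the CREATE, the DROP, and every event on that object
    # in between.
    open_create: Dict[str, int] = {}  # object name -> index of its CREATE
    for i, ev in enumerate(events):
        if ev.kind == "CREATE":
            open_create[ev.obj] = i
        elif ev.kind == "DROP" and ev.obj in open_create:
            start = open_create.pop(ev.obj)
            for other in events[start:i + 1]:
                if other.obj == ev.obj:
                    dead.add(other.event_id)

    # Case b) APPEND-APPEND: keep only the last surviving APPEND per object
    # (assumes wholesale replication, so earlier APPENDs are supplanted).
    last_append: Dict[str, int] = {}
    for ev in events:
        if ev.kind == "APPEND" and ev.event_id not in dead:
            last_append[ev.obj] = ev.event_id
    for ev in events:
        if ev.kind == "APPEND" and last_append.get(ev.obj) != ev.event_id:
            dead.add(ev.event_id)

    return [ev for ev in events if ev.event_id not in dead]


if __name__ == "__main__":
    batch = [
        Event(33, "APPEND", "db.keep"),
        Event(34, "CREATE", "db.tmp"),
        Event(35, "APPEND", "db.tmp"),
        Event(40, "ALTER", "db.keep"),
        Event(47, "DROP", "db.tmp"),
        Event(48, "APPEND", "db.keep"),
    ]
    print([ev.event_id for ev in nullify(batch)])  # prints [40, 48]
```

Because the filter only decides which events of a batch are passed on, it leaves the main event log untouched, in line with the clarification in the comment above. Emitting a noop marker task for each nullified event, instead of discarding it, would be a small variation on the final list comprehension.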