[ https://issues.apache.org/jira/browse/HIVE-21197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
mahesh kumar behera updated HIVE-21197:
---------------------------------------
    Summary: Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled  (was: Hive Replication can add duplicate data during migration Hive replication to a target with hive.strict.managed.tables enabled)

> Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-21197
>                 URL: https://issues.apache.org/jira/browse/HIVE-21197
>             Project: Hive
>          Issue Type: Task
>          Components: repl
>            Reporter: mahesh kumar behera
>            Assignee: mahesh kumar behera
>            Priority: Major
>
> During the bootstrap phase, the files copied to the target may include files
> created by events that are not part of the bootstrap. This happens because
> bootstrap first gets the last event id and only then builds the file list.
> If an event occurs between those two steps, the bootstrap also includes the
> files created by that event, and the same files are copied again during the
> first incremental replication just after the bootstrap.
>
> In the normal scenario, the duplicate copy does not cause any issue, because
> Hive allows the use of the target database only after the first incremental
> load. In the case of migration, however, the location of a file at the target
> differs from its location at the source (it is based on the write id at the
> target), so the second copy may lead to duplicate data at the target.
>
> This can be avoided by checking for duplicate files at load time. The check
> is needed only for the first incremental load, and the search can be limited
> to the bootstrap directory (with write id 1). If the file is already present,
> the copy is simply skipped.
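
The proposed load-time check can be illustrated with a small stand-alone sketch. This is not Hive code: the function name, the directory layout, and the choice to match files by name are all assumptions made for illustration. The issue only says that the first incremental load should look in the bootstrap directory (write id 1) and skip a file that is already there.

```python
import shutil
import tempfile
from pathlib import Path


def copy_unless_in_bootstrap(src_file: Path, dest_dir: Path,
                             bootstrap_dir: Path, first_incremental: bool) -> bool:
    """Copy src_file into dest_dir unless the bootstrap load already copied it.

    Returns True if the file was copied, False if the copy was skipped.
    Matching on the file name alone is an assumption of this sketch.
    """
    # Only the first incremental load can overlap with the bootstrap dump,
    # so later loads skip the lookup entirely.
    if first_incremental and (bootstrap_dir / src_file.name).exists():
        return False
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src_file, dest_dir / src_file.name)
    return True


def demo() -> list:
    """Simulate a bootstrap that raced with one event, then the first incremental."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "source"
        # Hypothetical target layout: one directory per write id.
        bootstrap = root / "target" / "writeid_1"
        incremental = root / "target" / "writeid_2"
        source.mkdir(parents=True)
        bootstrap.mkdir(parents=True)

        (source / "000000_0").write_text("rows written by the racing event")
        (source / "000001_0").write_text("rows written by a later event")

        # The bootstrap file list already included the racing event's file.
        shutil.copy2(source / "000000_0", bootstrap / "000000_0")

        # The first incremental load replays both events.
        return [
            copy_unless_in_bootstrap(source / name, incremental, bootstrap, True)
            for name in ("000000_0", "000001_0")
        ]
```

In this sketch `demo()` skips the file that the bootstrap already copied and copies only the new one. Restricting the lookup to the first incremental load keeps the extra existence check off the path of every later load.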