vvcephei commented on a change in pull request #8248:
URL: https://github.com/apache/kafka/pull/8248#discussion_r431251657
##########
File path: streams/src/main/java/org/apache/kafka/streams/processor/internals/StandbyTask.java
##########
@@ -189,38 +190,52 @@ private void prepareClose(final boolean clean) {

     @Override
     public void closeClean(final Map<TopicPartition, Long> checkpoint) {
         Objects.requireNonNull(checkpoint);
-        close(true);
+        close(true, false);
         log.info("Closed clean");
     }

     @Override
     public void closeDirty() {
-        close(false);
+        close(false, false);
         log.info("Closed dirty");
     }

-    private void close(final boolean clean) {
+    @Override
+    public void closeAndRecycleState() {
+        prepareClose(true);
+        close(true, true);

Review comment:
   When I introduced `closeClean` and `closeDirty`, I resisted the urge to inline `close(boolean)` only to keep down the LOC of the change. Having more branches and flags is generally more of a liability than a few duplicated statements. Now that we have two boolean flags (again) and a new branch in the internal `close` method, I'd be much more inclined to inline it. But we can do this in a follow-on PR, if you prefer.
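[Editor's note] The inlining this comment argues for can be sketched as follows. This is a minimal toy class, not the actual Kafka Streams `StandbyTask`; `commit`, `releaseState`, and `recycleState` are invented placeholders standing in for whatever the flag-driven `close(boolean, boolean)` would branch between:

```java
// Sketch only: illustrates "duplicate a few statements instead of threading
// boolean flags through one internal close method". All names are hypothetical.
public class TaskCloseSketch {
    boolean committed;
    boolean stateReleased;
    boolean stateRecycled;

    public void closeClean() {
        commit();           // duplicated across the clean paths, but branch-free
        releaseState();
    }

    public void closeDirty() {
        releaseState();     // no commit on the dirty path; nothing to branch on
    }

    public void closeAndRecycleState() {
        commit();
        recycleState();     // hand state to the new task instead of releasing it
    }

    private void commit() { committed = true; }

    private void releaseState() { stateReleased = true; }

    private void recycleState() { stateRecycled = true; }

    public static void main(final String[] args) {
        final TaskCloseSketch task = new TaskCloseSketch();
        task.closeAndRecycleState();
        System.out.println("committed=" + task.committed
                + " released=" + task.stateReleased
                + " recycled=" + task.stateRecycled);
    }
}
```

The trade-off named in the comment is visible here: each public method repeats a call or two, but no reader has to decode what a call like `close(true, false)` means.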
##########
File path: streams/src/main/java/org/apache/kafka/streams/processor/internals/StoreChangelogReader.java
##########
@@ -811,17 +812,41 @@ private void prepareChangelogs(final Set<ChangelogMetadata> newPartitionsToResto
         }
     }

+    private RuntimeException invokeOnRestoreEnd(final TopicPartition partition,
+                                                final ChangelogMetadata changelogMetadata) {
+        // only trigger the store's specific listener to make sure we disable bulk loading before transition to standby
+        final StateStoreMetadata storeMetadata = changelogMetadata.storeMetadata;
+        final StateRestoreCallback restoreCallback = storeMetadata.restoreCallback();
+        final String storeName = storeMetadata.store().name();
+        if (restoreCallback instanceof StateRestoreListener) {
+            try {
+                ((StateRestoreListener) restoreCallback).onRestoreEnd(partition, storeName, changelogMetadata.totalRestored);
+            } catch (final RuntimeException e) {
+                return e;
+            }
+        }
+        return null;
+    }
+
     @Override
-    public void remove(final Collection<TopicPartition> revokedChangelogs) {
-        // Only changelogs that are initialized that been added to the restore consumer's assignment
+    public void unregister(final Collection<TopicPartition> revokedChangelogs,
+                           final boolean triggerOnRestoreEnd) {
+        final AtomicReference<RuntimeException> firstException = new AtomicReference<>(null);
+
+        // Only changelogs that are initialized have been added to the restore consumer's assignment
         final List<TopicPartition> revokedInitializedChangelogs = new ArrayList<>();
         for (final TopicPartition partition : revokedChangelogs) {
             final ChangelogMetadata changelogMetadata = changelogs.remove(partition);
             if (changelogMetadata != null) {
-                if (changelogMetadata.state() != ChangelogState.REGISTERED) {
+                if (triggerOnRestoreEnd && changelogMetadata.state().equals(ChangelogState.RESTORING)) {

Review comment:
   I think this makes sense. The restore listener / bulk loading interaction is a bit wacky, but it seems reasonable to just work around it for now.
   Just to play devil's advocate briefly, though: is it not true for _all_ listeners that the restore has ended, for exactly the reason you cited above?

##########
File path: streams/src/main/java/org/apache/kafka/streams/processor/internals/TaskManager.java
##########
@@ -272,6 +274,30 @@ public void handleAssignment(final Map<TaskId, Set<TopicPartition>> activeTasks,
             }
         }

+        if (taskCloseExceptions.isEmpty()) {
+            for (final Task oldTask : tasksToRecycle) {
+                final Task newTask;
+                try {
+                    if (oldTask.isActive()) {
+                        final Set<TopicPartition> partitions = standbyTasksToCreate.remove(oldTask.id());
+                        newTask = standbyTaskCreator.createStandbyTaskFromActive((StreamTask) oldTask, partitions);
+                    } else {
+                        final Set<TopicPartition> partitions = activeTasksToCreate.remove(oldTask.id());
+                        newTask = activeTaskCreator.createActiveTaskFromStandby((StandbyTask) oldTask, partitions, mainConsumer);
+                    }
+                    tasks.remove(oldTask.id());
+                    addNewTask(newTask);
+                } catch (final RuntimeException e) {
+                    final String uncleanMessage = String.format("Failed to recycle task %s cleanly. Attempting to close remaining tasks before re-throwing:", oldTask.id());
+                    log.error(uncleanMessage, e);
+                    taskCloseExceptions.put(oldTask.id(), e);
+                    dirtyTasks.add(oldTask);
+                }
+            }
+        } else {
+            dirtyTasks.addAll(tasksToRecycle);
+        }

Review comment:
   This seems like an odd case. Am I right in reading this as, "something went wrong, and we don't know what it was, so we're just going to assume the worst and dump all the tasks that we were hoping to recycle"?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
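[Editor's note] Read the way the last review comment suggests, the `TaskManager` hunk amounts to the following fallback policy. This is a self-contained toy sketch, not the real `TaskManager`: task ids are plain strings, and `recycle` is a hypothetical stand-in for `createStandbyTaskFromActive` / `createActiveTaskFromStandby`:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: recycle candidates one by one, but only if no close has failed
// yet; otherwise (or on any recycling failure) route the task to a dirty close.
public class RecycleFallbackSketch {

    // Returns the ids that must be closed dirty rather than recycled.
    static List<String> recycleOrFallBack(final Map<String, RuntimeException> taskCloseExceptions,
                                          final List<String> tasksToRecycle) {
        final List<String> dirtyTasks = new ArrayList<>();
        if (!taskCloseExceptions.isEmpty()) {
            // An earlier close already failed for an unknown reason: assume the
            // worst and abandon recycling for every remaining candidate.
            dirtyTasks.addAll(tasksToRecycle);
            return dirtyTasks;
        }
        for (final String taskId : tasksToRecycle) {
            try {
                recycle(taskId); // hypothetical: convert active <-> standby in place
            } catch (final RuntimeException e) {
                taskCloseExceptions.put(taskId, e);
                dirtyTasks.add(taskId);
            }
        }
        return dirtyTasks;
    }

    // Toy recycling step that fails for ids marked "bad".
    static void recycle(final String taskId) {
        if (taskId.startsWith("bad")) {
            throw new RuntimeException("failed to recycle " + taskId);
        }
    }

    public static void main(final String[] args) {
        // No prior failures: only the task whose own recycling throws goes dirty.
        System.out.println(recycleOrFallBack(new HashMap<>(), List.of("0_1", "bad_2")));
        // A prior failure anywhere: every candidate is dumped to a dirty close.
        final Map<String, RuntimeException> prior = new HashMap<>();
        prior.put("0_0", new RuntimeException("close failed"));
        System.out.println(recycleOrFallBack(prior, List.of("0_1", "0_2")));
    }
}
```

The second branch is the "assume the worst" case the reviewer is questioning: a failure in one task's close disqualifies all other tasks from recycling, even though their state may be intact.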