keith-turner commented on code in PR #5416:
URL: https://github.com/apache/accumulo/pull/5416#discussion_r2021879312


##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -568,10 +573,13 @@ private class MigrationCleanupThread implements Runnable {
     @Override
     public void run() {
       while (stillManager()) {
-        if (!migrations.isEmpty()) {
+        // migrations are stored in the metadata tables which cannot be read 
until the
+        // TabletGroupWatchers are started
+        if (watchersStarted.get() && numMigrations() > 0) {

Review Comment:
   Good point.  In general the dependency on the tablet group watcher should 
not matter.  A thread will block trying to scan the metadata table until the 
TGW assigns the tablets for the metadata.  However it may matter in this case 
if the scan blocking prevents work from being done that the TGW depends on.  
That may be the case here because there is a single thread that is scanning all 
three data levels to clean up migrations.  So if the user level is not online, 
then this thread could get stuck and will not be able to clean up the root and 
metadata levels.  If the migrations in the root level must be cleaned up in 
order for the user level to be assigned by the TGW, then we may end dead locked 
where the system can never make progress.  Not sure if this can happen, I need 
to look at the overall code some more.  I am wondering if a situation like the 
following happen in the current code?
   
    * There is migration to a dead tserver in the root tablet that needs to be 
cleaned up before the metadata table can be assigned.
    * The migration cleanup code is stuck trying to scan the metadata table and 
therefore will not clean up the root table migration to a dead tserver. 
   
   If that can happen, the simplest fix would be to run a migration cleanup 
thread for each data level. This would break the deadlock caused by a single 
thread doing all the levels.  Another possible solution would be to work 
migration clean up into the TGW workflow, but that would probably be best as 
its own PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to