keith-turner commented on code in PR #5416:
URL: https://github.com/apache/accumulo/pull/5416#discussion_r2021853863


##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -987,67 +928,71 @@ private long balanceTablets() {
       final Map<DataLevel,Set<KeyExtent>> partitionedMigrations = 
partitionMigrations();
       int levelsCompleted = 0;
 
-      for (DataLevel dl : DataLevel.values()) {
-        if (dl == DataLevel.USER && tabletsNotHosted > 0) {
-          log.debug("not balancing user tablets because there are {} unhosted 
tablets",
-              tabletsNotHosted);
-          continue;
-        }
-
-        if ((dl == DataLevel.METADATA || dl == DataLevel.USER)
-            && !partitionedMigrations.get(DataLevel.ROOT).isEmpty()) {
-          log.debug("Not balancing {} because {} has migrations", dl, 
DataLevel.ROOT);
-          continue;
-        }
+      try (var tabletsMutator = getContext().getAmple().mutateTablets()) {

Review Comment:
   When switching this to conditional mutations, will need to do the writes per 
data level because ample does not support writing conditional mutations to 
different data levels at the same time.



##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -568,67 +570,36 @@ private class MigrationCleanupThread implements Runnable {
     @Override
     public void run() {
       while (stillManager()) {
-        if (!migrations.isEmpty()) {
-          try {
-            cleanupOfflineMigrations();
-            cleanupNonexistentMigrations(getContext());
-          } catch (Exception ex) {
-            log.error("Error cleaning up migrations", ex);
+        try {
+          // - Remove any migrations for tablets of offline tables, as the 
migration can never
+          // succeed because no tablet server will load the tablet
+          // - Remove any migrations to tablets that are not live

Review Comment:
   ```suggestion
             // - Remove any migrations to tablet servers that are not live
   ```



##########
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java:
##########
@@ -568,10 +573,13 @@ private class MigrationCleanupThread implements Runnable {
     @Override
     public void run() {
       while (stillManager()) {
-        if (!migrations.isEmpty()) {
+        // migrations are stored in the metadata tables which cannot be read 
until the
+        // TabletGroupWatchers are started
+        if (watchersStarted.get() && numMigrations() > 0) {

Review Comment:
   Good point.  In general the dependency on the tablet group watcher should 
not matter.  A thread will block trying to scan the metadata table until the 
TGW assigns the tablets for the metadata.  However it may matter in this case 
if the scan blocking prevents work from being done that the TGW depends on.  
That may be the case here because there is single thread is scanning all three 
data levels to clean up migrations.  So if the user level is not online, then 
this thread could get stuck and will not be able to clean up the root and 
metadata levels.  If the migrations in the root level must be cleaned up in 
order for the user level to be assigned by the TGW, then we may end dead locked 
where the system can never make progress.  Not sure if this can happen, I need 
to look at the overall code some more.  I am wondering if a situation like the 
following happen in the current code?
   
    * There is migration to a dead tserver in the root tablet that needs to be 
cleaned up before the metadata table can be assigned.
    * The migration cleanup code is stuck trying to scan the metadata table and 
therefore will not clean up the root table migration to a dead tserver. 
   
   If that can happen, the simplest fix would be to run a migration cleanup 
thread for each data level. This would break the deadlock caused by a single 
thread doing all the levels.  Another possible solution would be to work 
migration clean up into the TGW workflow, but that would probably be best as 
its own PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to