[ 
https://issues.apache.org/jira/browse/SOLR-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated SOLR-16043:
--------------------------------
    Description: 
This is caused by SOLR-15942. Some background:
* originally discussed in 3.3.1 PR - 
https://github.com/apache/solr/pull/553#discussion_r792068327
* Failures so far from jenkins
** https://jenkins.thetaphi.de/job/Solr-main-Linux/3116/consoleText
** https://jenkins.thetaphi.de/job/Solr-main-Linux/3109/consoleText
** https://jenkins.thetaphi.de/job/Solr-main-MacOSX/678/consoleText

It looks like it might be trying to shut down:

{code:java}
 2> 31835 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command processor 
encountered interrupt and exit.
  2> 31835 WARN  (BP-1491942626-127.0.0.1-1645493532707 heartbeating to 
localhost/127.0.0.1:51572) [] o.a.h.h.s.d.IncrementalBlockReportManager 
IncrementalBlockReportManager interrupted
  2> 31835 WARN  (Command processor) [] o.a.h.h.s.d.DataNode Ending command 
processor service for: Thread[Command 
processor,5,TGRP-HdfsBackupRepositoryIntegrationTest]
{code}

the error message looks like (this from :

{code:java}
   2> 31948 WARN  (Listener at localhost/51675) [] 
o.a.h.h.s.d.f.i.FsDatasetAsyncDiskService AsyncDiskService has already shut 
down.
  2> 31948 WARN  (Listener at localhost/51675) [] 
o.a.h.h.s.d.f.i.RamDiskAsyncLazyPersistService AsyncLazyPersistService has 
already shut down.
  2> 31948 INFO  (Listener at localhost/51675) [] o.a.h.h.s.d.DataNode Shutdown 
complete.
  2> 32290 INFO  (Listener at localhost/51675) [] o.a.s.u.ErrorLogMuter Closing 
ErrorLogMuter-regex-1 after mutting 0 log messages
  2> 32290 INFO  (Listener at localhost/51675) [] o.a.s.u.ErrorLogMuter 
Creating ErrorLogMuter-regex-2 for ERROR logs matching regex: ignore_exception
  2> Feb 22, 2022 1:32:34 AM 
com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
  2> WARNING: Will linger awaiting termination of 1 leaked thread(s).
  2> Feb 22, 2022 1:32:35 AM 
com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
  2> SEVERE: 1 thread leaked from SUITE scope at 
org.apache.solr.core.backup.repository.HdfsBackupRepositoryIntegrationTest: 
  2>    1) Thread[id=90, name=Command processor, state=WAITING, 
group=TGRP-HdfsBackupRepositoryIntegrationTest]
  2>         at java.base@16.0.2/jdk.internal.misc.Unsafe.park(Native Method)
  2>         at 
java.base@16.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
  2>         at 
java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:505)
  2>         at 
java.base@16.0.2/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3137)
  2>         at 
java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1614)
  2>         at 
java.base@16.0.2/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
  2>         at 
app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1331)
  2>         at 
app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1315)
  2> Feb 22, 2022 1:32:35 AM 
com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
  2> INFO: Starting to interrupt leaked threads:
  2>    1) Thread[id=90, name=Command processor, state=WAITING, 
group=TGRP-HdfsBackupRepositoryIntegrationTest]
  2> 33504 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command processor 
encountered interrupt and exit.
  2> 33504 WARN  (Command processor) [] o.a.h.h.s.d.DataNode Ending command 
processor service for: Thread[Command 
processor,5,TGRP-HdfsBackupRepositoryIntegrationTest]
  2> Feb 22, 2022 1:32:35 AM 
com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
  2> INFO: All leaked threads terminated.
   >     com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked 
from SUITE scope at 
org.apache.solr.core.backup.repository.HdfsBackupRepositoryIntegrationTest: 
   >        1) Thread[id=90, name=Command processor, state=WAITING, 
group=TGRP-HdfsBackupRepositoryIntegrationTest]
   >             at java.base@16.0.2/jdk.internal.misc.Unsafe.park(Native 
Method)
   >             at 
java.base@16.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
   >             at 
java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:505)
   >             at 
java.base@16.0.2/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3137)
   >             at 
java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1614)
   >             at 
java.base@16.0.2/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
   >             at 
app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1331)
   >             at 
app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1315)
   >         at __randomizedtesting.SeedInfo.seed([1A5FAFA6CE7F7FF0]:0)
  2> NOTE: test params are: codec=Asserting(Lucene90): {}, docValues:{}, 
maxPointsInLeafNode=757, maxMBSortInHeap=5.99773708873607, 
sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=sah-RU, 
timezone=Asia/Macau
  2> NOTE: Mac OS X 10.14.6 x86_64/Eclipse Foundation 16.0.2 
(64-bit)/cpus=6,threads=40,free=159527368,total=271581184
  2> NOTE: All tests run in this JVM: [HdfsBackupRepositoryIntegrationTest]
  2> NOTE: reproduce with: gradlew test --tests 
HdfsBackupRepositoryIntegrationTest -Dtests.seed=1A5FAFA6CE7F7FF0 
-Dtests.slow=true -Dtests.locale=sah-RU -Dtests.timezone=Asia/Macau 
-Dtests.asserts=true -Dtests.file.encoding=UTF-8
{code}

Some related code pointers:
* 
https://github.com/apache/hadoop/blame/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java#L1336
* 
https://github.com/apache/hadoop/blame/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L2017

PS: there is an error about datanode mbean - but it is unrelated as far as I 
can tell - https://issues.apache.org/jira/browse/HDFS-11041


  was:
This is caused by SOLR-15942. Some background:
* originally discussed in 3.3.1 PR - 
https://github.com/apache/solr/pull/553#discussion_r792068327
* Failures so far from jenkins
** https://jenkins.thetaphi.de/job/Solr-main-Linux/3116/consoleText
** https://jenkins.thetaphi.de/job/Solr-main-Linux/3109/consoleText
** https://jenkins.thetaphi.de/job/Solr-main-MacOSX/678/consoleText

It looks like it might be trying to shut down:

{code:java}
 2> 31835 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command processor 
encountered interrupt and exit.
  2> 31835 WARN  (BP-1491942626-127.0.0.1-1645493532707 heartbeating to 
localhost/127.0.0.1:51572) [] o.a.h.h.s.d.IncrementalBlockReportManager 
IncrementalBlockReportManager interrupted
  2> 31835 WARN  (Command processor) [] o.a.h.h.s.d.DataNode Ending command 
processor service for: Thread[Command 
processor,5,TGRP-HdfsBackupRepositoryIntegrationTest]
{code}

the error message is:

{code:java}
  2> SEVERE: 1 thread leaked from SUITE scope at 
org.apache.solr.handler.TestHdfsBackupRestoreCore: 
  2>    1) Thread[id=874, name=Command processor, state=WAITING, 
group=TGRP-TestHdfsBackupRestoreCore]
  2>         at java.base@16.0.2/jdk.internal.misc.Unsafe.park(Native Method)
  2>         at 
java.base@16.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
  2>         at 
java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:505)
  2>         at 
java.base@16.0.2/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3137)
  2>         at 
java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1614)
  2>         at 
java.base@16.0.2/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
  2>         at 
app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1331)
  2>         at 
app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1315)
  2> Feb 22, 2022 12:51:04 AM 
com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
  2> INFO: Starting to interrupt leaked threads:
  2>    1) Thread[id=874, name=Command processor, state=WAITING, 
group=TGRP-TestHdfsBackupRestoreCore]
  2> 46271 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command processor 
encountered interrupt and exit.
  2> 46272 WARN  (Command processor) [] o.a.h.h.s.d.DataNode Ending command 
processor service for: Thread[Command 
processor,5,TGRP-TestHdfsBackupRestoreCore]
  2> Feb 22, 2022 12:51:04 AM 
com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
  2> INFO: All leaked threads terminated.
   >     com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked 
from SUITE scope at org.apache.solr.handler.TestHdfsBackupRestoreCore: 
   >        1) Thread[id=874, name=Command processor, state=WAITING, 
group=TGRP-TestHdfsBackupRestoreCore]
   >             at java.base@16.0.2/jdk.internal.misc.Unsafe.park(Native 
Method)
   >             at 
java.base@16.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
   >             at 
java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:505)
   >             at 
java.base@16.0.2/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3137)
   >             at 
java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1614)
   >             at 
java.base@16.0.2/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
   >             at 
app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1331)
   >             at 
app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1315)
   >         at __randomizedtesting.SeedInfo.seed([239C064719AD54A3]:0)
  2> NOTE: test params are: codec=Asserting(Lucene90): {_root_=Lucene90, 
name=PostingsFormat(name=MockRandom), id=PostingsFormat(name=MockRandom)}, 
docValues:{}, maxPointsInLeafNode=2008, maxMBSortInHeap=7.425444076012971, 
sim=Asserting(RandomSimilarity(queryNorm=true): {}), locale=bs-Cyrl-BA, 
timezone=America/Argentina/San_Luis
  2> NOTE: Linux 5.11.0-41-generic amd64/Eclipse Foundation 16.0.2 
(64-bit)/cpus=16,threads=132,free=90931432,total=259981312
  2> NOTE: All tests run in this JVM: [HdfsCollectionsAPIDistributedZkTest, 
HDFSCollectionsAPITest, HdfsTlogReplayBufferedWhileIndexingTest, 
MoveReplicaHDFSTest, TestHdfsBackupRestoreCore]
  2> NOTE: reproduce with: gradlew test --tests TestHdfsBackupRestoreCore 
-Dtests.seed=239C064719AD54A3 -Dtests.multiplier=3 -Dtests.slow=true 
-Dtests.locale=bs-Cyrl-BA -Dtests.timezone=America/Argentina/San_Luis 
-Dtests.asserts=true -Dtests.file.encoding=UTF-8
{code}

Some related code pointers:
* 
https://github.com/apache/hadoop/blame/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java#L1336
* 
https://github.com/apache/hadoop/blame/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L2017

PS: there is an error about datanode mbean - but it is unrelated as far as I 
can tell - https://issues.apache.org/jira/browse/HDFS-11041



> HDFS tests - "Command processor" thread leak
> --------------------------------------------
>
>                 Key: SOLR-16043
>                 URL: https://issues.apache.org/jira/browse/SOLR-16043
>             Project: Solr
>          Issue Type: Task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: hdfs, Tests
>            Reporter: Kevin Risden
>            Priority: Major
>         Attachments: Solr-main-MACOSX-678_consoleText.txt.gz
>
>
> This is caused by SOLR-15942. Some background:
> * originally discussed in 3.3.1 PR - 
> https://github.com/apache/solr/pull/553#discussion_r792068327
> * Failures so far from jenkins
> ** https://jenkins.thetaphi.de/job/Solr-main-Linux/3116/consoleText
> ** https://jenkins.thetaphi.de/job/Solr-main-Linux/3109/consoleText
> ** https://jenkins.thetaphi.de/job/Solr-main-MacOSX/678/consoleText
> It looks like it might be trying to shut down:
> {code:java}
>  2> 31835 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command processor 
> encountered interrupt and exit.
>   2> 31835 WARN  (BP-1491942626-127.0.0.1-1645493532707 heartbeating to 
> localhost/127.0.0.1:51572) [] o.a.h.h.s.d.IncrementalBlockReportManager 
> IncrementalBlockReportManager interrupted
>   2> 31835 WARN  (Command processor) [] o.a.h.h.s.d.DataNode Ending command 
> processor service for: Thread[Command 
> processor,5,TGRP-HdfsBackupRepositoryIntegrationTest]
> {code}
> the error message looks like (this from :
> {code:java}
>    2> 31948 WARN  (Listener at localhost/51675) [] 
> o.a.h.h.s.d.f.i.FsDatasetAsyncDiskService AsyncDiskService has already shut 
> down.
>   2> 31948 WARN  (Listener at localhost/51675) [] 
> o.a.h.h.s.d.f.i.RamDiskAsyncLazyPersistService AsyncLazyPersistService has 
> already shut down.
>   2> 31948 INFO  (Listener at localhost/51675) [] o.a.h.h.s.d.DataNode 
> Shutdown complete.
>   2> 32290 INFO  (Listener at localhost/51675) [] o.a.s.u.ErrorLogMuter 
> Closing ErrorLogMuter-regex-1 after mutting 0 log messages
>   2> 32290 INFO  (Listener at localhost/51675) [] o.a.s.u.ErrorLogMuter 
> Creating ErrorLogMuter-regex-2 for ERROR logs matching regex: ignore_exception
>   2> Feb 22, 2022 1:32:34 AM 
> com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
>   2> WARNING: Will linger awaiting termination of 1 leaked thread(s).
>   2> Feb 22, 2022 1:32:35 AM 
> com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
>   2> SEVERE: 1 thread leaked from SUITE scope at 
> org.apache.solr.core.backup.repository.HdfsBackupRepositoryIntegrationTest: 
>   2>    1) Thread[id=90, name=Command processor, state=WAITING, 
> group=TGRP-HdfsBackupRepositoryIntegrationTest]
>   2>         at java.base@16.0.2/jdk.internal.misc.Unsafe.park(Native Method)
>   2>         at 
> java.base@16.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
>   2>         at 
> java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:505)
>   2>         at 
> java.base@16.0.2/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3137)
>   2>         at 
> java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1614)
>   2>         at 
> java.base@16.0.2/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
>   2>         at 
> app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1331)
>   2>         at 
> app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1315)
>   2> Feb 22, 2022 1:32:35 AM 
> com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
>   2> INFO: Starting to interrupt leaked threads:
>   2>    1) Thread[id=90, name=Command processor, state=WAITING, 
> group=TGRP-HdfsBackupRepositoryIntegrationTest]
>   2> 33504 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command 
> processor encountered interrupt and exit.
>   2> 33504 WARN  (Command processor) [] o.a.h.h.s.d.DataNode Ending command 
> processor service for: Thread[Command 
> processor,5,TGRP-HdfsBackupRepositoryIntegrationTest]
>   2> Feb 22, 2022 1:32:35 AM 
> com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
>   2> INFO: All leaked threads terminated.
>    >     com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked 
> from SUITE scope at 
> org.apache.solr.core.backup.repository.HdfsBackupRepositoryIntegrationTest: 
>    >        1) Thread[id=90, name=Command processor, state=WAITING, 
> group=TGRP-HdfsBackupRepositoryIntegrationTest]
>    >             at java.base@16.0.2/jdk.internal.misc.Unsafe.park(Native 
> Method)
>    >             at 
> java.base@16.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
>    >             at 
> java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:505)
>    >             at 
> java.base@16.0.2/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3137)
>    >             at 
> java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1614)
>    >             at 
> java.base@16.0.2/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
>    >             at 
> app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1331)
>    >             at 
> app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1315)
>    >         at __randomizedtesting.SeedInfo.seed([1A5FAFA6CE7F7FF0]:0)
>   2> NOTE: test params are: codec=Asserting(Lucene90): {}, docValues:{}, 
> maxPointsInLeafNode=757, maxMBSortInHeap=5.99773708873607, 
> sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=sah-RU, 
> timezone=Asia/Macau
>   2> NOTE: Mac OS X 10.14.6 x86_64/Eclipse Foundation 16.0.2 
> (64-bit)/cpus=6,threads=40,free=159527368,total=271581184
>   2> NOTE: All tests run in this JVM: [HdfsBackupRepositoryIntegrationTest]
>   2> NOTE: reproduce with: gradlew test --tests 
> HdfsBackupRepositoryIntegrationTest -Dtests.seed=1A5FAFA6CE7F7FF0 
> -Dtests.slow=true -Dtests.locale=sah-RU -Dtests.timezone=Asia/Macau 
> -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> {code}
> Some related code pointers:
> * 
> https://github.com/apache/hadoop/blame/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java#L1336
> * 
> https://github.com/apache/hadoop/blame/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L2017
> PS: there is an error about datanode mbean - but it is unrelated as far as I 
> can tell - https://issues.apache.org/jira/browse/HDFS-11041



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to