[ https://issues.apache.org/jira/browse/SOLR-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893024#comment-17893024 ]
David Smiley commented on SOLR-17497:
-------------------------------------

I'm confused; is this one JIRA issue about two different exceptions?

> Pull replicas throws AlreadyClosedException
> ---------------------------------------------
>
>                 Key: SOLR-17497
>                 URL: https://issues.apache.org/jira/browse/SOLR-17497
>             Project: Solr
>          Issue Type: Task
>            Reporter: Sanjay Dutt
>            Priority: Major
>         Attachments: Screenshot 2024-10-23 at 6.01.02 PM.png
>
>
> Recently, a common exception (org.apache.lucene.store.AlreadyClosedException: this Directory is closed) has been seen in multiple failed test cases.
> FAILED: org.apache.solr.cloud.TestPullReplica.testKillPullReplica
> FAILED: org.apache.solr.cloud.SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull
> FAILED: org.apache.solr.cloud.TestPullReplica.testAddDocs
>
> {code:java}
> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=10271, name=fsyncService-6341-thread-1, state=RUNNABLE, group=TGRP-SplitShardWithNodeRoleTest]
>     at __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4:E5DB3E97188A8EB9]:0)
> Caused by: org.apache.lucene.store.AlreadyClosedException: this Directory is closed
>     at __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4]:0)
>     at app//org.apache.lucene.store.BaseDirectory.ensureOpen(BaseDirectory.java:50)
>     at app//org.apache.lucene.store.ByteBuffersDirectory.sync(ByteBuffersDirectory.java:237)
>     at app//org.apache.lucene.tests.store.MockDirectoryWrapper.sync(MockDirectoryWrapper.java:214)
>     at app//org.apache.solr.handler.IndexFetcher$DirectoryFile.sync(IndexFetcher.java:2034)
>     at app//org.apache.solr.handler.IndexFetcher$FileFetcher.lambda$fetch$0(IndexFetcher.java:1803)
>     at app//org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$1(ExecutorUtil.java:449)
>     at java.base@11.0.24/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base@11.0.24/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base@11.0.24/java.lang.Thread.run(Thread.java:829)
> {code}
>
> An interesting thing about these test cases is that they all share the same kind of setup: each has one shard and two replicas, one NRT and one PULL.
>
> Walking through the execution steps of one of the test cases:
> FAILED: org.apache.solr.cloud.TestPullReplica.testKillPullReplica
>
> Test flow
> 1. Create a collection with 1 NRT and 1 PULL replica
> 2. waitForState
> 3. waitForNumDocsInAllActiveReplicas(0); // *Name says it all*
> 4. Index another document.
> 5. waitForNumDocsInAllActiveReplicas(1);
> 6. Stop Pull replica
> 7. Index another document
> 8. waitForNumDocsInAllActiveReplicas(2);
> 9. Start Pull Replica
> 10. waitForState
> 11. waitForNumDocsInAllActiveReplicas(2);
>
> According to the logs, the whole sequence executed successfully. Here is the link to the logs:
> [https://ge.apache.org/s/yxydiox3gvlf2/tests/task/:solr:core:test/details/org.apache.solr.cloud.TestPullReplica/testKillPullReplica/1/output]
> (link may stop working in the future)
>
> The last step, which makes sure that every active replica has two documents, logged an INFO message, which is further proof that it completed successfully.
>
> {code:java}
> 616575 INFO (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) [n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node3 (https://127.0.0.1:35647/solr/pull_replica_test_kill_pull_replica_shard1_replica_n1/) has all 2 docs
> 616606 INFO (qtp1091538342-13057-null-11348) [n:127.0.0.1:38207_solr c:pull_replica_test_kill_pull_replica s:shard1 r:core_node4 x:pull_replica_test_kill_pull_replica_shard1_replica_p2 t:null-11348] o.a.s.c.S.Request webapp=/solr path=/select params={q=*:*&wt=javabin&version=2} rid=null-11348 hits=2 status=0 QTime=0
> 616607 INFO (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) [n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node4 (https://127.0.0.1:38207/solr/pull_replica_test_kill_pull_replica_shard1_replica_p2/) has all 2 docs
> {code}
>
> *Where is the issue then?*
> The logs show that after the PULL replica was restarted, the recovery process started and, after fetching all the file info from the NRT replica, the replication was aborted and "User aborted replication" was logged.
>
> {code:java}
> o.a.s.h.IndexFetcher User aborted Replication => org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted replication
>     at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1826)
> org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted replication
> {code}
>
> Inside IndexFetcher, once the fetch is aborted, it calls cleanup(), which closes the file and deletes it only if the number of bytes downloaded does not equal the expected size.
> {code:java}
> private void cleanup() {
>   try {
>     file.close();
>   } catch (Exception e) {
>     /* no-op */
>     log.error("Error closing file: {}", this.saveAs, e);
>   }
>   if (bytesDownloaded != size) {
>     // if the download is not complete then
>     // delete the file being downloaded
>     try {
>       file.delete();
>     } catch (Exception e) {
>       log.error("Error deleting file: {}", this.saveAs, e);
>     }
>     // if the failure is due to a user abort it is returned normally else an exception is thrown
>     if (!aborted)
>       throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
>           "Unable to download " + fileName + " completely. Downloaded " + bytesDownloaded + "!=" + size);
>   }
> }
> {code}
> After that, a sync operation is performed on a separate thread, and that is where it fails.
> {code:java}
> fsyncService.execute(
>     () -> {
>       try {
>         file.sync();
>       } catch (IOException e) {
>         fsyncException = e;
>       }
>     });
> {code}
> Now two questions:
> 1. Why would replication be aborted in the first place, and who triggers the abort?
> 2. Should the sync be skipped when the replication is aborted?
>
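
On the second question, here is a minimal sketch of what skipping the sync after an abort could look like. It does not use Solr or Lucene: SketchFile and FsyncAfterAbortSketch are hypothetical stand-ins, the skipSyncWhenAborted flag is invented for illustration, and the assumption that the backing resource is already closed by the time the fsync task runs is taken from the stack trace rather than from the real IndexFetcher control flow. As far as I know, AlreadyClosedException is unchecked (it extends IllegalStateException), which would explain why the catch (IOException e) in the quoted snippet never sees it; the sketch throws IllegalStateException to mimic that.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a fetched file whose backing directory can be closed. */
final class SketchFile {
  private volatile boolean closed;

  void close() {
    closed = true;
  }

  void sync() throws IOException {
    if (closed) {
      // Unchecked, like the failure in the stack trace, so catch (IOException) misses it.
      throw new IllegalStateException("this Directory is closed");
    }
  }
}

/** Hypothetical driver: close first (as after an abort), then optionally schedule a sync. */
final class FsyncAfterAbortSketch {

  /** Returns the throwable raised by the sync task, or null if no failure occurred. */
  static Throwable run(boolean skipSyncWhenAborted) {
    SketchFile file = new SketchFile();
    boolean aborted = true; // assumption: the fetch was aborted before the sync was scheduled
    AtomicReference<Throwable> failure = new AtomicReference<>();
    ExecutorService fsyncService = Executors.newSingleThreadExecutor();
    try {
      file.close(); // the resource is gone before the fsync task runs
      if (!(aborted && skipSyncWhenAborted)) {
        fsyncService.execute(
            () -> {
              try {
                file.sync();
              } catch (IOException | RuntimeException e) {
                // Catch the unchecked case too, only so the sketch can report it.
                failure.set(e);
              }
            });
      }
    } finally {
      fsyncService.shutdown();
      try {
        fsyncService.awaitTermination(5, TimeUnit.SECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return failure.get();
  }

  public static void main(String[] args) {
    System.out.println("sync after abort: " + run(false));
    System.out.println("sync skipped:     " + run(true));
  }
}
```

In this sketch, run(false) returns the IllegalStateException raised by the late sync, while run(true) returns null because no sync task is submitted. Whether skipping the sync is the right fix still depends on the answer to the first question.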