[ https://issues.apache.org/jira/browse/SOLR-15371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363861#comment-17363861 ]
Jason Gerlowski commented on SOLR-15371:
----------------------------------------

bq. I may have forgot to mention, but the directory does not get created for the failing shard

This makes sense, I think. The directory corresponding to "location+name" ( {{file:///mnt/solr-backups/search/search-06-14-2021}} in your case) is created by the overseer node [here|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/api/collections/BackupCmd.java#L97], right before Solr sends any core-level requests to back up the individual cores. So that location+name base dir should definitely exist: if the {{createDirectory}} call had failed, Solr wouldn't have made any of the per-shard requests. It also makes sense that the core backups that finish successfully are able to create sub-trees of their own backed-up files within that directory.

I can only think of two things that would cause the behavior you're seeing:

# There is some network blip with your NFS that you don't know about. Or maybe your NFS isn't set up to guarantee absolutely up-to-date information about file existence and contents. But you said you've done your due diligence in checking for errors of that sort.
# {{LocalFileSystemRepository.createDirectory}} is asynchronous in some way. If it is synchronous, then it's impossible for code triggered after that call to report the dir missing; if it is asynchronous, it's easier to imagine this happening. {{LocalFileSystemRepository.createDirectory}} is implemented as a passthrough to [java.nio.file.Files.createDirectory|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/nio/file/Files.html#createDirectory(java.nio.file.Path,java.nio.file.attribute.FileAttribute...)], which is well documented and says nothing about asynchronicity. I know the nio library does support some asynchronous operations, so maybe this is one of those, but most of the information I'm finding online says that directory operations are synchronous.
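One way to probe the second possibility is a small standalone check. This is only a minimal sketch, not Solr code: the class {{DirectoryVisibilityCheck}} and its method names are hypothetical, and it verifies visibility only from the same JVM on the same mount. It creates a directory with {{Files.createDirectory}} and immediately asks whether the directory exists.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical diagnostic helper; not part of Solr.
final class DirectoryVisibilityCheck {

    // Creates `dir` and reports whether it is visible as a directory
    // immediately afterwards, from this JVM and this mount only.
    static boolean createAndVerify(Path dir) {
        try {
            // Throws if the parent is missing or `dir` already exists.
            Files.createDirectory(dir);
            return Files.isDirectory(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper that runs the same check under a fresh temp directory.
    static boolean createAndVerifyInTempDir() {
        try {
            Path base = Files.createTempDirectory("backup-check");
            return createAndVerify(base.resolve("search-06-14-2021"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass a not-yet-existing path on the shared mount to test there;
        // with no argument the check runs in a local temp directory.
        boolean visible = args.length > 0
                ? createAndVerify(Path.of(args[0]))
                : createAndVerifyInTempDir();
        System.out.println("visible immediately after createDirectory: " + visible);
    }
}
```

A {{true}} result here says nothing about what a different node sees through its own NFS client, so a fuller test would create the directory on one node and check for it from another.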
Something to dig into I guess.

> Backups randomly fail sometimes
> -------------------------------
>
>                 Key: SOLR-15371
>                 URL: https://issues.apache.org/jira/browse/SOLR-15371
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: Backup/Restore
>    Affects Versions: 8.5.2, 8.8.2
>            Reporter: Roy Perkins
>            Priority: Major
>
> Hi, we have an issue where sometimes one shard fails to backup due to what
> might be a race condition in creating the folder/starting the backup. When
> this happens, we have to restart the first server in a shard to get the
> backup to succeed again. The cluster backs up to a shared NFS mount. 4/5
> times the backup goes fine without issues (there is even another collection
> that the backup will run for later in the morning that will succeed fine even
> though it's all the same servers) Below is the error I get.
> {code:java}
> "Response":"Failed to backup core=slprod_shard4_replica_n6 because
> org.apache.solr.common.SolrException: Directory to contain snapshots doesn't
> exist: file:///mnt/solr_backups/slprod/slprod-04-25-2021. Note that
> Backup/Restore of a SolrCloud collection requires a shared file system
> mounted at the same path on all nodes!"},
> {code}
> And below is the line I use to backup with (obviously with bash variables set
> earlier in the script)
> {code:java}
> curl -s
> "http://localhost:8983/solr/admin/collections?action=BACKUP&name=${COLLECTION}-${DATE}&collection=${COLLECTION}&location=${BACKUP_PATH}&async=1000"
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org