[ https://issues.apache.org/jira/browse/SOLR-15371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363163#comment-17363163 ]
Roy Perkins edited comment on SOLR-15371 at 6/14/21, 7:20 PM:
--------------------------------------------------------------

Like I said, it can be hard to reproduce. It seems to happen when the host I run the backup from is the leader for the failing shard of the collection. Below is some output from my backup script:

{code:java}
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "success":{
    "solrmulti03.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0}},
    "solrmulti08.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0}},
    "solrmulti01.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":4}},
    "solrmulti04.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":14}},
    "solrmulti04.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0},
      "STATUS":"completed",
      "Response":"TaskId: 100034112630053395656 webapp=null path=/admin/cores params={core=search_shard2_replica_n4&async=100034112630053395656&qt=/admin/cores&name=shard2&action=BACKUPCORE&location=file:///mnt/solr_backups/search/search-06-14-2021&wt=javabin&version=2} status=0 QTime=14"},
    "solrmulti03.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0},
      "STATUS":"completed",
      "Response":"TaskId: 100034112630053446666 webapp=null path=/admin/cores params={core=search_shard3_replica_n29&async=100034112630053446666&qt=/admin/cores&name=shard3&action=BACKUPCORE&location=file:///mnt/solr_backups/search/search-06-14-2021&wt=javabin&version=2} status=0 QTime=0"},
    "solrmulti08.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0},
      "STATUS":"completed",
      "Response":"TaskId: 100034112630053465731 webapp=null path=/admin/cores params={core=search_shard4_replica_n23&async=100034112630053465731&qt=/admin/cores&name=shard4&action=BACKUPCORE&location=file:///mnt/solr_backups/search/search-06-14-2021&wt=javabin&version=2} status=0 QTime=0"}},
  "100034112630053395656":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 100034112630053395656 webapp=null path=/admin/cores params={core=search_shard2_replica_n4&async=100034112630053395656&qt=/admin/cores&name=shard2&action=BACKUPCORE&location=file:///mnt/solr_backups/search/search-06-14-2021&wt=javabin&version=2} status=0 QTime=14"},
  "100034112630053446666":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 100034112630053446666 webapp=null path=/admin/cores params={core=search_shard3_replica_n29&async=100034112630053446666&qt=/admin/cores&name=shard3&action=BACKUPCORE&location=file:///mnt/solr_backups/search/search-06-14-2021&wt=javabin&version=2} status=0 QTime=0"},
  "100034112630053465731":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"completed",
    "Response":"TaskId: 100034112630053465731 webapp=null path=/admin/cores params={core=search_shard4_replica_n23&async=100034112630053465731&qt=/admin/cores&name=shard4&action=BACKUPCORE&location=file:///mnt/solr_backups/search/search-06-14-2021&wt=javabin&version=2} status=0 QTime=0"},
  "100034112630053492379":{
    "responseHeader":{
      "status":0,
      "QTime":0},
    "STATUS":"failed",
    "Response":"Failed to backup core=search_shard1_replica_n25 because org.apache.solr.common.SolrException: Directory to contain snapshots doesn't exist: file:///mnt/solr_backups/search/search-06-14-2021. Note that Backup/Restore of a SolrCloud collection requires a shared file system mounted at the same path on all nodes!"},
  "failure":{
    "solrmulti01.DOM.DOMAIN.com:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":0},
      "STATUS":"failed",
      "Response":"Failed to backup core=search_shard1_replica_n25 because org.apache.solr.common.SolrException: Directory to contain snapshots doesn't exist: file:///mnt/solr_backups/search/search-06-14-2021. Note that Backup/Restore of a SolrCloud collection requires a shared file system mounted at the same path on all nodes!"}},
  "status":{
    "state":"failed",
    "msg":"found [1000] in failed tasks"}}
{code}

was (Author: meltingrobot): identical comment, with the output wrapped in {noformat} rather than {code:java}.

> Backups randomly fail sometimes
> -------------------------------
>
>                 Key: SOLR-15371
>                 URL: https://issues.apache.org/jira/browse/SOLR-15371
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: Backup/Restore
>    Affects Versions: 8.5.2, 8.8.2
>            Reporter: Roy Perkins
>            Priority: Major
>
> Hi, we have an issue where sometimes one shard fails to back up due to what
> might be a race condition between creating the folder and starting the backup.
> When this happens, we have to restart the first server in a shard to get the
> backup to succeed again. The cluster backs up to a shared NFS mount. About 4
> out of 5 times the backup completes without issues (there is even another
> collection whose backup runs later in the morning and succeeds, even though
> it uses all the same servers). Below is the error I get.
> {code:java}
> "Response":"Failed to backup core=slprod_shard4_replica_n6 because org.apache.solr.common.SolrException: Directory to contain snapshots doesn't exist: file:///mnt/solr_backups/slprod/slprod-04-25-2021. Note that Backup/Restore of a SolrCloud collection requires a shared file system mounted at the same path on all nodes!"},
> {code}
> And below is the line I use to run the backup (with the bash variables set
> earlier in the script):
> {code:java}
> curl -s "http://localhost:8983/solr/admin/collections?action=BACKUP&name=${COLLECTION}-${DATE}&collection=${COLLECTION}&location=${BACKUP_PATH}&async=1000"
> {code}
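
For anyone trying to reproduce this, the backup call and the status check quoted above could be wrapped roughly as follows. This is a minimal sketch, not the reporter's script and not a fix for the suspected race: it submits the same Collections API BACKUP request, polls REQUESTSTATUS, and prints the cores named in any "Failed to backup core=" message. The helper names, the SOLR_URL variable, the 10-second poll interval, and the epoch-seconds async ID (used instead of the fixed 1000 so each run's status can be told apart) are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only. SOLR_URL, the helper names, and the epoch-based async ID are
# assumptions for illustration, not part of the original backup script.

SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Build the Collections API BACKUP URL (same parameters as the curl line
# in the issue, with the async ID supplied by the caller).
backup_url() {
  local collection="$1" name="$2" location="$3" async_id="$4"
  printf '%s/admin/collections?action=BACKUP&name=%s&collection=%s&location=%s&async=%s' \
    "$SOLR_URL" "$name" "$collection" "$location" "$async_id"
}

# Build the REQUESTSTATUS URL for an async ID.
status_url() {
  printf '%s/admin/collections?action=REQUESTSTATUS&requestid=%s' "$SOLR_URL" "$1"
}

# Read a REQUESTSTATUS response on stdin and print the value of "state".
extract_state() {
  sed -n 's/.*"state"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Read a response on stdin and print each core named in a
# "Failed to backup core=..." message, once.
failed_cores() {
  grep -o 'Failed to backup core=[^ ]*' | sed 's/.*core=//' | sort -u
}

# Submit the backup, then poll until it completes, fails, or times out.
run_backup() {
  local collection="$1" location="$2" max_polls="${3:-60}"
  local name async_id response state i
  name="${collection}-$(date +%m-%d-%Y)"
  async_id="$(date +%s)"   # assumption: one backup per second at most
  curl -s "$(backup_url "$collection" "$name" "$location" "$async_id")" >/dev/null || return 1
  for ((i = 0; i < max_polls; i++)); do
    response="$(curl -s "$(status_url "$async_id")")"
    state="$(printf '%s\n' "$response" | extract_state)"
    case "$state" in
      completed) return 0 ;;
      failed)
        printf '%s\n' "$response" | failed_cores >&2
        return 1 ;;
      notfound) return 1 ;;
      *) sleep 10 ;;   # submitted, running, or no usable answer yet
    esac
  done
  return 1
}
```

A call such as `run_backup search /mnt/solr_backups/search` (collection name and path taken from the output above) would return non-zero and print `search_shard1_replica_n25` on stderr for a response like the one in this comment.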