To address the problem, I've created a corresponding ticket on the Solr Jira: https://issues.apache.org/jira/browse/SOLR-17391
On Thu, 1 Aug 2024 at 18:13, Hakan Özler <ozler.ha...@gmail.com> wrote:

> Hey Kevin, thank you for your interest in this subject.
>
>> Was this change tested on a cloud that was also taking active ingest/query
>> requests at the same time as the backup?
>
> The test was completed on a SolrCloud 9.6.1 cluster with the patch applied,
> managed by the official Solr operator on Amazon EKS. The backup strategy is
> not intended to happen frequently. Instead, we plan to take some backups
> over a certain period of time, so we don't expect intense search traffic
> during backups.
>
>> This performance is really exciting, but I'm curious how much burden it
>> puts on CPU and memory.
>
> I'd say that Solr was pretty relaxed during the test based on the CPU
> usage. It looks like backup and restore are not CPU-intensive tasks. Each
> node used only one core at a time. [2, 3]
>
>> Also was this just taking a snapshot backup of the segment files or did
>> this also include uploading to S3?
>
> We're using the recommended backup functionality, where Solr uploads
> everything to S3. [1] During backup and restore ops, the relevant metrics
> looked like this:
>
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running: 5,
>
> Without the patch, the metrics indicated the following behavior:
>
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core: 0,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running: 1,
>
> Given that we have the patch, I believe we're back to the old 9.2.1
> behavior.
> Setting the parameter to 1 should replicate the current 9.6.1 behavior.
> Restore operations work well too.
> Shall we take this on together?
>
> Hakan
>
> 1. https://solr.apache.org/guide/solr/latest/deployment-guide/collection-management.html#backup
> 2. https://imgur.com/a/iK9OFZh
> 3. https://imgur.com/a/tSax2Cj
>
> On Wed, 31 Jul 2024 at 22:24, Kevin Liang (BLOOMBERG/ 919 3RD A) <klian...@bloomberg.net> wrote:
>
>> Also was this just taking a snapshot backup of the segment files or did
>> this also include uploading to S3?
>>
>> -Kevin
>>
>> From: users@solr.apache.org At: 07/31/24 15:22:58 UTC-4:00
>> To: users@solr.apache.org
>> Subject: Re: Significant Backup/Restore Performance Degradation for Large Collections
>>
>> Was this change tested on a cloud that was also taking active ingest/query
>> requests at the same time as the backup? This performance is really exciting,
>> but I'm curious how much burden it puts on CPU and memory.
>>
>> -Kevin
>>
>> From: users@solr.apache.org At: 07/31/24 12:55:33 UTC-4:00
>> To: users@solr.apache.org
>> Subject: Re: Significant Backup/Restore Performance Degradation for Large Collections
>>
>> Just a heads up, with the patch mentioned above, we managed to back up
>> 3TB of data in 50 minutes with `solr.maxExpensiveTaskThreads=5` [1]
>>
>> I would like to contribute to Solr, however, I'm unsure of the steps I
>> should take if no one is available to take on this patch.
>>
>> 1. https://imgur.com/a/AAd0czU
>>
>> On Tue, 30 Jul 2024 at 16:53, Hakan Özler <ozler.ha...@gmail.com> wrote:
>>
>> > Hi!
>> >
>> > We're experiencing performance issues in the recent Solr versions, 9.5.0
>> > and 9.6.1, regarding backup and restore. In 9.2.1, we could take a backup
>> > of 10TB of data in just one and a half hours. Currently, as of 9.5.0,
>> > taking a backup of the collection takes 7 hours! We're unable to make use
>> > of disaster recovery effectively and reliably in Solr.
>> > Therefore, Solr 9.2.1 still remains the most effective choice among the
>> > 9.x versions for our use.
>> >
>> > It seems that this is the ticket causing this issue:
>> > 1. https://issues.apache.org/jira/browse/SOLR-16879
>> >
>> > Interestingly, on 9.2.1 we never encountered the throttling problem that
>> > this change was introduced to solve. From a devops perspective, we have
>> > some details and metrics on these tasks to show the difference between
>> > the two versions. The overall IOPS was 150MB on 9.6.1, while it was 500MB
>> > on 9.2.1 during the same backup and restore tasks. In the first image
>> > [1], the peak on the left represents a backup. In contrast, in the second
>> > image [2], the same backup operation in 9.5.0 uses fewer resources. As
>> > you may spot, 9.5.0 seems to be using a fifth of the resources of 9.2.1.
>> >
>> > Apart from that, while monitoring some relevant metrics during the
>> > operations, I had some difficulty interpreting the following:
>> >
>> > "ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core": 0,
>> > "ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max": 5,
>> > "ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size": 1,
>> > "ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running": 1,
>> >
>> > The pool size was 1 although the pool max size is 5. Shouldn't the pool
>> > size be 5 instead? In practice, there is always only one task running on
>> > a single node, not 5 concurrently, if I'm not mistaken.
>> >
>> > I was also wondering if the max thread size, which is currently 5 in
>> > 9.4+, could be made configurable with either an environment variable or
>> > a Java parameter?
>> > The part that needs to be changed seems to be in CoreAdminHandler.java
>> > on line 446 [3]. I've made a small adjustment to add a Solr parameter
>> > called `solr.maxExpensiveTaskThreads` for those who want to set a
>> > different thread size for expensive tasks. The number given in this
>> > parameter must meet the criteria of ThreadPoolExecutor, otherwise an
>> > IllegalArgumentException will occur. I've generated a patch [4] and I
>> > would love to see if someone from the Solr committers would take this on
>> > and apply it for the upcoming release. Do you think our observation is
>> > accurate and would this patch be feasible to implement?
>> >
>> > Thanks!
>> > Hakan
>> >
>> > 1. https://i.imgur.com/aSrs8OM.png
>> > 2. https://i.imgur.com/Yr6hBM8.png
>> > 3. https://github.com/apache/solr/commit/82a847f0f9af18d6eceee18743d636db7a879f3e#diff-5bc3d44ca8b189f44fe9e6f75af8a5510463bdba79ff72a7d0ed190973a32533L446
>> > 4. https://gist.github.com/ozlerhakan/e4d11bddae6a2f89d2c212c220f4c965
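
The metrics reported in this thread (pool.core 0, pool.max 5, pool.size 1 without the patch; 5, 5, 5 with it) are consistent with how a plain `java.util.concurrent.ThreadPoolExecutor` behaves when it is backed by an unbounded queue: threads beyond the core size are only created when the queue refuses a task, which an unbounded queue never does. The following is a minimal sketch of that behavior and not Solr's actual code. The class name, the helper `poolSizeUnderLoad`, the unbounded `LinkedBlockingQueue`, and reading `solr.maxExpensiveTaskThreads` through `Integer.getInteger` are all assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExpensiveTaskPoolDemo {

  /**
   * Submits `tasks` blocking tasks to a fresh pool and returns how many
   * worker threads the pool created to serve them.
   */
  public static int poolSizeUnderLoad(int corePoolSize, int maxPoolSize, int tasks) {
    // Throws IllegalArgumentException if maxPoolSize <= 0 or maxPoolSize < corePoolSize.
    // The unbounded queue is an assumption of this sketch, not taken from the thread.
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        corePoolSize, maxPoolSize, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    CountDownLatch release = new CountDownLatch(1);
    try {
      for (int i = 0; i < tasks; i++) {
        pool.execute(() -> {
          try {
            release.await(); // stand-in for a long-running backup or restore task
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // Workers are registered before execute() returns, so this reading is stable.
      return pool.getPoolSize();
    } finally {
      release.countDown(); // unblock the tasks so the pool can drain
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    // Property name comes from the thread; the way it is read here is hypothetical.
    int threads = Integer.getInteger("solr.maxExpensiveTaskThreads", 5);

    // core = 0: every task after the first is queued, so only one thread ever runs.
    System.out.println("core=0, max=" + threads + " -> pool size "
        + poolSizeUnderLoad(0, threads, threads));

    // core = max: a new thread is started per task until the limit is reached.
    System.out.println("core=max=" + threads + " -> pool size "
        + poolSizeUnderLoad(threads, threads, threads));
  }
}
```

Run with the default of 5, the first configuration should report a pool size of 1 and the second a pool size of 5, which lines up with the metrics quoted above. Passing `-Dsolr.maxExpensiveTaskThreads=0` makes the `ThreadPoolExecutor` constructor throw `IllegalArgumentException`, which is the constraint mentioned in the original message.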