Re: Significant Backup/Restore Performance Degradation for Large Collections

Hakan Özler Mon, 05 Aug 2024 06:30:05 -0700

To track this problem, I've created a ticket in the Solr Jira:
https://issues.apache.org/jira/browse/SOLR-17391


On Thu, 1 Aug 2024 at 18:13, Hakan Özler <ozler.ha...@gmail.com> wrote:

> Hey Kevin, thank you for your interest in this subject.
>
> Was this change tested on a cloud that was also taking active ingest/query
>> requests at the same time as the backup?
>
>
> The test was run on a SolrCloud 9.6.1 cluster with the patch applied,
> managed by the official Solr operator on Amazon EKS. Backups are not
> intended to happen frequently. Instead, we plan to take them within a
> certain time window, so we don't expect intense search traffic during
> backups.
>
> This performance is really exciting, but I'm curious how much burden it
>> puts on CPU and memory.
>
>
> Judging by the CPU usage, I'd say Solr was pretty relaxed during the test.
> Backup and restore do not appear to be CPU-intensive tasks: each node used
> only one core at a time. [2, 3]
>
> Also was this just taking a snapshot backup of the segment files or did
>> this also include uploading to S3?
>
>
> We're using the recommended backup functionality, where Solr uploads
> everything to S3 [1]. During backup and restore operations, the relevant
> metrics looked like this:
>
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running: 5,
>
> Without the patch, the same metrics indicated the following behavior:
>
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core: 0,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running: 1,
>
> With the patch applied, I believe we're back to the behavior of the old
> 9.2.1 version. Setting the parameter to 1 could replicate the behavior of
> the current 9.6.1 version. Restore operations work well too.
> Shall we take this on together?
>
> Hakan
>
> 1.
> https://solr.apache.org/guide/solr/latest/deployment-guide/collection-management.html#backup
> 2. https://imgur.com/a/iK9OFZh
> 3. https://imgur.com/a/tSax2Cj
>
> On Wed, 31 Jul 2024 at 22:24, Kevin Liang (BLOOMBERG/ 919 3RD A) <
> klian...@bloomberg.net> wrote:
>
>> Also was this just taking a snapshot backup of the segment files or did
>> this also include uploading to S3?
>>
>> -Kevin
>>
>> From: users@solr.apache.org At: 07/31/24 15:22:58 UTC-4:00 To:
>> users@solr.apache.org
>> Subject: Re: Significant Backup/Restore Performance Degradation for Large
>> Collections
>>
>> Was this change tested on a cloud that was also taking active ingest/query
>> requests at the same time as the backup? This performance is really exciting,
>> but I'm curious how much burden it puts on CPU and memory.
>>
>> -Kevin
>>
>> From: users@solr.apache.org At: 07/31/24 12:55:33 UTC-4:00 To:
>> users@solr.apache.org
>> Subject: Re: Significant Backup/Restore Performance Degradation for Large
>> Collections
>>
>> Just a heads up: with the patch mentioned above, we managed to back up
>> 3TB of data in 50 minutes with `solr.maxExpensiveTaskThreads=5`. [1]
>>
>> I would like to contribute to Solr; however, I'm unsure which steps I
>> should take if no one is available to take on this patch.
>>
>> 1. https://imgur.com/a/AAd0czU
>>
>> On Tue, 30 Jul 2024 at 16:53, Hakan Özler <ozler.ha...@gmail.com> wrote:
>>
>> > Hi!
>> >
>> > We're experiencing performance issues with backup and restore in the
>> > recent Solr versions 9.5.0 and 9.6.1. In 9.2.1, we could take a backup
>> > of 10TB of data in just one and a half hours. Currently, as of 9.5.0,
>> > taking a backup of the collection takes 7 hours! We're unable to make
>> > use of disaster recovery effectively and reliably in Solr. Therefore,
>> > Solr 9.2.1 remains the most effective choice among the 9.x versions for
>> > our use.
>> >
>> > It seems that this is the ticket causing this issue:
>> > 1. https://issues.apache.org/jira/browse/SOLR-16879
>> >
>> > Interestingly, on 9.2.1 we never encountered the throttling problem
>> > that this change was introduced to solve. From a devops perspective, we
>> > have some details and metrics on these tasks that show the difference
>> > between the two versions. The overall IOPS was 150MB on 9.6.1, while it
>> > was 500MB on 9.2.1 during the same backup and restore tasks. In the
>> > first image [1], the peak on the left represents a backup; in contrast,
>> > in the second image [2], the same backup operation in 9.5.0 uses fewer
>> > resources. As you may spot, 9.5.0 seems to be using a fifth of the
>> > resources of 9.2.1.
>> >
>> > Apart from that, while monitoring the operations, I had some difficulty
>> > interpreting the following metrics:
>> >
>> >
>> > "ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core": 0,
>> > "ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max": 5,
>> > "ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size": 1,
>> > "ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running": 1,
>> >
>> > The pool size was 1 although the maximum pool size is 5. Shouldn't the
>> > pool size be 5 instead? If I'm not mistaken, there is always only one
>> > task running on a single node, not 5 concurrently.
>> >
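
A note on the pool size question above. A plain java.util.concurrent.ThreadPoolExecutor with a core size of 0 and an unbounded queue only ever starts one worker thread, because threads beyond the core size are created only when the queue refuses a task. The sketch below reproduces that with the numbers from the metrics (core 0, max 5). It assumes the Solr executor is built this way, which the metrics suggest but which is not confirmed here; `PoolSizeDemo` and `observedPoolSize` are illustrative names, not Solr code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolSizeDemo {
    // Submits `tasks` blocking jobs and returns how many worker threads the pool started.
    static int observedPoolSize(int core, int max, int tasks) {
        // Assumption: an unbounded queue, so offer() never fails and the
        // executor never grows past the core size (or past 1 when core is 0).
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the worker busy so later tasks must queue
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getPoolSize();
        } finally {
            release.countDown(); // let the queued tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("core=0, max=5: pool size " + observedPoolSize(0, 5, 5));
        System.out.println("core=5, max=5: pool size " + observedPoolSize(5, 5, 5));
    }
}
```

With five tasks submitted, the first configuration reports a pool size of 1 and the second a pool size of 5, which matches the `pool.size` and `running` values seen without and with the patch.
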
>> > I was also wondering whether the maximum thread count, which is
>> > currently 5 in 9.4+, could be made configurable with either an
>> > environment variable or a Java parameter. The part that needs to be
>> > changed seems to be in CoreAdminHandler.java on line 446 [3]. I've made
>> > a small adjustment to add a Solr parameter called
>> > `solr.maxExpensiveTaskThreads` for those who want to set a different
>> > thread count for expensive tasks. The number given in this parameter
>> > must meet the criteria of ThreadPoolExecutor; otherwise an
>> > IllegalArgumentException will occur. I've generated a patch [4], and I
>> > would love for one of the Solr committers to take this on and apply it
>> > for the upcoming release. Do you think our observation is accurate, and
>> > would this patch be feasible to implement?
>> >
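
The parameter described above could be wired up roughly as follows. This is a hypothetical sketch, not the contents of the patch [4]: the class and method names are invented for illustration, and only the property name `solr.maxExpensiveTaskThreads`, the default of 5, and the ThreadPoolExecutor constraint come from this thread. It also assumes a fixed-size pool (core equal to max), which the with-patch metrics (core 5, max 5) suggest.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ExpensiveTaskPool {
    static final String PROP = "solr.maxExpensiveTaskThreads";
    static final int DEFAULT_THREADS = 5; // current hard-coded maximum, per the message

    // Reads the JVM system property (e.g. -Dsolr.maxExpensiveTaskThreads=8).
    // Integer.getInteger falls back to the default when the property is
    // missing or is not a valid integer.
    static int configuredThreads() {
        return Integer.getInteger(PROP, DEFAULT_THREADS);
    }

    // Builds a fixed-size pool. The ThreadPoolExecutor constructor throws
    // IllegalArgumentException when the maximum pool size is zero or negative.
    static ThreadPoolExecutor newPool(int threads) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        pool.allowCoreThreadTimeOut(true); // let idle workers exit between backups
        return pool;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = newPool(configuredThreads());
        System.out.println("expensive-task threads: " + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}
```

Setting the property to 1 in this sketch gives a single worker, which corresponds to the one-task-at-a-time behavior observed on 9.6.1.
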
>> > Thanks!
>> > Hakan
>> >
>> > 1. https://i.imgur.com/aSrs8OM.png
>> > 2. https://i.imgur.com/Yr6hBM8.png
>> > 3. https://github.com/apache/solr/commit/82a847f0f9af18d6eceee18743d636db7a879f3e#diff-5bc3d44ca8b189f44fe9e6f75af8a5510463bdba79ff72a7d0ed190973a32533L446
>> > 4. https://gist.github.com/ozlerhakan/e4d11bddae6a2f89d2c212c220f4c965
>> >
>> >
>>
>>
>>
