Sounds like something that would be very useful for folks. I'm sure it'd depend heavily on your data and the type of backup, but I'm curious, if you can share, Pierre: is there a number of cores per node being backed up at which you start to see problems?
Jason

On Wed, Jun 21, 2023 at 8:34 AM Pierre Salagnac <pierre.salag...@gmail.com> wrote:
>
> Thanks for starting this thread, David.
>
> I've been working on this internally, since we have issues (query failures)
> during backups of big collections because of IO saturation.
>
> I see two different approaches to solve this:
> 1. Throttle at the IO level, like David mentioned.
> 2. Limit the number of cores we back up concurrently.
> (These two options are *not* mutually exclusive.)
>
> I've been focused on the second option, to limit the number of concurrent
> backups per node. Currently, the overseer sends shard requests to all
> shards in a simple 'for' loop. If the collection has one thousand shards,
> we'll start one thousand concurrent backups. The idea is to only send
> shard-level requests up to a certain limit per node, and then, each time a
> shard is complete, send the next one for that node.
> If you're interested, I integrated my experiment (for non-incremental
> backups) here:
> https://github.com/psalagnac/solr/commit/c77c94e9a3c20aee3e45ec1198f00ab9cf0f76c5
>
> I don't think backup is the only operation that should be considered.
> Restore, at least, should be too; I'm not sure whether we have other
> IO-intensive operations at the collection level. Ideally, we should have
> something generic rather than handle each type of operation individually.
>
> Thanks
>
>
> On Tue, 20 Jun 2023 at 09:58, Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
>
> > Might be a good question for the users@ list, I guess. I'm sure other users
> > must've thought about this.
> > Cross-posting there, as I'm curious myself too.
> >
> > On Tue, 20 Jun 2023 at 01:07, David Smiley <dsmi...@apache.org> wrote:
> >
> > > Has anyone mitigated the potentially large IO impact of doing a backup of a
> > > large collection, or just in general? If the collection is large enough,
> > > there could very well be many shards on one host, and it could saturate the
> > > IO. I wonder if there should be a rate-limit mechanism or some other
> > > mechanism.
> > >
> > > Not the same, but I know that at the segment level, merges are rate
> > > limited: ConcurrentMergeScheduler doesn't quite let you set the rate but
> > > adjusts it automatically ("ioThrottle" boolean).
> > >
> > > ~ David Smiley
> > > Apache Lucene/Solr Search Developer
> > > http://www.linkedin.com/in/davidwsmiley
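For illustration only, here is a rough Java sketch of the per-node limit Pierre describes: send up to a fixed number of shard-level requests per node, and each time one completes, start the next one queued for that node. It is not the code from the linked commit and not Solr's Overseer API; `PerNodeBackupDispatcher`, `ShardRequest`, `maxPerNode`, and the `sender` callback are hypothetical names, and the sender is assumed to return a future that completes when that shard's backup finishes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;
import java.util.stream.Collectors;

/**
 * Hypothetical sketch: dispatch shard-level backup requests so that at most
 * maxPerNode requests are in flight per node. Not Solr's actual API.
 */
class PerNodeBackupDispatcher {

    /** Hypothetical shard-level request: which shard, hosted on which node. */
    record ShardRequest(String node, String shard) {}

    private final int maxPerNode;
    // Assumed callback: sends one request, completes when that shard is done.
    private final Function<ShardRequest, CompletableFuture<Void>> sender;

    PerNodeBackupDispatcher(int maxPerNode,
                            Function<ShardRequest, CompletableFuture<Void>> sender) {
        if (maxPerNode < 1) throw new IllegalArgumentException("maxPerNode must be >= 1");
        this.maxPerNode = maxPerNode;
        this.sender = sender;
    }

    /** Returns a future that completes once every lane has drained its queue. */
    CompletableFuture<Void> dispatchAll(List<ShardRequest> requests) {
        // Group requests into one work queue per node.
        Map<String, List<ShardRequest>> byNode = requests.stream()
                .collect(Collectors.groupingBy(ShardRequest::node));
        List<CompletableFuture<Void>> lanes = new ArrayList<>();
        for (List<ShardRequest> nodeRequests : byNode.values()) {
            Queue<ShardRequest> queue = new ConcurrentLinkedQueue<>(nodeRequests);
            // Start up to maxPerNode lanes; each lane runs one request at a time
            // and pulls the next one from this node's queue when it finishes.
            int laneCount = Math.min(maxPerNode, nodeRequests.size());
            for (int i = 0; i < laneCount; i++) {
                lanes.add(runLane(queue));
            }
        }
        return CompletableFuture.allOf(lanes.toArray(new CompletableFuture<?>[0]));
    }

    private CompletableFuture<Void> runLane(Queue<ShardRequest> queue) {
        ShardRequest next = queue.poll();
        if (next == null) {
            return CompletableFuture.completedFuture(null);
        }
        // thenComposeAsync avoids deep call stacks if the sender completes synchronously.
        return sender.apply(next).thenComposeAsync(ignored -> runLane(queue));
    }
}
```

Each node gets at most `maxPerNode` lanes, and a lane only issues its next request after the previous one completes, so a node never has more than `maxPerNode` backups in flight regardless of how many shards the collection has. One simplification: if a request fails, that lane stops, so real code would still need a policy for retrying or aborting the remaining shards.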
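Option 1 (throttling at the IO level) could look something like the following Java sketch, which caps the average throughput of a single stream copy at a fixed number of bytes per second by sleeping whenever the copy gets ahead of schedule. `ThrottledCopier` is a hypothetical helper, not a Solr or Lucene API, and unlike the self-adjusting merge throttling in ConcurrentMergeScheduler, the rate here is simply fixed by the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch: copy a stream while capping average throughput at a
 * fixed bytes-per-second rate. Not a Solr or Lucene API.
 */
final class ThrottledCopier {
    private final double bytesPerSecond;

    ThrottledCopier(double bytesPerSecond) {
        if (bytesPerSecond <= 0) throw new IllegalArgumentException("rate must be > 0");
        this.bytesPerSecond = bytesPerSecond;
    }

    /** Copies in to out and returns the number of bytes copied. */
    long copy(InputStream in, OutputStream out) throws IOException, InterruptedException {
        byte[] buf = new byte[8192];
        long start = System.nanoTime();
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
            // Earliest time at which 'total' bytes are allowed at the target rate.
            long targetNanos = start + (long) (total * 1_000_000_000d / bytesPerSecond);
            long sleepNanos = targetNanos - System.nanoTime();
            if (sleepNanos > 0) {
                TimeUnit.NANOSECONDS.sleep(sleepNanos);
            }
        }
        return total;
    }
}
```

Because the schedule is computed from the start time rather than per chunk, small over- or undershoots in individual sleeps don't accumulate. To bound a whole node rather than a single file, the same rate budget would have to be shared across all concurrent copies on that node, which is where this combines naturally with limiting concurrent cores.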