I'm already late to the party, but +1 on Mike McCandless's comment on RAM. There have been a number of efforts in the past to move data structures off-heap. FWIW, you can use the Lucene FST (a trie-based data structure used in many different places, including the term index and synonym dictionaries) to build a large (>>32GB) FST with only a small amount of heap (a few to tens of MB, depending on how aggressively we want to minimize the FST). That mode has not been incorporated into Lucene's default index codec yet, though. In practice the FST-based term index usually doesn't require much memory anyway: the index is split into smaller segments that operate independently, and each segment's FST stores only prefixes of terms rather than the whole terms.
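To make the prefix-sharing point concrete, here is a tiny self-contained sketch (a plain trie, not Lucene's actual FST, which additionally shares suffixes and serializes to a compact byte form; the class and term list are made up for illustration). Inserting four terms that share the prefix "index" stores those five characters once:

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixTrieDemo {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
    }

    // Counts every node, including the root.
    static int nodeCount = 1;

    static void insert(Node root, String term) {
        Node cur = root;
        for (char c : term.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> {
                nodeCount++;
                return new Node();
            });
        }
    }

    public static void main(String[] args) {
        Node root = new Node();
        String[] terms = {"index", "indexes", "indexing", "indexed"};
        int totalChars = 0;
        for (String t : terms) {
            insert(root, t);
            totalChars += t.length();
        }
        // "index" (5 chars) is stored once; only the distinct suffix
        // characters ("es", "ing", "d") add nodes beyond it.
        System.out.println("chars=" + totalChars + " trieNodes=" + (nodeCount - 1));
    }
}
```

Here 27 input characters collapse to 11 trie nodes; over millions of real terms with long shared prefixes, that compression is what keeps the term index small.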
One caveat: I don't think vector index *building* is off-heap yet, though it could be (or would we be better off using an on-disk data structure directly?). Also, even at search time, "off-heap" means relying on OS memory mapping to keep hot / frequently accessed pages in memory, so having enough RAM is still critical. How much depends on the access pattern (how frequently specific portions of the index on disk are touched), which needs to be tuned per system.

On Fri, Jan 5, 2024 at 3:51 Ralf Heyde <ralf.he...@gmx.de.invalid> wrote:

> Hi Vincent,
>
> My 2 cents:
>
> We had a production environment with ~250 GB and ~1M docs with static +
> dynamic fields in Solr (AFAIR Lucene 7), on a machine with 4 GB for the
> JVM and (AFAIR) a bit more, maybe 6 GB, for the OS cache.
> At peak times (re-index) we had 10-15k updates/minute and (partially)
> complex queries at up to 50/sec per JVM. Back then our servers still had
> rotating disks.
>
> With this setup we did not experience any performance issues, as long as
> we had no bugs / misconfigurations.
>
> We considered sharding / splitting the indexes, but did not do it, due to
> the complexity of maintaining them later - AND, especially, because there
> was NO NEED at all.
>
> Elasticsearch/Solr started to do this out of the box around that time.
> Maybe Kibana/ELK or similar is worth a look too.
>
> Cheers from Berlin, Ralf
>
> Sent from my phone; I can't rule out the occasional typo.
>
> > On 04.01.2024 at 17:32, Michael McCandless <luc...@mikemccandless.com> wrote:
> >
> > Hi Vincent,
> >
> > Lucene has a hard limit of ~2.1 B documents in a single index; hopefully
> > you hit the ~50 - 100 GB limit well before that.
> >
> > Otherwise it's very application dependent: how much latency can you
> > tolerate during searching, how fast are the underlying IO devices at
> > random and large sequential IO, the types of queries, etc.
> > Lucene should not require much additional RAM as the index gets larger --
> > much work has been done in recent years to move data structures off-heap.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >> On Tue, Jan 2, 2024 at 9:49 AM <vvse...@gmail.com> wrote:
> >>
> >> Hello,
> >>
> >> is there a recommended / rule-of-thumb maximum size for an index?
> >> I try to target between 50 and 100 GB before spreading to other servers.
> >> Or is this just a matter of how much memory and CPU I have?
> >> This is a log aggregation use case: a lot of writes and, obviously, a
> >> smaller number of reads.
> >> I am using Lucene 9.
> >> Thanks,
> >> Vincent
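Back to the page-cache point above: here is a minimal stdlib-only sketch (java.nio, no Lucene; the class name and file layout are made up for illustration) of what "off-heap" reads look like. The mapped bytes live in the OS page cache, not on the JVM heap; the first touch of a page may fault it in from disk, while hot pages are served from memory:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    // Exposed so the result of the scan can be inspected after main runs.
    static long lastChecksum;

    public static void main(String[] args) throws IOException {
        // Write a small stand-in for an index segment file.
        Path file = Files.createTempFile("segment", ".dat");
        byte[] data = new byte[1 << 20]; // 1 MB
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 128);
        Files.write(file, data);

        // Map the file read-only: no heap allocation for the file contents.
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // Touch one byte per 4 KB page, like a sparse random-access
            // pattern over an index; the OS decides what stays resident.
            long sum = 0;
            for (int i = 0; i < buf.limit(); i += 4096) {
                sum += buf.get(i);
            }
            lastChecksum = sum;
            System.out.println("checksum=" + sum);
        }
        Files.deleteIfExists(file);
    }
}
```

This is why "enough RAM" still matters with off-heap indexes: if the working set of pages your queries touch exceeds free memory, those `buf.get` calls turn into disk reads.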