Very large number of indices in distributed environment

Benjamin Reitzammer Thu, 04 Aug 2005 01:54:04 -0700

Hi,
we are in the process of planning a search feature of a product and we
are having quite a hard time figuring out the "right" way to do it.


The requirements for our app are the following:
1) Large number of indices (at _least_ 10000)
2) The amount of data involved per index is not very high, but because
of the number of indices involved the data set will be something about
500 - 1000 GB
3) The searching capabilities must be fail safe, while it's acceptable
if deletes/updates can take some time.
4) The majority of operations will be searching the indices.

I've followed the mailing list intensively the last month and
especially the "Best Practices for Distributing Lucene Indexing and
Searching" (http://marc.theaimsgroup.com/?l=lucene-user&m=110971318020691&w=2)
and "Real time indexing and distribution to lucene on separate boxes (long)"
(http://marc.theaimsgroup.com/?l=lucene-user&m=107900097217474&w=2)
threads provided some interesting insight.

But still our requirements are a bit different.

My thoughts how the above could be handled, so far are:

1) Have one *really big* "master"  which handles all tasks related to
index manipulation. Sync the indices according to Doug's tips
http://marc.theaimsgroup.com/?l=lucene-user&m=110973989200204&w=2 out
to a cluster of slaves that are responsible for searching.
Problem: How to make sure that indices across  slaves are in sync. 
Big Problem: Syncing of this large number of indices will cause a lot
of traffic and cause already quite a load on the slaves (not to speak
of the master)

1.1) Is it safe to _search_ (only) an index mounted via NFS? If yes,
then the search boxes could mount the indices on the master box. But
this solution would probably lead to some serious perfomance issues
because of the needed disk I/O on the master.
Though I'd love to be proven wrong on this one.

2) Split up index collection into smaller portions and distribute a
certain number of indices (~ up to 1000 indices) into smaller
autonomous clusters, that are completely responsible for their
collection of indices.
Problem: How do I keep index distribution dynamic so I don't have to
hardcode where to look for a certain index (that's not a real lucene
issue, but more one of distributed computing, but nevertheless I
thought you guys might know a way to solve it).

Any ideas on this? 
Has anyone ever worked with such a large number of Lucene indices (and
the amount of data it involves)?

I appreciate your help very much.

Cheers

Benjamin

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Very large number of indices in distributed environment

Reply via email to