I'm preparing to help a company run a scalability test and decide whether or not to use Lucene. Relevant particulars for the test include:
1. 2 pairs of indices. Each pair has 1 index with about 7.5 million small documents and 1 index with about 1 million large documents. Each index also carries a substantial number of (small) fields per document in addition to the document content itself.
2. Searching will be done using a node for each index pair -- i.e., the test will use a MultiSearcher accessing the remote indices (see the sketch after this list).
3. Indexing and searching will be done simultaneously -- indexing will be incremental and continual. There are no deletes.
4. The platform is Windows.
5. Both search and indexing time are essential, and so need to be balanced.
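
For concreteness, here's a rough sketch of how I plan to wire the search side, assuming each node exports its index pair over RMI as a RemoteSearchable wrapping a local searcher (the host and binding names below are placeholders):

Searchable pair1 = (Searchable) Naming.lookup("//node1/searchable"); // node serving pair 1 (7.5M small + 1M large docs)
Searchable pair2 = (Searchable) Naming.lookup("//node2/searchable"); // node serving pair 2, same shape
MultiSearcher searcher = new MultiSearcher(new Searchable[] { pair1, pair2 });
Hits hits = searcher.search(query); // hits merged across all the remote indices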


Based on some early measurements with small test sets, but mostly on first principles, I'm thinking of using the settings below. The index will take a long time to create and I'll probably get only one chance to prove what Lucene can do, so I'd appreciate any advice or experience that would suggest different settings:

index.setMaxBufferedDocs(10); // Buffer 10 documents at a time in memory (they could be big)
index.setMaxFieldLength(Integer.MAX_VALUE); // We do the limiting ourselves by what we pass in
index.setMaxMergeDocs(100000); // Yields about 75 large segments for 7.5 million docs, plus log2 smaller segments -- roughly 100 total
index.setMergeFactor(2); // Faster searches due to fewer (small) segments, but slower indexing due to more frequent merging
index.setSimilarity(similarity);
index.setTermIndexInterval(128); // Default. A larger number reduces memory at the cost of slower term access
index.setUseCompoundFile(true); // false could improve performance but will consume more file handles
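
For context, the indexing side will be a continual loop along these lines (feedHasMore() and getNextDocument() are stand-ins for our incremental feed). Since there are no deletes, the writer can stay open indefinitely; the search nodes just reopen their IndexSearchers periodically to pick up newly flushed segments:

IndexWriter index = new IndexWriter(indexDir, analyzer, false); // false = append to the existing index
// ... the settings above are applied here ...
while (feedHasMore()) { // stand-in for our feed
    Document doc = getNextDocument(); // build the next incremental document
    index.addDocument(doc); // buffered per setMaxBufferedDocs, merged per setMergeFactor/setMaxMergeDocs
}
index.close();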


Thanks for any suggestions!

Chuck

