Hi all,

We are trying to find an expert on Lucene indexing. I hope you might be able 
to help or direct us to the most relevant person in the community.

We have designed a Java web platform on top of JBoss Infinispan and Apache 
Lucene (version ).
It is a distributed platform that uses Lucene for full-text indexing and also 
for indexing structured documents, in order to reflect object relationships and 
allow complex queries on them.
The actual data is distributed across cluster nodes and stored in memory (you 
can think of it as a distributed in-memory key/value store). This includes the 
Lucene index data, which is stored as metadata blocks and chunks (Lucene files 
are split into chunks to allow even distribution).
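
To make the chunking idea concrete, here is a simplified sketch in plain Java. 
It is an illustration only, not our production code: the class name, the key 
format and the chunk size are assumptions made up for this message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: names, key format and chunk size are assumptions.
final class FileChunker {

    /** Splits a file's bytes into consecutive chunks of at most chunkSize bytes. */
    static List<byte[]> split(byte[] fileContent, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < fileContent.length; offset += chunkSize) {
            int end = Math.min(offset + chunkSize, fileContent.length);
            chunks.add(Arrays.copyOfRange(fileContent, offset, end));
        }
        return chunks;
    }

    /** Builds the key under which one chunk of an index file is stored. */
    static String chunkKey(String indexName, String fileName, int chunkIndex) {
        return indexName + "|" + fileName + "|" + chunkIndex;
    }
}
```

Each chunk is then an ordinary key/value entry, so the store can spread the 
chunks of a large index file over several nodes.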

We mainly use one index per object type, with some form of sharding for 
multi-tenancy. The largest indexes could hold millions of documents (we are 
investigating further sharding, which could bring this down to five-figure 
counts). Currently, we go up to about 300,000 documents in one index.
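
As a purely illustrative sketch of what routing by object type and tenant can 
look like (the naming scheme and the hash-based shard choice below are 
assumptions for this message, not a description of our exact scheme):

```java
// Hypothetical illustration: the index naming and hashing scheme are assumptions.
final class IndexRouter {

    /** Picks the index for an object type, sharded by tenant. */
    static String indexNameFor(String objectType, String tenantId, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        // floorMod keeps the shard number non-negative even for negative hash codes.
        int shard = Math.floorMod(tenantId.hashCode(), shardCount);
        return objectType + "-shard" + shard;
    }
}
```

With a scheme like this, all documents of one tenant and one object type land 
in the same index, and raising the shard count lowers the number of documents 
per index.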

This architecture works nicely functionally speaking, but now that it is in 
production, we encounter several issues related to indexing:


- We took care of having only one writer per index in the whole cluster, as 
required by Lucene. We did that using our distributed locking system. However, 
it seems that Lucene fails on some lower-level locks when preparing to write. 
We suspect that it does something with locking that we don't understand, 
possibly related to refresh/merge.

- We see that it is merging (and therefore deleting) constantly, which may 
impact indexing throughput.

- We have the feeling that indexing takes too long in general.
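
To show what we mean by "one writer per index", here is a minimal sketch of 
the pattern in plain Java. It is not our actual code: the class is 
hypothetical, and an in-JVM map stands in for the cluster-wide lock table. The 
lower-level Lucene locks mentioned above are not modelled here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Hypothetical illustration: a local map stands in for a distributed lock table.
final class SingleWriterGuard {

    // index name -> id of the node currently allowed to write
    private final ConcurrentMap<String, String> owners = new ConcurrentHashMap<>();

    /** Returns true only if no node currently holds the write lock for this index. */
    boolean tryAcquire(String indexName, String nodeId) {
        return owners.putIfAbsent(indexName, nodeId) == null;
    }

    /** Releases the lock, but only if it is held by the given node. */
    void release(String indexName, String nodeId) {
        owners.remove(indexName, nodeId);
    }

    /** Runs the write work while holding the lock and always releases it afterwards. */
    <T> T withWriter(String indexName, String nodeId, Supplier<T> work) {
        if (!tryAcquire(indexName, nodeId)) {
            throw new IllegalStateException("index " + indexName + " already has a writer");
        }
        try {
            // In the real system, this is where the index writer is opened and used.
            return work.get();
        } finally {
            release(indexName, nodeId);
        }
    }
}
```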

So, we are mainly looking for advice about:

- Better understanding Lucene's locking strategy with regard to reading, 
writing and refreshing/merging

- Any particular best practices for handling Lucene indexes in a distributed 
architecture

- Tuning Lucene with regard to throughput, memory/CPU usage, 
refreshing/merging, data volume and GC

Gilles

