Hi all,

We are trying to find an expert on Lucene indexing. I hope you might be able to help, or direct us to the most relevant person in the community:
We have designed a Java web platform on top of JBoss Infinispan and Apache Lucene (version ). It is a distributed platform that uses Lucene for full-text indexing and also for indexing structured documents, in order to reflect object relationships and allow complex queries on them.

The actual data is distributed across the cluster nodes and stored in memory (you can see it as a distributed in-memory key/value store). This includes the Lucene index data, which is stored as metadata blocks and chunks (Lucene files are handled as chunks to allow even distribution).

We mainly use one index per object type, with some form of sharding for multi-tenancy. The largest indexes could hold millions of documents (we are investigating more sharding, which could lower this to five-figure numbers). Currently, we go up to about 300,000 documents in one index.

This architecture works well functionally, but now that it is in production we are encountering several issues related to indexing:

- We took care to have only one writer per index in the whole cluster, as required by Lucene, using our distributed locking system. However, Lucene seems to fail on some lower-level locks when preparing to write. We suspect it does something with locking that we don't understand, maybe related to refreshing/merging of documents.
- We see that it merges (and so deletes) documents all the time, which may impact indexing throughput.
- We have the feeling that indexing takes too long in general.

So, we are mainly looking for advice on:

- Better understanding Lucene's locking strategy with regard to reading, writing and refreshing/merging
- Any particular best practices for handling Lucene indexes in a distributed architecture
- Tuning Lucene with regard to throughput, memory/CPU usage, refreshing/merging, data volume and GC

Gilles
