Hi.

We have an in-house distributed Lucene setup: 40 dual-socket servers with 
approximately 700 cores, divided into 7 partitions. Those machines do index 
search only. Indexes are prepared on several isolated machines (so-called 
Index Masters) and distributed over the cluster with plain rsync.

Search speed is great, but we need more indexing throughput. The Index 
Masters have become CPU-bound lately. The reason is that we use a fairly 
complicated analysis pipeline: a morphological dictionary (rather than 
stemming) plus some NER elements. Right now indexing throughput is about 
1-1.5K documents per second. With a corpus of 140 million documents, a full 
reindex takes about a day. We want better; our current target is >10K 
documents per second. It seems Lucene itself can handle that; it's just that 
our comparatively slow analysis pipeline can't.

So we have a Plan.

The idea is to move the analysis algorithm from the Index Masters to 
dedicated boxes, where it can be scaled easily since it is stateless. The 
problem we're facing is that Lucene at the moment doesn't have a serializable 
Document representation that can be sent over the network.
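
Roughly, what we have in mind for the wire format is something along these 
lines (just a sketch; AnalyzedDocument and the other class names are our own 
placeholders, not existing Lucene classes):

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Serializable result of the analysis pipeline for one document.
public class AnalyzedDocument implements Serializable {
    public final List<AnalyzedField> fields = new ArrayList<AnalyzedField>();

    public static class AnalyzedField implements Serializable {
        public String name;
        // Tokens in index order, as emitted by the analysis chain.
        public List<AnalyzedToken> tokens = new ArrayList<AnalyzedToken>();
        // Original text, if the field should also be stored.
        public String storedValue;
    }

    public static class AnalyzedToken implements Serializable {
        public String term;
        public int positionIncrement;
        public int startOffset;
        public int endOffset;
        // Optional payload bytes (e.g. from the NER stage); may be null.
        public byte[] payload;
    }
}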

We are planning to implement this kind of representation ourselves and, on 
the Index Master side, feed it back into IndexWriter roughly as sketched 
below. The question is: are there any pitfalls or problems we'd better know 
about before starting? :)
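
The consuming side would be a custom TokenStream that replays the 
pre-analyzed tokens, something like the following (a sketch only, assuming 
the attribute-based TokenStream API and the Field(String, TokenStream) 
constructor; payload replay is left out for brevity):

import java.util.Iterator;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Replays pre-analyzed tokens so IndexWriter never runs an Analyzer.
public final class ReplayTokenStream extends TokenStream {
    private final Iterator<AnalyzedDocument.AnalyzedToken> tokens;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    public ReplayTokenStream(AnalyzedDocument.AnalyzedField field) {
        this.tokens = field.tokens.iterator();
    }

    @Override
    public boolean incrementToken() {
        if (!tokens.hasNext()) {
            return false;
        }
        AnalyzedDocument.AnalyzedToken t = tokens.next();
        clearAttributes();
        termAtt.setEmpty().append(t.term);
        posIncAtt.setPositionIncrement(t.positionIncrement);
        offsetAtt.setOffset(t.startOffset, t.endOffset);
        return true;
    }

    // Rebuild a Lucene Document from the deserialized representation
    // (indexed, tokenized, not stored; stored fields omitted here).
    public static Document toDocument(AnalyzedDocument analyzed) {
        Document doc = new Document();
        for (AnalyzedDocument.AnalyzedField f : analyzed.fields) {
            doc.add(new Field(f.name, new ReplayTokenStream(f)));
        }
        return doc;
    }
}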

Denis.