Hi. We have an in-house distributed Lucene setup: 40 dual-socket servers, approximately 700 cores in total, divided into 7 partitions. Those machines do index search only. Indexes are prepared on a few isolated machines (the so-called Index Masters) and distributed over the cluster with plain rsync.
Search speed is great, but we need more indexing throughput; the Index Masters have become CPU-bound lately. The reason is our rather complicated analysis pipeline, which uses a morphological dictionary (as opposed to stemming) plus some NER elements. Right now indexing throughput is about 1-1.5K documents per second, so with a corpus of 140 million documents a full reindex takes roughly a day. We want better: our current target is >10K documents per second. Lucene itself seems able to handle that; it is just our comparatively slow analysis pipeline that can't.

So we have a plan: move the analysis step off the Index Masters onto dedicated boxes, where it can be scaled out easily since it is stateless. The problem we are facing is that Lucene currently has no serializable Document representation that can be sent over the network, so we are planning to implement one ourselves.

The question: are there any pitfalls or problems we should know about before starting? :)

Denis.
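P.S. To make the idea concrete, here is a rough sketch of the direction we are considering. Class names like SerializableToken and CachedTokenStream are just our placeholders, not Lucene classes, and the sketch assumes a Lucene version with the attribute-based TokenStream API (CharTermAttribute and friends). The idea: on the analysis boxes, flatten each field's TokenStream into plain term/offset/position-increment tuples, ship those over the wire, and replay them on the Index Master through a trivial TokenStream so the IndexWriter never runs the heavy analyzers.

import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** One analyzed token, flattened to plain values so it can go over the wire. */
public class SerializableToken implements Serializable {
    public final String term;
    public final int startOffset;
    public final int endOffset;
    public final int positionIncrement;

    public SerializableToken(String term, int startOffset, int endOffset,
                             int positionIncrement) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.positionIncrement = positionIncrement;
    }

    /** Analysis box side: run the expensive pipeline once and capture its output. */
    public static List<SerializableToken> capture(TokenStream ts) throws IOException {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
        PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
        List<SerializableToken> out = new ArrayList<SerializableToken>();
        ts.reset();
        while (ts.incrementToken()) {
            out.add(new SerializableToken(term.toString(),
                    off.startOffset(), off.endOffset(),
                    posInc.getPositionIncrement()));
        }
        ts.end();
        ts.close();
        return out;
    }
}

/** Index Master side: replay the captured tokens without re-running analysis. */
final class CachedTokenStream extends TokenStream {
    private final List<SerializableToken> tokens;
    private int index = 0;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute posIncAtt =
            addAttribute(PositionIncrementAttribute.class);

    CachedTokenStream(List<SerializableToken> tokens) {
        this.tokens = tokens;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        index = 0;
    }

    @Override
    public boolean incrementToken() {
        if (index >= tokens.size()) {
            return false;
        }
        clearAttributes();
        SerializableToken t = tokens.get(index++);
        termAtt.setEmpty().append(t.term);
        offsetAtt.setOffset(t.startOffset, t.endOffset);
        posIncAtt.setPositionIncrement(t.positionIncrement);
        return true;
    }
}

On the Index Master we would then wrap the replayed stream in a Field using the constructor that takes a TokenStream (depending on the Lucene version, e.g. Field(String, TokenStream) or TextField(String, TokenStream)), so addDocument() indexes the pre-analyzed content directly. Stored field values, payloads and term vectors would of course need their own handling on top of this.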