Yes, the number of documents is not too large (about 90 000), but the queries 
are very hard. Although they're just boolean, a typical query can produce a 
result with tens of millions of hits.
Run single-threaded, such a query takes ~20 seconds, which is too slow. Therefore, 
multithreading is vital for this task.

As you mentioned, merges are the source of the non-uniform segment sizes. 
Therefore, as my index is fully static (every time I need to re-index, I can do 
it from scratch), I'm going to try NoMergePolicy with some reasonable 
maximum segment size.
If there are any other multithreading caveats, I'd be glad to hear about them.
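
To put a rough number on why the skewed segments hurt, here is a small sketch that uses only the Java standard library and no Lucene calls. It assumes, purely for illustration, that the cost of searching a segment is proportional to its document count, and that the 13 small segments hold 1 000 docs each (the real ones hold fewer). The class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lucene: estimates the best-case speedup of
// per-segment concurrency under the assumption that cost ~ document count.
final class SegmentSpeedup {

    /** Assigns segments to threads greedily, largest first, and returns the
     *  number of docs handled by the busiest thread. */
    public static long busiestThreadLoad(long[] segmentDocs, int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        long[] sorted = segmentDocs.clone();
        Arrays.sort(sorted); // ascending; iterated from the end below
        PriorityQueue<Long> loads = new PriorityQueue<>(); // min-heap of thread loads
        for (int i = 0; i < threads; i++) {
            loads.add(0L);
        }
        for (int i = sorted.length - 1; i >= 0; i--) {
            long least = loads.poll();    // currently least-loaded thread
            loads.add(least + sorted[i]); // give it the next-largest segment
        }
        long max = 0;
        for (long load : loads) {
            max = Math.max(max, load);
        }
        return max;
    }

    /** Total work divided by the busiest thread's work. */
    public static double estimatedSpeedup(long[] segmentDocs, int threads) {
        long total = 0;
        for (long docs : segmentDocs) {
            total += docs;
        }
        return (double) total / busiestThreadLoad(segmentDocs, threads);
    }

    public static void main(String[] args) {
        long[] skewed = new long[16];
        Arrays.fill(skewed, 1_000L);                 // assumed size of the small segments
        skewed[0] = skewed[1] = skewed[2] = 20_000L; // the three large segments

        long[] uniform = new long[16];
        Arrays.fill(uniform, 5_000L);                // hypothetical evenly sized segments

        System.out.println("skewed:  " + estimatedSpeedup(skewed, 16));
        System.out.println("uniform: " + estimatedSpeedup(uniform, 16));
    }
}
```

Under these assumptions the busiest thread still has to walk a whole 20 000-doc segment, so 16 threads cannot do much better than a 3-4x speedup, while evenly sized segments could in principle scale with the thread count. This ignores per-segment setup and result-merging overhead, so real numbers will be lower.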

-- 
Best Regards,
Igor

02.04.2013, 18:07, "Adrien Grand" <jpou...@gmail.com>:
> On Tue, Apr 2, 2013 at 2:29 PM, Igor Shalyminov
> <ishalymi...@yandex-team.ru> wrote:
>
>>  Hello!
>
> Hi Igor,
>
>>  I have a ~20GB index and try to make a concurrent search over it.
>>  The index has 16 segments, I run SpanQuery.getSpans() on each segment 
>> concurrently.
>>  I see only a very small performance improvement from searching concurrently. I 
>> suppose the reason is that the sizes of the segments are very non-uniform 
>> (3 segments have ~20 000 docs each, and the others have less than 1 000 
>> each).
>>  How can I make more uniformly sized segments (I now use just 
>> writer.forceMerge(16)), and are multiple index segments the most important 
>> thing in Lucene concurrency?
>
> Segments have non-uniform sizes by design. A segment is generated
> every time a flush happens (when the RAM buffer is full or if you
> explicitly call commit). When there are too many segments, Lucene
> merges some of them while new segments keep being generated as you add
> data. So the "flush" segments will always be small while segments
> resulting from a merge will be much larger since they contain data
> from several other segments.
>
> Even if segments are collected concurrently, IndexSearcher needs to
> merge the results of the collection of each segment in the end. Since
> your segments are very small (20000 docs), maybe the cost of
> initialization/merge is not negligible compared to single-segment
> collection.
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
