While most of our Lucene indexes are used for more traditional searching, this 
index in particular is used more like a reporting repository. Thus, we really 
do need to have that many fields indexed and they do need to be broken out into 
separate fields. There may be another way to structure the index to reduce the 
number of fields, but I'm hoping we can optimize the current design and avoid 
(yet another) index redesign.

I'll look into the tweaking the merge policy, but I'm more interested in 
disabling norms because scoring really doesn't matter for us. Basically, we 
need nothing more than a binary answer from Lucene: either a record meets the 
provided criteria (which can be a rather complex boolean query with many 
subqueries) or it doesn't. If the record does match, then we get the IDs from 
lucene and run off to get the live data from our primary data store and sort it 
(in Java) based upon criteria provided by the user, not by score.

After our initial design mushroomed in size, we redesigned and now (I thought) 
do not have norms on any of the fields in this index. So, I'm wondering if 
there was something in the results from the CheckIndex that I provided which 
indicates to you that we may have norms still enabled? I know that if you have 
norms on any one document's field, then any other document with that same field 
will get "infected" with norms as well.

My understanding is that any field that uses the constants  
Index.NOT_ANALYZED_NO_NORMS or  Index.NO will not  have norms on it, regardless 
of whether or not the field is stored. Is that not correct?

Thanks,
Mark



On Nov 4, 2010, at 2:56 AM, Michael McCandless wrote:

> Likely what happened is you had a bunch of smaller segments, and then
> suddenly they got merged into that one big segment (_aiaz) in your
> index.
> 
> The representation for norms in particular is not sparse, so this
> means the size of the norms file for a given segment will be
> number-of-unique-indexed-fields X number-of-documents.
> 
> So this count grows quadratically on merge.
> 
> Do these fields really need to be indexed?   If so, it'd be better to
> use a single field for all users for the indexable text if you can.
> 
> Failing that, a simple workaround is to set the maxMergeMB/Docs on the
> merge policy; this'd prevent big segments from being produced.
> Disabling norms should also workaround this, though that will affect
> hit scores...
> 
> Mike
> 
> On Wed, Nov 3, 2010 at 7:37 PM, Mark Kristensson
> <mark.kristens...@smartsheet.com> wrote:
>> Yes, we do have a large number of unique field names in that index, because 
>> they are driven by user named fields in our application (with some cleaning 
>> to remove illegal chars).
>> 
>> This slowness problem has appeared very suddenly in the last couple of weeks 
>> and the number of unique field names has not spiked in the last few weeks. 
>> Have we crept over some threshold with our linear growth in the number of 
>> unique field names? Perhaps there is a limit driven by the amount of RAM in 
>> the machine that we are violating? Are there any guidelines for the maximum 
>> number, or suggested number, of unique fields names in an index or segment? 
>> Any suggestions for potentially mitigating the problem?
>> 
>> Thanks,
>> Mark
>> 
>> 
>> On Nov 3, 2010, at 2:02 PM, Michael McCandless wrote:
>> 
>>> On Wed, Nov 3, 2010 at 4:27 PM, Mark Kristensson
>>> <mark.kristens...@smartsheet.com> wrote:
>>>> 
>>>> I've run checkIndex against the index and the results are below. That net 
>>>> is that it's telling me nothing is wrong with the index.
>>> 
>>> Thanks.
>>> 
>>>> I did not have any instrumentation around the opening of the IndexSearcher 
>>>> (we don't use an IndexReader), just around the actual query execution so I 
>>>> had to add some additional logging. What I found surprised me, opening a 
>>>> search against this index takes the same 6 to 8 seconds that closing the 
>>>> indexWriter takes.
>>> 
>>> IndexWriter opens a SegmentReader for each segment in the index, to
>>> apply deletions, so I think this is the source of the slowness.
>>> 
>>> From the CheckIndex output, it looks like you have many (296,713)
>>> unique fields names on that one large segment -- does that sound
>>> right?  I suspect such a very high field count is the source of the
>>> slowness...
>>> 
>>> Mike
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 

Reply via email to