Right, we do analyze a number of fields. We use the WhitespaceAnalyzer whenever we have a text field, so maybe 5 analyzed fields per document on average. It can be more, of course.
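For what it's worth, here is a rough, untested sketch of how norms could be skipped on those analyzed text fields to avoid the per-field, per-document norms byte Tom describes below (assuming the Lucene 2.9/3.0 Field API; the field names are just placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class NoNormsFields {

    // Build a document whose analyzed text fields omit norms, so they
    // contribute no bytes to the .nrm file. Field names are placeholders.
    static Document build(String title, String body) {
        Document doc = new Document();

        // Tokenized by the analyzer, but indexed without norms:
        doc.add(new Field("body", body,
                Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));

        // Equivalent alternative: create the field normally, then flag it:
        Field titleField = new Field("title", title,
                Field.Store.YES, Field.Index.ANALYZED);
        titleField.setOmitNorms(true);
        doc.add(titleField);

        return doc;
    }
}

The trade-off is that length normalization and index-time field boosts are lost for those fields, and norms only stay off for a field if every document omits them.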
Thanks,
Jason Tesser
dotCMS Lead Development Manager
1-305-858-1422


On Wed, Dec 30, 2009 at 10:44 PM, Tom Hill <solr-l...@worldware.com> wrote:
> Hi -
>
> One thing to consider is field norms. If your fields aren't analyzed, this
> doesn't apply to you.
>
> But if you do have norms, I believe it's one byte per field with norms x
> the number of documents. It doesn't matter whether the field occurs in a
> given document or not; it's nTotalFields x nDocs.
>
> So, an index with 10,000 documents, with one field each, the same field
> for all docs:
>
> -rw-r--r--  1 tom  wheel    10004 Dec 30 18:54 _0.nrm
>
> A 10,000-doc index where each doc still has only one field, but that field
> is one of 100 different field names:
>
> -rw-r--r--  1 tom  wheel  1000004 Dec 30 18:55 _0.nrm
>
> As you can see, your total space for norms goes up linearly with the
> number of fields.
>
> Tom
>
> On Wed, Dec 30, 2009 at 10:19 AM, Renaud Delbru <renaud.del...@deri.org> wrote:
>
>> Hi,
>>
>> just sharing some personal experience in this domain.
>>
>> We performed some benchmarks in a similar setup (indexing millions of
>> documents with thousands of fields) to measure the impact of a large
>> number of fields on a Lucene index.
>> We observed that the more fields you have, the larger the dictionary
>> becomes. In fact, Lucene builds the term index by concatenating the field
>> name with the terms associated with that field. In the worst case (when
>> every term occurs in every field), you can have M fields * N terms
>> entries in the dictionary.
>> As a consequence, a term lookup in the dictionary will take longer. This
>> can have a significant impact when the cache is cold. When the cache is
>> warm (when parts of the dictionary are in memory), the time overhead is
>> not significant, possibly even nil.
>>
>> If, in your data collection, the majority of terms are local to one field
>> (i.e., a term occurs in only one single field, like for example the
>> timestamp of the document), your dictionary will not grow that much and
>> you will probably not notice the increase in dictionary lookup time.
>>
>> And, as Erick explained previously, it also depends on the size of your
>> index and on your use case. If the index is relatively small, the
>> overhead will be imperceptible. However, if your index is large (millions
>> of documents), you will probably notice the overhead the first time a
>> query is executed (cold cache).
>>
>> I haven't tested with Lucene 3.0, but I have read somewhere (correct me
>> if I am wrong) that this new version includes some optimisations for
>> dictionary lookups, which should minimize the overhead.
>> --
>> Renaud Delbru
>>
>>
>> On 30/12/09 16:18, Jason Tesser wrote:
>>
>>> I have a situation where I might have 1000 different types of Lucene
>>> Documents, each with 10 or so fields with different names that get
>>> indexed.
>>>
>>> I am wondering if this is bad to do within Lucene. I end up with
>>> 10,000 fields within the index, although any given document has only
>>> 10 or so.
>>>
>>> I was hoping not to have to have many indexes under the covers if I
>>> can avoid it, but I don't want performance to suffer either.
>>>
>>> Any thoughts?
>>>
>>> Thanks,
>>> Jason Tesser
>>> dotCMS Lead Development Manager
>>> 1-305-858-1422
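And to make Renaud's point about the dictionary concrete: the term index is keyed by (field, term) pairs, so the same token indexed under many different field names yields many separate dictionary entries. A quick, unpolished sketch against the Lucene 3.0 TermEnum API (the index path is a placeholder) that counts dictionary entries per field:

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class TermsPerField {

    public static void main(String[] args) throws Exception {
        // Open the index read-only; the path is a placeholder.
        IndexReader reader = IndexReader.open(
                FSDirectory.open(new File("/path/to/index")), true);

        // Walk the whole term dictionary, counting entries per field.
        // With M fields sharing the same N terms this approaches M * N.
        Map<String, Integer> perField = new HashMap<String, Integer>();
        TermEnum terms = reader.terms();
        try {
            while (terms.next()) {
                Term t = terms.term();
                Integer n = perField.get(t.field());
                perField.put(t.field(), n == null ? 1 : n + 1);
            }
        } finally {
            terms.close();
            reader.close();
        }

        for (Map.Entry<String, Integer> e : perField.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue() + " entries");
        }
    }
}

On an index with 10,000 sparse field names, the per-field counts should show where the dictionary growth (and the cold-cache lookup cost Renaud mentions) is coming from.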