Right, we do analyze a number of fields. We use the WhitespaceAnalyzer whenever we have a text field, so maybe 5 analyzed fields per document on average. It can be more, of course.
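For what it's worth, here is a rough, untested sketch of how norms could be skipped on those analyzed text fields to avoid the per-field, per-document norms byte Tom describes below (assuming the Lucene 2.9/3.0 Field API; the field names are just placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class NoNormsFields {

    // Build a document whose analyzed text fields omit norms, so they
    // contribute no bytes to the .nrm file. Field names are placeholders.
    static Document build(String title, String body) {
        Document doc = new Document();

        // Tokenized by the analyzer, but indexed without norms:
        doc.add(new Field("body", body,
                Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));

        // Equivalent alternative: create the field normally, then flag it:
        Field titleField = new Field("title", title,
                Field.Store.YES, Field.Index.ANALYZED);
        titleField.setOmitNorms(true);
        doc.add(titleField);

        return doc;
    }
}

The trade-off is that length normalization and index-time field boosts are lost for those fields, and norms only stay off for a field if every document omits them.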
Thanks,
Jason Tesser
dotCMS Lead Development Manager
1-305-858-1422


On Wed, Dec 30, 2009 at 10:44 PM, Tom Hill <solr-l...@worldware.com> wrote:
> Hi -
>
> One thing to consider is field norms. If your fields aren't analyzed, this
> doesn't apply to you.
>
> But if you do have norms, I believe it's one byte per field with norms x
> the number of documents. It doesn't matter whether the field occurs in a
> given document or not; it's nTotalFields x nDocs.
>
> So, an index with 10,000 documents, with one field each, the same field
> for all docs:
>
> -rw-r--r--  1 tom  wheel    10004 Dec 30 18:54 _0.nrm
>
> A 10,000-doc index where each doc still has only one field, but that field
> is one of 100 different field names:
>
> -rw-r--r--  1 tom  wheel  1000004 Dec 30 18:55 _0.nrm
>
> As you can see, your total space for norms goes up linearly with the
> number of fields.
>
> Tom
>
> On Wed, Dec 30, 2009 at 10:19 AM, Renaud Delbru <renaud.del...@deri.org> wrote:
>
>> Hi,
>>
>> just sharing some personal experience in this domain.
>>
>> We performed some benchmarks in a similar setup (indexing millions of
>> documents with thousands of fields) to measure the impact of a large
>> number of fields on a Lucene index.
>> We observed that the more fields you have, the larger the dictionary
>> becomes. In fact, Lucene builds the term index by concatenating the field
>> name with the terms associated with that field. In the worst case (when
>> every term occurs in every field), you can have M fields * N terms
>> entries in the dictionary.
>> As a consequence, a term lookup in the dictionary will take longer. This
>> can have a significant impact when the cache is cold. When the cache is
>> warm (when parts of the dictionary are in memory), the time overhead is
>> not significant, possibly even nil.
>>
>> If, in your data collection, the majority of terms are local to one field
>> (i.e., a term occurs in only one single field, like for example the
>> timestamp of the document), your dictionary will not grow that much and
>> you will probably not notice the increase in dictionary lookup time.
>>
>> And, as Erick explained previously, it also depends on the size of your
>> index and on your use case. If the index is relatively small, the
>> overhead will be imperceptible. However, if your index is large (millions
>> of documents), you will probably notice the overhead the first time a
>> query is executed (cold cache).
>>
>> I haven't tested with Lucene 3.0, but I have read somewhere (correct me
>> if I am wrong) that this new version includes some optimisations for
>> dictionary lookups, which should minimize the overhead.
>> --
>> Renaud Delbru
>>
>>
>> On 30/12/09 16:18, Jason Tesser wrote:
>>
>>> I have a situation where I might have 1000 different types of Lucene
>>> Documents, each with 10 or so fields with different names that get
>>> indexed.
>>>
>>> I am wondering if this is bad to do within Lucene. I end up with
>>> 10,000 fields within the index, although any given document has only
>>> 10 or so.
>>>
>>> I was hoping not to have to have many indexes under the covers if I
>>> can avoid it, but I don't want performance to suffer either.
>>>
>>> Any thoughts?
>>>
>>> Thanks,
>>> Jason Tesser
>>> dotCMS Lead Development Manager
>>> 1-305-858-1422
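And to make Renaud's point about the dictionary concrete: the term index is keyed by (field, term) pairs, so the same token indexed under many different field names yields many separate dictionary entries. A quick, unpolished sketch against the Lucene 3.0 TermEnum API (the index path is a placeholder) that counts dictionary entries per field:

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class TermsPerField {

    public static void main(String[] args) throws Exception {
        // Open the index read-only; the path is a placeholder.
        IndexReader reader = IndexReader.open(
                FSDirectory.open(new File("/path/to/index")), true);

        // Walk the whole term dictionary, counting entries per field.
        // With M fields sharing the same N terms this approaches M * N.
        Map<String, Integer> perField = new HashMap<String, Integer>();
        TermEnum terms = reader.terms();
        try {
            while (terms.next()) {
                Term t = terms.term();
                Integer n = perField.get(t.field());
                perField.put(t.field(), n == null ? 1 : n + 1);
            }
        } finally {
            terms.close();
            reader.close();
        }

        for (Map.Entry<String, Integer> e : perField.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue() + " entries");
        }
    }
}

On an index with 10,000 sparse field names, the per-field counts should show where the dictionary growth (and the cold-cache lookup cost Renaud mentions) is coming from.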