documents with large numbers of fields

charlie w Fri, 18 May 2007 13:02:05 -0700

Hi all,

I am trying to create an index where I can apply specific boost values to
documents, depending on the notion of something like a multiple tags for
documents.  Each tag would wind up having a different boost value in each
document.


My original approach was to add many fields with the same name, and the
value of each field would be the tag.  I applied a different boost to each
tag.  So something like:
tag=foo        ^2
tag=bar       ^1.2
tag=foobar   ^1.8
searching "tag:bar", for example.  Different documents might have different
boost values for "tag=bar", influencing the scoring.  A document might have
hundreds of these tags.  I might be searching for hundreds of different
values in these tag fields.

I have discovered this doesn't work, seemingly because of the way
DefaultSimilarity creates the fieldNorm.  It seems like internally Lucene is
effectively combining these into one tag field, with 3 terms and a boost
value that is the product of each of the 3 boost values.  If id did indeed
have a document with many many of these tag fields I'd wind up with a
freakishly huge boost on that document.

So now I have the idea to invert the field name and value thusly:
foo=tag     ^2
bar=tag     ^1.2
foobar=tag    ^1.8
and search "foo:tag".

Intuitively, I would expect Lucene to be optimized for searching the values
of fields, and not really the names of fields.  In a somewhat large index,
say 10 million documents, will Lucene search performance continue to be
acceptable if I load up documents with many fields like this?

Is there an upper limit on the number of fields comprising a document, and
if so what is it?

Or, is there some way to make my original approach work after all?

Regards and thanks,
Charlie

documents with large numbers of fields

Reply via email to