Hi all, I am trying to create an index where I can apply specific boost values to documents, depending on the notion of something like a multiple tags for documents. Each tag would wind up having a different boost value in each document.
My original approach was to add many fields with the same name, and the value of each field would be the tag. I applied a different boost to each tag. So something like: tag=foo ^2 tag=bar ^1.2 tag=foobar ^1.8 searching "tag:bar", for example. Different documents might have different boost values for "tag=bar", influencing the scoring. A document might have hundreds of these tags. I might be searching for hundreds of different values in these tag fields. I have discovered this doesn't work, seemingly because of the way DefaultSimilarity creates the fieldNorm. It seems like internally Lucene is effectively combining these into one tag field, with 3 terms and a boost value that is the product of each of the 3 boost values. If id did indeed have a document with many many of these tag fields I'd wind up with a freakishly huge boost on that document. So now I have the idea to invert the field name and value thusly: foo=tag ^2 bar=tag ^1.2 foobar=tag ^1.8 and search "foo:tag". Intuitively, I would expect Lucene to be optimized for searching the values of fields, and not really the names of fields. In a somewhat large index, say 10 million documents, will Lucene search performance continue to be acceptable if I load up documents with many fields like this? Is there an upper limit on the number of fields comprising a document, and if so what is it? Or, is there some way to make my original approach work after all? Regards and thanks, Charlie