As far as I know, this is the case where you want your custom Similarity that knows how to deal with a small number of terms.
public float lengthNorm(String fieldName, int numTerms) { if (numTerms < N) // return something smart return (float)(1.0 / Math.sqrt(numTerms)); } I think the rest of what you said is correct. Look at this piece of Similarity javadoc: * However the resulted <i>norm</i> value is [EMAIL PROTECTED] #encodeNorm(float) encoded} as a single byte * before being stored. * At search time, the norm byte value is read from the index * [EMAIL PROTECTED] org.apache.lucene.store.Directory directory} and * [EMAIL PROTECTED] #decodeNorm(byte) decoded} back to a float <i>norm</i> value. * This encoding/decoding, while reducing index size, comes with the price of * precision loss - it is not guaranteed that decode(encode(x)) = x. * For instance, decode(encode(0.89)) = 0.75. * Also notice that search time is too late to modify this <i>norm</i> part of scoring, e.g. by * using a different [EMAIL PROTECTED] Similarity} for search. If you come up with a more generic lengthNorm that dels with "overly short" documents/fields well, I'd love to know! :) Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share ----- Original Message ---- From: John Kleven <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, April 5, 2007 1:45:34 PM Subject: Re: short documents = help me tweak Similarity?? Sorry to re-post -- is this the correct forum for questions like this? I think that writing a new encode/decode operation should help alleviate my problem, but thought that this must be fairly widespread issue for people using lucene for "non-web-page" searches (i.e., shorter documents) Thanks again, John On 4/2/07, John Kleven <[EMAIL PROTECTED]> wrote: > > My documents are cars... > i.e., > Nissan Altima Sports Package > Nissan Altima Standard > > The problem I have is when i search "Nissan Altima", I want to get the 2nd > hit back first, i.e. "Nissan Altima Standard", because it is shorter. > However, this doesn't happen. They are both scored the exact same. > > I know that the lengthNorm in Similarity is using 1/sqrt(numTerms), and > you would think that would be enuff to make sure the order is correct. > However, it is not, and I assume this is because of the encode/decode > functions that pack this value into a single byte do not have the > granularity to represent differences between numbers like 1/sqrt(3) vs > 1/sqrt(4)?? > > Is the suggested approach here to re-write the encode/decode operations, > or is there any easier way? > > Thanks kindly - > John --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]