I think there's a better way - using SpanFirstQuery. This query scores higher a term positioned closer to the beginning of a document.
Assume your resolution is 1%, i.e. 1 to 100, create the percentage field with max token position 99. Generate the tokens such that all IDs with percentage 100 would have token position of 0, all those with percentage 99 would have token position of 1, all those with percentage n would have token position of 100-n. I never done it myself but I know it is doable, see Token.setPositionIncrement(). So for document A below the percentage field would look like this: "2[50] 5[20] 13[0] 3[20]" - read this as tokenWord[tokenPositionIcrement]. So this would places "2" in position 50, "5" and "13" in position 70, and "3" in position 90. The query for "2" and "3" would be composed of two SpanFirstQuery objects - one with "2" and one with "3" - composed in a BooleanQuery I guess. You can change the effect of the distance on the score by providing your Similarity class and overriding sloppyFreq(). Regards, Doron mmoser <[EMAIL PROTECTED]> wrote on 15/12/2006 21:09:27: > > Thanks for your reply. We have currently thought about both of these > approaches, so that definitely makes me feel better about things. The first > approach you had mentioned, we had thought about our tagging problem and how > to make a product tag come to the top, but again, with a lot of tags, the > data becomes extremely large. Even if we were to take some magnitude and > apply it, it could possibly be huge due to an infinite number of tags. > > The second approach we had considered doing a version almost like that. I > had considered that if we have a query such as the one you are talking about > mixed with other fields, then the query would be extremely big. Especially > if the user was looking for 15 attribute ids which would be 15 * 10 = 150 > different ORed term searches alone if we were only searching every 10 > percent. > > I definitely thank you for this input and would love to hear if anyone has > any other approaches. If not, we will probably attempt one of these routes > and see which one is a bigger storage / performance hit. > > Mike > > > > Doron Cohen wrote: > > > > I think the right solution for this would use "payloads", where extra data > > can be added for each index token. However Lucene currently does not > > support this. Without this I can think of two options, each with its own > > disadvantage: > > > > 1) more tokens at indexing time - decide on the resolution of the > > percentage - say it is 5% - and add more tokens of the same. For example, > > the attributes field for product A in your example would look like: "2 2 2 > > 2 2 2 2 2 2 2 5 5 5 5 5 5 3 3 13 13 13 13 13 13". > > > > 2) more tokens at search time - at indexing, include the percentage in the > > token. So for product A you would have: "2x50 5x30 3x10 13x30". At search > > time, expand the query accordingly. So the query for attribute 5 would be > > expanded to: "5x5^5 5x10^10 5x15^15 5x20^20 ... 5x95^95 5x100^100". > > > > The first approach would enlarge the index, so if you have lots of data > > that could eventually be a problem. > > The second approach would end up with a large query, so, again, if you > > have > > lots of data that could eventually be a problem with search time. > > > > Also, depending how strict you want the scoring to be, you may want to > > omit > > norms for this field. > > > > Hope this helps, > > Doron > > > > mmoser <[EMAIL PROTECTED]> wrote on 15/12/2006 13:17:05: > >> > >> So, I am still new to Lucene, so please take this into consideration when > >> reading this. Up until now, a novice like myself has been able to finagle > >> Lucene into doing what we want. But now we have a problem that I have > > been > >> searching for the answer to. We allow users to profile our products with > > a > >> predetermined profile attribute id. We then want to take all the users > >> profiles on a product and take a particular number of times that this > >> particular profile attribute id has been chosen and come out with a > >> percentage for it. This is no problem. Where the problem comes into play > > is > >> that we want the user to be able to search for products that match that > >> particular profile attribute id. We want the higher percentages to come > > up > >> on top. To add to the complexity, we want to be able to allow for the > > user > >> to select multiple profile attribute ids and still have a combination of > > the > >> score to come up higher. Keep in mind, we would like to somehow keep > > these > >> in one field, because we are trying to use the same algorithm for > > something > >> that could potentially become very large. Any suggestions. The more > > detail, > >> the better. > >> > >> Example: > >> > >> Product A > >> Attribute ID = 2 Percentage Chosen = 50% > >> Attribute ID = 5 Percentage Chosen = 30% > >> Attribute ID = 3 Percentage Chosen = 10% > >> Attribute ID = 13 Percentage Chosen = 30% > >> > >> Product B > >> Attribute ID = 1 Percentage Chosen = 50% > >> Attribute ID = 2 Percentage Chosen = 20% > >> Attribute ID = 3 Percentage Chosen = 75% > >> > >> So if a user selected the attributes that correspond to 2 and 3, then > >> Product B should show up before Product A because it has a combined score > > of > >> 95% and A has a combined score of 40%. > >> > >> Thanks for any help. > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > -- > View this message in context: http://www.nabble.com/Sorting-by-a- > percentage-On-a-field-tf2829501.html#a7903444 > Sent from the Lucene - Java Users mailing list archive at Nabble.com. > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]