Karl, This might work for you: https://issues.apache.org/jira/browse/LUCENE-293
Regards, Paul Elschot On Friday 14 December 2007 18:06:01 Karl Wettin wrote: > I have an index that contains three sorts of documents: > > Car brand > Tire brand > Tire pressure > > (Please bear with me, the real index has nothing to do with cars. I > just try to explain the problem in an alternative domain to avoid NDA > conflicts.) > > There is a heirarchial composite relationship between these sort of > documents. A document describing "tire pressure" also contains "tire > brand" and "car brand". A document describing "tire brand" also > contains information about "car brand". A document describing "car > brand" contains only that. > > The requirement is that the consumer of the API should not have to > specify what fields they are searching in. There is no time (nor > training data) to implement a hidden markov model (HMM) tokenizer or > something along that path in order to extract possible attributes from > the query string. Instead the query string is tokenized once per field > and assebled to one huge query. Normally this works fairly well. > > Here are some example documents: > > Volvo > Volvo, Michelin > Volvo, Nokian > Volvo, Nokian, 2.2 bars > Volvo, Firestone, 2.4 bars > > Saab > Saab, Michelin > Saab, Nokian > Saab, Nokian, 2.1 bars > Saab, Firestone > Saab, Firestone, 2.4 bars > Saab, Firestone, 2.5 bars > > If I search for Saab the top result will be the document representing > the car brand "Saab". The query would look like this: "car:saab > tire:saab preasure:saab" > > But lets say Saab starts manufacturing tires too: > > Saab > Saab, Saab tires > Saab, Saab tires, 1.9 bars > Saab, Saab tires, 1.8 bars > > If I search for "Saab" I still want the top result to be Saab the car > brand. But it not longer is, the match for "Saab, Saab tires" now > have a greater score than "Saab", of course. > > My idea is to work along the line of indexing "Saab" in the tire brand > and tire pressure field too. Now searching for Saab will yeild a > result where the car brand "Saab" is the top result. > > However, this will not work as I have different tokenization > strategies for each field (stemming and what not). Tokenizing the > query string Saab for the field "tire brand" in Swedish might end up > as "saa" and will thus not find the token Saab inserted for the > document describing the car brand Saab. > > I have a couple of experiments in my head I need to try out, starting > with tokezining query strings per field and using the tokens generated > for the field car brand as query in the tire brand and tire pressure > too. And vice versus. > > Any brilliant ideas that might work? Hacky solutions are OK. > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]