"Maxym Mykhalchuk" <[EMAIL PROTECTED]> wrote on 11/04/2006 11:52:07 AM: > As for improving multi-word queries, Doug Cutting recently posted a link to > his presentation, > http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf,
> just scroll down to Nutch N-Grams there, and you'll see the answer. > Basically, "Buffy the Vampire" is indexed as buffy, buffy-the+0, the, > the-vampire+0, vampire; but that's only for common terms like "the", "www" > etc. I wander if it's possible to use the same technique to store phrases... The method you propose has been known for a long time, and sometimes called "Lexical Affinities". For example, http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf explains: "Lexical affinities (LAs) were first introduced by Saussure in 1947 to represent the correlation between words co-occurring in a given language and then restricted to a given document for IR purposes. LAs are identified by looking at pairs of words found in close proximity to each other. It has been described elsewhere how LAs, when used as indexing units, improve precision of search by disambiguating terms." A paper from SIGIR from as back as 1989 http://widit.slis.indiana.edu/irpub/SIGIR/1989/cite21.htm makes use of this technique. However, I'm not sure that this technique is still considered the best one. It can have a large impact on the index size, and it may be possible to get similar results with no impact on index size and just a small run-time slowdown by using something like SpanNearQuery, or a variation on this idea. Again, I didn't yet try to do this myself, so I'm not sure how successful that would be. -- Nadav Har'El --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
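
To make the trade-off concrete, here is a rough sketch in plain Java (no Lucene dependency) that contrasts the two approaches discussed above. It is only an illustration under stated assumptions, not Nutch's or Lucene's actual code: the class name AffinitySketch, the COMMON word list, and the helpers indexTokens and near are all hypothetical, and the "+0" suffix is simply copied from the token format quoted from the presentation. The first helper adds pair tokens at index time (more terms in the index); the second checks proximity at query time from word positions, which is roughly the kind of test a SpanNearQuery-style query performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AffinitySketch {
    // Hypothetical list of "common" terms; a real system would pick these from corpus statistics.
    static final Set<String> COMMON = Set.of("the", "www", "a", "of");

    // Lower-case the text and split it into words on non-word characters.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Index-time approach: emit every word, plus a pair token for each adjacent
    // pair in which at least one word is common. The extra tokens grow the index.
    static List<String> indexTokens(String text) {
        List<String> words = tokenize(text);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i);
            out.add(w);
            if (i + 1 < words.size()) {
                String next = words.get(i + 1);
                if (COMMON.contains(w) || COMMON.contains(next)) {
                    out.add(w + "-" + next + "+0");
                }
            }
        }
        return out;
    }

    // Query-time approach: no extra tokens; report whether some occurrence of a
    // and some occurrence of b have at most maxGap other words between them.
    static boolean near(String text, String a, String b, int maxGap) {
        List<String> words = tokenize(text);
        for (int i = 0; i < words.size(); i++) {
            if (!words.get(i).equals(a)) {
                continue;
            }
            for (int j = 0; j < words.size(); j++) {
                if (j != i && words.get(j).equals(b) && Math.abs(i - j) - 1 <= maxGap) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // prints [buffy, buffy-the+0, the, the-vampire+0, vampire]
        System.out.println(indexTokens("Buffy the Vampire"));
        // prints true: one word ("the") lies between the two terms
        System.out.println(near("Buffy the Vampire", "buffy", "vampire", 1));
        // prints false: the terms are not directly adjacent
        System.out.println(near("Buffy the Vampire", "buffy", "vampire", 0));
    }
}
```

In this sketch the pair tokens turn three words into five index entries, while the proximity check stores nothing extra and instead pays for a scan over positions at query time. That is the index-size versus run-time trade-off mentioned above, in miniature.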