Re: Small field indexing and ranking

Nadav Har'El Tue, 11 Apr 2006 02:21:48 -0700

"Maxym Mykhalchuk" <[EMAIL PROTECTED]> wrote on 11/04/2006 11:52:07 AM:
> As for improving multi-word queries, Doug Cutting recently posted a link
to
> his presentation,
> http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf,


> just scroll down to Nutch N-Grams there, and you'll see the answer.
> Basically, "Buffy the Vampire" is indexed as buffy, buffy-the+0, the,
> the-vampire+0, vampire; but that's only for common terms like "the",
"www"
> etc. I wander if it's possible to use the same technique to store
phrases...

The method you propose has been known for a long time, and sometimes called
"Lexical Affinities".
For example, http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf
explains:

   "Lexical affinities (LAs) were first introduced by Saussure in 1947
    to represent the correlation between words co-occurring in a given
    language and then restricted to a given document for IR purposes.
    LAs are identified by looking at pairs of words found in close
    proximity to each other. It has been described elsewhere how LAs,
    when used as indexing units, improve precision of search by
    disambiguating terms."

A paper from SIGIR from as back as 1989
http://widit.slis.indiana.edu/irpub/SIGIR/1989/cite21.htm
makes use of this technique.

However, I'm not sure that this technique is still considered the
best one. It can have a large impact on the index size, and it may
be possible to get similar results with no impact on index size
and just a small run-time slowdown by using something like
SpanNearQuery, or a variation on this idea. Again, I didn't yet
try to do this myself, so I'm not sure how successful that would
be.

--
Nadav Har'El


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Small field indexing and ranking

Reply via email to