On Wed, Feb 24, 2010 at 11:20 PM, Aaron Lav <a...@pobox.com> wrote:

> On Wed, Feb 24, 2010 at 10:18:27PM +0200, Avi Rosenschein wrote:
> > On Wed, Feb 24, 2010 at 3:42 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> > >
> > > What would it be?
> > >
> >
> > For scoring to take into account the non-analyzed token stream.
> >
> > That is, if a field is analyzed (stemmed, lowercased, maybe even stop words
> > removed), that is fine for indexing. But tokens in the query matching the
> > original form could still get a higher score than those that only match when
> > analyzed.
>
> You can get some of that effect by indexing stemmed and unstemmed
> forms, and letting IDF boost unstemmed results. (I picked this
> idea up from http://lingpipe-blog.com/2007/03/21/to-stem-or-not-to-stem/)
>
This is not quite the same (either in relevance or efficiency). I would
like the infrastructure for this to be built into Lucene, so that queries
and scorers could take advantage of it.

> > Also, this would maybe allow a flexible, run-time, decision of what
> > analyzers to include. For example, I might want stemming turned on for
> > normal search, but not for a PhraseQuery.
>
> That's harder - different field names for the different analyses might
> work, but not for run-time decisions. I think the way Sun's Minion does
> it is morphologically-based query expansion (see
> http://blogs.sun.com/searchguy/entry/lightweight_morphology_vs_stemming),
> and you might be able to implement that via query rewriting.

Again, rather than forcing me to store a separate field for every possible
type of query I might want to build, Lucene should be able to efficiently
store the original information in a form conducive to using at query time.
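
To make the idea concrete, here is a rough toy sketch in plain Python. It is not Lucene code and does not use any Lucene API. Each token is indexed under both its original (lowercased) form and its stem, an exact match earns an extra IDF-weighted contribution, and the caller decides at query time whether stemmed matches count at all. The names `toy_stem`, `DualFormIndex`, and `use_stemming` are made up for illustration, and the suffix stripper is only a stand-in for a real stemmer.

```python
import math
from collections import defaultdict


def toy_stem(token):
    """Crude suffix stripper standing in for a real stemmer (an assumption)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


class DualFormIndex:
    """Toy inverted index keeping both the surface form and the stem."""

    def __init__(self):
        self.exact = defaultdict(set)    # lowercased surface form -> doc ids
        self.stemmed = defaultdict(set)  # stem -> doc ids
        self.num_docs = 0

    def add(self, doc_id, text):
        self.num_docs += 1
        for token in text.lower().split():
            self.exact[token].add(doc_id)
            self.stemmed[toy_stem(token)].add(doc_id)

    def _idf(self, postings):
        # Rarer terms get a larger weight.
        return math.log(1 + self.num_docs / len(postings))

    def search(self, query, use_stemming=True):
        """Score docs; use_stemming is chosen per query, not at index time."""
        scores = defaultdict(float)
        for token in query.lower().split():
            # Exact surface-form matches always contribute.
            if token in self.exact:
                weight = self._idf(self.exact[token])
                for doc_id in self.exact[token]:
                    scores[doc_id] += weight
            # Stemmed matches contribute only when the caller asks for them.
            if use_stemming:
                stem = toy_stem(token)
                if stem in self.stemmed:
                    weight = self._idf(self.stemmed[stem])
                    for doc_id in self.stemmed[stem]:
                        scores[doc_id] += weight
        # Highest score first; ties broken by doc id for stable output.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = DualFormIndex()
index.add(1, "searching the index")
index.add(2, "search the index")
index.add(3, "searches are slow")

print(index.search("searching"))                      # doc 1 ranks first
print(index.search("searching", use_stemming=False))  # only doc 1 matches
```

In this sketch the document containing the literal query token outranks the ones that match only after stemming, and a phrase-style query could simply pass `use_stemming=False`. Note that it still keeps two posting maps, which is exactly the duplication I would prefer the index to handle internally.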