Re: dash-words

karl wettin Mon, 24 Jul 2006 06:16:13 -0700

On Mon, 2006-07-24 at 13:51 +0200, karl wettin wrote:
> On Mon, 2006-07-24 at 00:34 -0400, Yonik Seeley wrote:
> > > filter words with a dash
> > >
> > > ["x-men"]
> > > ["xmen"]
> > > ["x", "men"]
> > >
> > > The problem is ["x", "men"] requiring a distance between the terms
> > > and thus also matching "x-men men".
> > 
> > WordDelimiterFilter from Solr does this
> 
> > It also has the false match problem you mention...
> 
> Will it affect a phrase query?
> 
> I.e. would "the xmen are" be a no-match as the filtered index data
> would be "the x (men|xmen|x-men) are here"?
> 
> I'll write a test now.


Yes, it affects PhraseQuery. Only "the x men are" will match.



package org.apache.solr.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

import java.io.Reader;
import java.util.HashSet;


public class TestWordDelimiterFilter {

    public static void main(String[] args) throws Exception {
        final String field = "field";

        Directory dir = new RAMDirectory();
        Analyzer a = new Analyzer();

        IndexWriter w = new IndexWriter(dir, a, true);
        Document d = new Document();
        d.add(new Field(field, "the x-men are here", Field.Store.NO,
                Field.Index.TOKENIZED, Field.TermVector.NO));
        w.addDocument(d);
        w.close();

        IndexSearcher is = new IndexSearcher(dir);

        // "the x-men are": the unsplit hyphenated term. Prints 0 (no match).
        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term(field, "the"));
        pq.add(new Term(field, "x-men"));
        pq.add(new Term(field, "are"));
        System.out.println(is.search(pq).length());

        // "the xmen are": the catenated term. Prints 0 (no match).
        pq = new PhraseQuery();
        pq.add(new Term(field, "the"));
        pq.add(new Term(field, "xmen"));
        pq.add(new Term(field, "are"));
        System.out.println(is.search(pq).length());

        // "the x men are": the two word parts as separate terms. Prints 1 (the only match).
        pq = new PhraseQuery();
        pq.add(new Term(field, "the"));
        pq.add(new Term(field, "x"));
        pq.add(new Term(field, "men"));
        pq.add(new Term(field, "are"));
        System.out.println(is.search(pq).length());

        is.close();
        dir.close();

    }

    public static class Analyzer extends org.apache.lucene.analysis.Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
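            // Empty stop-word set, so "the" and "are" stay in the token stream.
            TokenStream ts = new StandardAnalyzer(new HashSet())
                    .tokenStream(fieldName, reader);
            // Flags in order: generateWordParts, generateNumberParts,
            // catenateWords, catenateNumbers, catenateAll.
            ts = new WordDelimiterFilter(ts, 1, 1, 0, 0, 0);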
            return ts;
        }
    }

}
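
To make the position argument concrete without a Lucene dependency, here is a rough sketch of slop-0 phrase matching over a tiny positional index. It is a simplified model, not how PhraseQuery is actually implemented. The class PhrasePositionSketch and its methods index and matches are made-up names for illustration, and both token layouts are assumptions: the first is what the test above appears to index (word parts only), the second is the stacked layout hypothesized in the quoted question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone model of positional phrase matching (not Lucene code).
final class PhrasePositionSketch {

    // Builds term -> positions for one text; tokens[pos] holds every term at that position.
    static Map<String, List<Integer>> index(String[][] tokens) {
        Map<String, List<Integer>> postings = new HashMap<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            for (String term : tokens[pos]) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(pos);
            }
        }
        return postings;
    }

    // Exact phrase (slop 0): term i of the phrase must sit at position start + i.
    static boolean matches(Map<String, List<Integer>> postings, String... phrase) {
        List<Integer> starts = postings.get(phrase[0]);
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < phrase.length && ok; i++) {
                List<Integer> positions = postings.get(phrase[i]);
                ok = positions != null && positions.contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed layout for "the x-men are here" when only word parts are emitted.
        Map<String, List<Integer>> parts = index(new String[][] {
            {"the"}, {"x"}, {"men"}, {"are"}, {"here"}});
        System.out.println(matches(parts, "the", "x-men", "are"));    // false
        System.out.println(matches(parts, "the", "xmen", "are"));     // false
        System.out.println(matches(parts, "the", "x", "men", "are")); // true

        // Assumed layout from the question: "the x (men|xmen|x-men) are here".
        Map<String, List<Integer>> stacked = index(new String[][] {
            {"the"}, {"x"}, {"men", "xmen", "x-men"}, {"are"}, {"here"}});
        System.out.println(matches(stacked, "the", "xmen", "are"));      // false
        System.out.println(matches(stacked, "the", "x", "xmen", "are")); // true
    }
}
```

In this model "the xmen are" fails in both layouts because "x" still occupies the position right after "the". The stacked layout would only match an odd phrase like "the x xmen are".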


