I buy your theory that StandardAnalyzer is breaking up the stream, and that this might be an indexing issue rather than a query issue. When I look at my index in Luke, as far as I can tell the literal (Parenth+eses is stored, not the broken-up tokens. Also, I can't seem to find an analyzer that doesn't suffer from these issues.
I've created a standalone test case that demonstrates the current behavior. I've attached it to this email. It should just require JUnit and Lucene. It might actually be useful in general for anyone trying to figure out various Lucene behaviors.

At a high level, am I correctly understanding that Lucene doesn't support searching on indexed special characters without significant additional machinations? If so, has anyone gone through those machinations and posted a link :)? Given the test case, is this worthy of a JIRA issue?

On Thu, May 14, 2009 at 4:59 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> I suspect that what's happening is that StandardAnalyzer is breaking
> your stream up on the "odd" characters. All escaping them on the
> query does is ensure that they're not interpreted by the parser as (in
> this case) the beginning of a group and a MUST operator. So, I
> claim it correctly feeds (Parenth+eses to the analyzer, which then
> breaks it up into the tokens you indicated.
>
> Assuming you've tried to index this exact string with StandardAnalyzer,
> if you looked in your index (say with Luke), you'd see that "parenth" and
> "eses" were the tokens indexed.
>
> Warning: I haven't used the ngram tokenizers, so I know just enough to
> be dangerous. That said, you could tokenize these as ngrams. I'm not sure
> what the base ngram tokenizer does with your special characters, but you
> could pretty easily create your own analyzer that spits out, say, 2- (or
> whatever) grams and use that to index and search, possibly using a second
> field(s) for the data you wanted to treat this way...
>
> HTH
> Erick
>
> On Thu, May 14, 2009 at 7:18 PM, Ari Miller <ari1...@gmail.com> wrote:
>
> > Say I have a book title, literally:
> >
> > (Parenth+eses
> >
> > How would I do a search to find exactly that book title, given the
> > presence of the ( and + ? QueryParser.escape isn't working.
> > I would expect to be able to search for (Parenth+eses [exact match] or
> > (Parenth+e [partial match].
> > I can use QueryParser.escape to escape the user search term, but
> > feeding that to QueryParser with a StandardAnalyzer doesn't return
> > what I would expect.
> >
> > For example, (Parenth+eses --> QueryParser.escape --> \(Parenth\+eses,
> > when parsed becomes:
> >
> > PhraseQuery:
> >   Term: parenth
> >   Term: eses
> >
> > Note that the escaped special characters seem to be turned into
> > spaces, not used literally.
> > Up to this point, even attempting to directly create an appropriate
> > query (PrefixQuery, PhraseQuery, TermQuery, etc.), I've been unable to
> > come up with a query that will match the text with special characters
> > and only that text.
> > My longer-term goal is to be able to take a user search term, identify
> > it as a literal term (nothing inside should be treated as Lucene
> > special characters), and do a PrefixQuery with that literal term.
> >
> > In case it matters, the field I'm searching on is indexed, tokenized,
> > and stored.
> >
> > Potentially relevant existing JIRA issues:
> > http://issues.apache.org/jira/browse/LUCENE-271
> > http://issues.apache.org/jira/browse/LUCENE-588
> >
> > Thanks,
> > Ari
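For what it's worth, Erick's point about where those tokens come from is easy to see without Lucene at all. The toy splitter below is my own rough approximation (StandardAnalyzer's real grammar is more involved): split on anything that isn't a letter or digit, then lowercase. Escaping changes nothing at query time, because the backslashes are themselves non-alphanumeric and get split away too:

```java
import java.util.ArrayList;
import java.util.List;

// Rough stand-in for what StandardAnalyzer does to "(Parenth+eses":
// split on non-alphanumerics and lowercase. The real analyzer uses a
// JFlex grammar; this is only an approximation for illustration.
public class ToyTokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.split("[^A-Za-z0-9]+")) {
            if (t.length() > 0) {
                tokens.add(t.toLowerCase());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The raw title and its QueryParser.escape'd form yield the same
        // tokens, which is why escaping doesn't help here.
        System.out.println(tokenize("(Parenth+eses"));     // [parenth, eses]
        System.out.println(tokenize("\\(Parenth\\+eses")); // [parenth, eses]
    }
}
```

So both the indexed title and the escaped query collapse to the tokens parenth and eses, and the parentheses and plus sign are gone before any query machinery runs.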
import static org.junit.Assert.*;

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Test;

public class StandaloneLuceneTest {

    private static final String FIELD = "MyField";

    private IndexWriter indexWriter;
    private RAMDirectory ramDirectory;

    @Test
    public void testKeywordSearch() throws ParseException, CorruptIndexException, IOException {
        String datapointTitle = "(Parenth+eses";
        String escapedDatapointTitle = QueryParser.escape(datapointTitle);
        String simpleDatapointTitle = "Ghostbusters";
        String truncatedSimpleDatapointTitle = "Ghostbuste";

        ramDirectory = new RAMDirectory();

        // Create the index
        openIndex();
        addToIndex(datapointTitle);
        addToIndex(simpleDatapointTitle);
        closeAndOptimizeIndex();

        assertEquals("Term query works on simple datapoint name",
                1, runQuery(asTermQuery(simpleDatapointTitle)));
        assertEquals("Phrase query works on simple datapoint name",
                1, runQuery(asPhraseQuery(simpleDatapointTitle)));
        assertEquals("Term query with escaped literal title doesn't work",
                0, runQuery(asTermQuery(escapedDatapointTitle)));
        assertEquals("Phrase query with escaped literal title doesn't work",
                0, runQuery(asPhraseQuery(escapedDatapointTitle)));
        assertEquals("Term query with literal title doesn't work",
                0, runQuery(asTermQuery(datapointTitle)));
        assertEquals("Phrase query with literal title doesn't work",
                0, runQuery(asPhraseQuery(datapointTitle)));

        // Using the parser
        assertEquals("Without a wildcard, the search works",
                1, searchWithKeyWord(escapedDatapointTitle));
        assertEquals("Add a wildcard, and the search fails",
                0, searchWithKeyWord(escapedDatapointTitle + "*"));
        assertEquals("Somehow, with the keyword + wildcard in quotes, this search now works",
                1, searchWithKeyWord("\"" + escapedDatapointTitle + "*\""));
        assertEquals("Without a wildcard, the search works",
                1, searchWithKeyWord(simpleDatapointTitle));
        assertEquals("Add a wildcard, and the search still works",
                1, searchWithKeyWord(simpleDatapointTitle + "*"));
        assertEquals("With the keyword + wildcard in quotes, the search works",
                1, searchWithKeyWord("\"" + simpleDatapointTitle + "*\""));
        assertEquals("Without a wildcard, the partial term search doesn't work",
                0, searchWithKeyWord(truncatedSimpleDatapointTitle));
        assertEquals("Add a wildcard, and the search works",
                1, searchWithKeyWord(truncatedSimpleDatapointTitle + "*"));
        assertEquals("With the keyword + wildcard in quotes, the search fails",
                0, searchWithKeyWord("\"" + truncatedSimpleDatapointTitle + "*\""));
    }

    private void addToIndex(String datapointTitle) throws CorruptIndexException, IOException {
        final Document doc = new Document();
        doc.add(new Field(FIELD, datapointTitle, Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
    }

    private int searchWithKeyWord(String keyword) throws ParseException, CorruptIndexException, IOException {
        QueryParser parser = new QueryParser(FIELD, new StandardAnalyzer());
        Query query = parser.parse(keyword);
        System.out.println("keyword[" + keyword + "],class["
                + query.getClass().getSimpleName() + "]" + query.toString());
        return runQuery(query);
    }

    private int runQuery(Query query) throws CorruptIndexException, IOException {
        IndexReader reader = IndexReader.open(ramDirectory, true);
        Searcher searcher = new IndexSearcher(reader);
        final TopDocs topDocs = searcher.search(query, null, 100);
        searcher.close();
        reader.close();
        return topDocs.totalHits;
    }

    private static TermQuery asTermQuery(String string) {
        return new TermQuery(new Term(FIELD, string.toLowerCase()));
    }

    private static PhraseQuery asPhraseQuery(String string) {
        PhraseQuery phraseQuery = new PhraseQuery();
        phraseQuery.add(new Term(FIELD, string.toLowerCase()));
        return phraseQuery;
    }

    private void openIndex() throws IOException {
        indexWriter = new IndexWriter(ramDirectory, new StandardAnalyzer(),
                true, IndexWriter.MaxFieldLength.LIMITED);
        indexWriter.setMergeFactor(100);
    }

    private synchronized void closeAndOptimizeIndex() throws IOException {
        indexWriter.optimize();
        indexWriter.commit();
        indexWriter.close();
    }
}
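To frame the usual workaround: if the title were indexed as a single untokenized term (Field.Index.NOT_ANALYZED in this Lucene version, perhaps in a second field), TermQuery and PrefixQuery would just do exact and prefix lookups against the term dictionary, with no analyzer in the way. The pure-JDK sketch below mimics that term-dictionary behavior with a TreeSet; it is an illustration of the idea only, not Lucene code:

```java
import java.util.TreeSet;

// Pure-JDK sketch of exact and prefix lookups against an untokenized
// term dictionary -- roughly the effect of indexing the whole title
// with Field.Index.NOT_ANALYZED and querying with TermQuery or
// PrefixQuery. Illustration only; not Lucene API.
public class LiteralTermSketch {
    private final TreeSet<String> terms = new TreeSet<String>();

    void index(String title) {
        terms.add(title); // the whole title is one term; no analyzer runs
    }

    boolean exactMatch(String query) {   // analogous to TermQuery
        return terms.contains(query);
    }

    boolean prefixMatch(String prefix) { // analogous to PrefixQuery
        String candidate = terms.ceiling(prefix);
        return candidate != null && candidate.startsWith(prefix);
    }

    public static void main(String[] args) {
        LiteralTermSketch index = new LiteralTermSketch();
        index.index("(Parenth+eses");
        index.index("Ghostbusters");
        System.out.println(index.exactMatch("(Parenth+eses")); // true
        System.out.println(index.prefixMatch("(Parenth+e"));   // true
        System.out.println(index.prefixMatch("Parenth"));      // false
    }
}
```

The catch, of course, is that exact and prefix matching on the raw string is then case- and punctuation-sensitive, so any normalization (lowercasing, trimming) has to be applied identically at index and query time.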
--------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org