Re: Search within a sentence (revisited)

Peter Keegan Tue, 26 Jul 2011 05:56:51 -0700

Thanks Mark! The new patch is working fine with the tests and a few more. If
you have particular test cases in mind, I'd be happy to add them.


Thanks,
Peter

On Mon, Jul 25, 2011 at 5:56 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Sorry Peter - I introduced this problem with some kind of typo: I somehow
> changed an includeSpans variable to excludeSpans. I certainly didn't mean
> to - it makes no sense. I'm not sure how it happened, and I'm surprised the
> tests that passed before still passed!
>
> We could probably use even more tests before feeling too confident here…
>
> I've attached a patch for 3X with the new test and fix (changed that
> include back to exclude).
>
> - Mark Miller
> lucidimagination.com
>
> On Jul 25, 2011, at 10:29 AM, Mark Miller wrote:
>
> > Thanks Peter - if you supply the unit tests, I'm happy to work on the fixes.
> >
> > I can likely look at this later today.
> >
> > - Mark Miller
> > lucidimagination.com
> >
> > On Jul 25, 2011, at 10:14 AM, Peter Keegan wrote:
> >
> >> Hi Mark,
> >>
> >> Sorry to bug you again, but there's another case that fails the unit test
> >> (search within the second sentence), as shown here in the last test:
> >>
> >> package org.apache.lucene.search.spans;
> >>
> >> import java.io.Reader;
> >>
> >> import org.apache.lucene.analysis.Analyzer;
> >> import org.apache.lucene.analysis.TokenStream;
> >> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
> >> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
> >> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> >> import org.apache.lucene.document.Document;
> >> import org.apache.lucene.document.Field;
> >> import org.apache.lucene.index.IndexReader;
> >> import org.apache.lucene.index.RandomIndexWriter;
> >> import org.apache.lucene.index.Term;
> >> import org.apache.lucene.store.Directory;
> >> import org.apache.lucene.search.IndexSearcher;
> >> import org.apache.lucene.search.PhraseQuery;
> >> import org.apache.lucene.search.ScoreDoc;
> >> import org.apache.lucene.search.TermQuery;
> >> import org.apache.lucene.search.spans.SpanNearQuery;
> >> import org.apache.lucene.search.spans.SpanQuery;
> >> import org.apache.lucene.search.spans.SpanTermQuery;
> >> import org.apache.lucene.util.LuceneTestCase;
> >>
> >> public class TestSentence extends LuceneTestCase {
> >> public static final String field = "field";
> >> public static final String START = "^";
> >> public static final String END = "$";
> >> public void testSetPosition() throws Exception {
> >> Analyzer analyzer = new Analyzer() {
> >> @Override
> >> public TokenStream tokenStream(String fieldName, Reader reader) {
> >> return new TokenStream() {
> >> private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END, "9"};
> >> private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
> >> private int i = 0;
> >> PositionIncrementAttribute posIncrAtt =
> >> addAttribute(PositionIncrementAttribute.class);
> >> CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
> >> OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
> >> @Override
> >> public boolean incrementToken() {
> >> assertEquals(TOKENS.length, INCREMENTS.length);
> >> if (i == TOKENS.length)
> >> return false;
> >> clearAttributes();
> >> termAtt.append(TOKENS[i]);
> >> offsetAtt.setOffset(i,i);
> >> posIncrAtt.setPositionIncrement(INCREMENTS[i]);
> >> i++;
> >> return true;
> >> }
> >> };
> >> }
> >> };
> >> Directory store = newDirectory();
> >> RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
> >> Document d = new Document();
> >> d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
> >> writer.addDocument(d);
> >> IndexReader reader = writer.getReader();
> >> writer.close();
> >> IndexSearcher searcher = newSearcher(reader);
> >> SpanTermQuery startSentence = makeSpanTermQuery(START);
> >> SpanTermQuery endSentence = makeSpanTermQuery(END);
> >> SpanQuery[] clauses = new SpanQuery[2];
> >> clauses[0] = makeSpanTermQuery("1");
> >> clauses[1] = makeSpanTermQuery("2");
> >> SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
> >> SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
> >> System.out.println("query: "+query);
> >> ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
> >> assertEquals(1, hits.length);
> >> clauses[1] = makeSpanTermQuery("4");
> >> allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
> >> query = new SpanWithinQuery(allKeywords, endSentence, 0);
> >> System.out.println("query: "+query);
> >> hits = searcher.search(query, null, 1000).scoreDocs;
> >> assertEquals(0, hits.length);
> >> PhraseQuery pq = new PhraseQuery();
> >> pq.add(new Term(field, "3"));
> >> pq.add(new Term(field, "4"));
> >> System.out.println("query: "+pq);
> >> hits = searcher.search(pq, null, 1000).scoreDocs;
> >> assertEquals(1, hits.length);
> >> clauses[0] = makeSpanTermQuery("4");
> >> clauses[1] = makeSpanTermQuery("6");
> >> allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
> >> query = new SpanWithinQuery(allKeywords, endSentence, 0);
> >> System.out.println("query: "+query);
> >> hits = searcher.search(query, null, 1000).scoreDocs;
> >> assertEquals(1, hits.length);
> >> }
> >>
> >> public SpanTermQuery makeSpanTermQuery(String text) {
> >> return new SpanTermQuery(new Term(field, text));
> >> }
> >> public TermQuery makeTermQuery(String text) {
> >> return new TermQuery(new Term(field, text));
> >> }
> >> }
> >>
> >> Peter
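
For anyone following the test above, here is a minimal plain-Java sketch (no Lucene dependency) of what the custom TokenStream indexes and what the assertions expect. It only models positions: it sums INCREMENTS to get each token's absolute position, and treats a term's sentence number as the count of END markers at positions strictly before it. The class name SentencePositions and the helpers sentenceOf and sameSentence are made up for illustration. This is a model of the expected results under those assumptions, not a description of how SpanWithinQuery is implemented.

```java
// Hypothetical stand-alone model of the test's token stream; not Lucene code.
class SentencePositions {
    static final String END = "$";
    // Same arrays as in the test's TokenStream.
    static final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END, "9"};
    static final int[] INCREMENTS = {1, 1, 1, 0, 1, 1, 1, 0, 1};

    /** Absolute position of each token: running sum of increments, starting from -1. */
    static int[] positions() {
        int[] pos = new int[TOKENS.length];
        int current = -1;
        for (int i = 0; i < TOKENS.length; i++) {
            current += INCREMENTS[i]; // an increment of 0 stacks the token on the previous one
            pos[i] = current;
        }
        return pos;
    }

    /** Assumed rule: sentence number = count of END markers strictly before the term. */
    static int sentenceOf(String term) {
        int[] pos = positions();
        int termPos = -1;
        for (int i = 0; i < TOKENS.length; i++) {
            if (TOKENS[i].equals(term)) {
                termPos = pos[i];
                break;
            }
        }
        if (termPos < 0) {
            throw new IllegalArgumentException("unknown term: " + term);
        }
        int sentence = 0;
        for (int i = 0; i < TOKENS.length; i++) {
            if (TOKENS[i].equals(END) && pos[i] < termPos) {
                sentence++;
            }
        }
        return sentence;
    }

    static boolean sameSentence(String a, String b) {
        return sentenceOf(a) == sentenceOf(b);
    }

    public static void main(String[] args) {
        int[] pos = positions();
        for (int i = 0; i < TOKENS.length; i++) {
            System.out.println(TOKENS[i] + " -> position " + pos[i]);
        }
        System.out.println("1 and 2 same sentence: " + sameSentence("1", "2"));
        System.out.println("1 and 4 same sentence: " + sameSentence("1", "4"));
        System.out.println("4 and 6 same sentence: " + sameSentence("4", "6"));
    }
}
```

Under that model, "1"/"2", "1"/"3" and "4"/"6" share a sentence while "1"/"4" do not, which lines up with the hit counts asserted in the tests in this thread. Each END marker lands on the same position as the last word of its sentence because its increment is 0.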
> >>
> >> On Thu, Jul 21, 2011 at 5:23 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>
> >>>
> >>> I just uploaded a patch for 3X that will work for 3.2.
> >>>
> >>> On Jul 21, 2011, at 4:25 PM, Mark Miller wrote:
> >>>
> >>>> Yeah, it's off trunk - I'll submit a 3X patch in a bit - just have to
> >>>> change that to an IndexReader I believe.
> >>>>
> >>>> - Mark
> >>>>
> >>>> On Jul 21, 2011, at 4:01 PM, Peter Keegan wrote:
> >>>>
> >>>>> Does this patch require the trunk version? I'm using 3.2 and
> >>>>> 'AtomicReaderContext' isn't there.
> >>>>>
> >>>>> Peter
> >>>>>
> >>>>> On Thu, Jul 21, 2011 at 3:07 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>
> >>>>>> Hey Peter,
> >>>>>>
> >>>>>> Getting sucked back into Spans...
> >>>>>>
> >>>>>> That test should pass now - I uploaded a new patch to
> >>>>>> https://issues.apache.org/jira/browse/LUCENE-777
> >>>>>>
> >>>>>> Further tests may be needed though.
> >>>>>>
> >>>>>> - Mark
> >>>>>>
> >>>>>>
> >>>>>> On Jul 21, 2011, at 9:28 AM, Peter Keegan wrote:
> >>>>>>
> >>>>>>> Hi Mark,
> >>>>>>>
> >>>>>>> Here is a unit test using a version of 'SpanWithinQuery' modified for 3.2
> >>>>>>> ('getTerms' removed). The last test fails (search for "1" and "3").
> >>>>>>>
> >>>>>>> package org.apache.lucene.search.spans;
> >>>>>>>
> >>>>>>> import java.io.Reader;
> >>>>>>>
> >>>>>>> import org.apache.lucene.analysis.Analyzer;
> >>>>>>> import org.apache.lucene.analysis.TokenStream;
> >>>>>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
> >>>>>>> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
> >>>>>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> >>>>>>> import org.apache.lucene.document.Document;
> >>>>>>> import org.apache.lucene.document.Field;
> >>>>>>> import org.apache.lucene.index.IndexReader;
> >>>>>>> import org.apache.lucene.index.RandomIndexWriter;
> >>>>>>> import org.apache.lucene.index.Term;
> >>>>>>> import org.apache.lucene.store.Directory;
> >>>>>>> import org.apache.lucene.search.IndexSearcher;
> >>>>>>> import org.apache.lucene.search.PhraseQuery;
> >>>>>>> import org.apache.lucene.search.ScoreDoc;
> >>>>>>> import org.apache.lucene.search.TermQuery;
> >>>>>>> import org.apache.lucene.search.spans.SpanNearQuery;
> >>>>>>> import org.apache.lucene.search.spans.SpanQuery;
> >>>>>>> import org.apache.lucene.search.spans.SpanTermQuery;
> >>>>>>> import org.apache.lucene.util.LuceneTestCase;
> >>>>>>>
> >>>>>>> public class TestSentence extends LuceneTestCase {
> >>>>>>> public static final String field = "field";
> >>>>>>> public static final String START = "^";
> >>>>>>> public static final String END = "$";
> >>>>>>> public void testSetPosition() throws Exception {
> >>>>>>> Analyzer analyzer = new Analyzer() {
> >>>>>>> @Override
> >>>>>>> public TokenStream tokenStream(String fieldName, Reader reader) {
> >>>>>>> return new TokenStream() {
> >>>>>>> private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END, "9"};
> >>>>>>> private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
> >>>>>>> private int i = 0;
> >>>>>>>
> >>>>>>> PositionIncrementAttribute posIncrAtt =
> >>>>>>> addAttribute(PositionIncrementAttribute.class);
> >>>>>>> CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
> >>>>>>> OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
> >>>>>>>
> >>>>>>> @Override
> >>>>>>> public boolean incrementToken() {
> >>>>>>> assertEquals(TOKENS.length, INCREMENTS.length);
> >>>>>>> if (i == TOKENS.length)
> >>>>>>> return false;
> >>>>>>> clearAttributes();
> >>>>>>> termAtt.append(TOKENS[i]);
> >>>>>>> offsetAtt.setOffset(i,i);
> >>>>>>> posIncrAtt.setPositionIncrement(INCREMENTS[i]);
> >>>>>>> i++;
> >>>>>>> return true;
> >>>>>>> }
> >>>>>>> };
> >>>>>>> }
> >>>>>>> };
> >>>>>>> Directory store = newDirectory();
> >>>>>>> RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
> >>>>>>> Document d = new Document();
> >>>>>>> d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
> >>>>>>> writer.addDocument(d);
> >>>>>>> IndexReader reader = writer.getReader();
> >>>>>>> writer.close();
> >>>>>>> IndexSearcher searcher = newSearcher(reader);
> >>>>>>>
> >>>>>>> SpanTermQuery startSentence = makeSpanTermQuery(START);
> >>>>>>> SpanTermQuery endSentence = makeSpanTermQuery(END);
> >>>>>>> SpanQuery[] clauses = new SpanQuery[2];
> >>>>>>> clauses[0] = makeSpanTermQuery("1");
> >>>>>>> clauses[1] = makeSpanTermQuery("2");
> >>>>>>> SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
> >>>>>>> SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
> >>>>>>> System.out.println("query: "+query);
> >>>>>>> ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
> >>>>>>> assertEquals(1, hits.length);
> >>>>>>>
> >>>>>>> clauses[1] = makeSpanTermQuery("4");
> >>>>>>> allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
> >>>>>>> query = new SpanWithinQuery(allKeywords, endSentence, 0);
> >>>>>>> System.out.println("query: "+query);
> >>>>>>> hits = searcher.search(query, null, 1000).scoreDocs;
> >>>>>>> assertEquals(0, hits.length);
> >>>>>>>
> >>>>>>> PhraseQuery pq = new PhraseQuery();
> >>>>>>> pq.add(new Term(field, "3"));
> >>>>>>> pq.add(new Term(field, "4"));
> >>>>>>> hits = searcher.search(pq, null, 1000).scoreDocs;
> >>>>>>> assertEquals(1, hits.length);
> >>>>>>>
> >>>>>>> clauses[1] = makeSpanTermQuery("3");
> >>>>>>> allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
> >>>>>>> query = new SpanWithinQuery(allKeywords, endSentence, 0);
> >>>>>>> System.out.println("query: "+query);
> >>>>>>> hits = searcher.search(query, null, 1000).scoreDocs;
> >>>>>>> assertEquals(1, hits.length);
> >>>>>>>
> >>>>>>>
> >>>>>>> }
> >>>>>>>
> >>>>>>> public SpanTermQuery makeSpanTermQuery(String text) {
> >>>>>>> return new SpanTermQuery(new Term(field, text));
> >>>>>>> }
> >>>>>>> public TermQuery makeTermQuery(String text) {
> >>>>>>> return new TermQuery(new Term(field, text));
> >>>>>>> }
> >>>>>>> }
> >>>>>>>
> >>>>>>> Peter
> >>>>>>>
> >>>>>>> On Wed, Jul 20, 2011 at 9:22 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> On Jul 20, 2011, at 7:44 PM, Mark Miller wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Jul 20, 2011, at 11:27 AM, Peter Keegan wrote:
> >>>>>>>>>
> >>>>>>>>>> Mark Miller's 'SpanWithinQuery' patch
> >>>>>>>>>> seems to have the same issue.
> >>>>>>>>>
> >>>>>>>>> If I remember right (it's been more than a couple of years), I did index
> >>>>>>>>> the sentence markers at the same position as the last word in the sentence.
> >>>>>>>>> And I think the limitation that I accepted was that the word could belong to
> >>>>>>>>> both its true sentence and the one after it.
> >>>>>>>>>
> >>>>>>>>> - Mark Miller
> >>>>>>>>> lucidimagination.com
> >>>>>>>>
> >>>>>>>> Perhaps you could index the sentence marker at both the last word of the
> >>>>>>>> sentence and the first word of the next sentence, if there is one.
> >>>>>>>> This would seem to solve the above limitation as well?
> >>>>>>>>
> >>>>>>>> - Mark Miller
> >>>>>>>> lucidimagination.com
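
To make the two marker layouts discussed above concrete, here is a small plain-Java sketch (no Lucene dependency) that prints where the tokens and END markers would land. With the flag off it stacks a marker on the last word of each sentence, as described above; with the flag on it also stacks one on the first word of every sentence that follows another, as suggested. The class SentenceMarkerLayout, the layout method, and the "text@position" output format are all invented for illustration. The sketch only shows the positions; whether SpanWithinQuery then behaves as hoped with the doubled marker is not something it can demonstrate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for visualising marker placement; not part of Lucene or the patch.
class SentenceMarkerLayout {
    static final String END = "$";

    /**
     * Returns tokens as "text@position". A marker is stacked (same position) on the
     * last word of each sentence, and optionally also on the first word of each
     * sentence that has a predecessor.
     */
    static List<String> layout(String[][] sentences, boolean markNextSentenceStart) {
        List<String> out = new ArrayList<>();
        int position = 0;
        for (int s = 0; s < sentences.length; s++) {
            String[] words = sentences[s];
            for (int w = 0; w < words.length; w++) {
                out.add(words[w] + "@" + position);
                boolean lastWord = (w == words.length - 1);
                boolean firstWordAfterSentence = (w == 0 && s > 0);
                // At most one marker per position, even for one-word sentences.
                if (lastWord || (markNextSentenceStart && firstWordAfterSentence)) {
                    out.add(END + "@" + position);
                }
                position++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] sentences = {{"1", "2", "3"}, {"4", "5", "6"}};
        System.out.println("marker on last word only:   " + layout(sentences, false));
        System.out.println("marker on both boundaries:  " + layout(sentences, true));
    }
}
```

In an analyzer, stacking a token on the previous one corresponds to emitting it with a position increment of 0, as the test's INCREMENTS array does for END.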
> >>>>>>>>
> ---------------------------------------------------------------------
> >>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>> - Mark Miller
> >>>>>> lucidimagination.com
> >>>>>>
> >>>>
> >>>> - Mark Miller
> >>>> lucidimagination.com
> >>>>
> >>>
> >>> - Mark Miller
> >>> lucidimagination.com
> >>>
> >
