SpanNearQuery behaviour?

Yu Zhou Mon, 04 Nov 2013 11:20:08 -0800

Hi,

We use SpanNearQueries intensively for proximity searching. However, we are 
confused by two different ways to use them. Could anybody explain in detail 
what we can expect from nested and flattened SpanNearQueries?


We used to build nested SpanNearQueries, but we found that they don't always 
match. We then tried switching to flattened SpanNearQueries, and found that 
those break in some other cases. Below, we include test cases for both 
scenarios.
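
To make the terminology concrete, the sketch below shows the two clause 
structures we mean by "nested" and "flattened". It uses plain strings rather 
than Lucene classes; QueryShapeSketch and the near(...) notation are made up 
for illustration only, and the nested form mirrors how our attached program 
wraps the first term in a single-clause query.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: it only prints the clause structure.
final class QueryShapeSketch {

    // Nested: each new term is paired with the query built so far,
    // starting from a single-clause query around the first term.
    // Returns null for an empty term list.
    static String nested(List<String> terms) {
        String query = null;
        for (String term : terms) {
            query = (query == null)
                    ? "near(" + term + ")"
                    : "near(" + query + ", " + term + ")";
        }
        return query;
    }

    // Flattened: all terms are direct clauses of one query.
    static String flat(List<String> terms) {
        return "near(" + String.join(", ", terms) + ")";
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("force", "teaching", "profession");
        System.out.println(nested(terms)); // near(near(near(force), teaching), profession)
        System.out.println(flat(terms));   // near(force, teaching, profession)
    }
}
```

In the nested form the slop applies at every level; in the flattened form it 
applies once to the whole set of terms.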

Another observation is that some of the failing queries include repeating 
terms. Furthermore, we don't fully understand the concept of span overlaps and 
how they affect searches. Can you shed some light on this?

All examples below use slop=2 and inOrder=false, and we are using Lucene 4.4.0.

Attached is a program that reproduces all the cases described below.


-------------------------------------------

Examples:

1) nested queries:

Context: each query below appears as an exact phrase in a document

a) Failing case: KE : a b c d d b c e 

b) Failing case: KE : one ring to rule them all one ring to find them one ring 
to bring


2) flattened queries:

Context: the phrase "Task Force on Teaching as a Profession" is in a document

a) Failing case: SU: Force Teaching Profession 

b) Working case: SU: Force on Teaching Profession

c) Both cases above work with nested SpanNearQueries
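
To make case 2 concrete, here is a minimal sketch of how we think slop is 
counted for unordered matches. The rule in slopNeeded is an assumption on our 
part (window width minus the summed clause lengths must not exceed slop), and 
SlopSketch is a hypothetical helper that does not call Lucene at all. The 
positions assume every token is kept, since the attached program uses an empty 
stop word set.

```java
// Hypothetical helper, not Lucene code. ASSUMPTION: an unordered near match
// requires (last end - first start) - (sum of clause lengths) <= slop.
final class SlopSketch {

    // Positions left over once the clauses themselves are subtracted from
    // the window that encloses them. Each span is {start, end}, end exclusive.
    static int slopNeeded(int[][] spans) {
        int minStart = Integer.MAX_VALUE;
        int maxEnd = Integer.MIN_VALUE;
        int totalLength = 0;
        for (int[] span : spans) {
            minStart = Math.min(minStart, span[0]);
            maxEnd = Math.max(maxEnd, span[1]);
            totalLength += span[1] - span[0];
        }
        return (maxEnd - minStart) - totalLength;
    }

    public static void main(String[] args) {
        // Positions in "task force on teaching as a profession":
        // task=0 force=1 on=2 teaching=3 as=4 a=5 profession=6
        int[] force = {1, 2}, on = {2, 3}, teaching = {3, 4}, profession = {6, 7};

        // Flattened, case a): window 1..7 minus three terms
        System.out.println(slopNeeded(new int[][] {force, teaching, profession}));     // 3
        // Flattened, case b): window 1..7 minus four terms
        System.out.println(slopNeeded(new int[][] {force, on, teaching, profession})); // 2

        // Nested, case a): the inner (force, teaching) match spans 1..4
        int[] inner = {1, 4};
        System.out.println(slopNeeded(new int[][] {force, teaching}));   // 1
        System.out.println(slopNeeded(new int[][] {inner, profession})); // 2
    }
}
```

Under that assumption, only the flattened form of case a) needs more than 
slop=2, which is consistent with what we observe.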


3) One specific query is interesting and a bit confusing: 

Context: it is a long query that has an exact match in a document. 

TI: A New Congress Should Enforce Accountability Over Abstinence Only Programs 
Non Partisan Report Documents Bush Administration Abstinence Only Programs 
Running Amuck With Over One Billion in Taxpayer Dollars

a) Failing case: the nested query fails for the whole sentence
TI: A New Congress Should Enforce Accountability Over Abstinence Only Programs 
Non Partisan Report Documents Bush Administration Abstinence Only Programs 
Running Amuck With Over One Billion in Taxpayer Dollars

b) Working case: the nested query works for up to 19 terms of the query
TI: A New Congress Should Enforce Accountability Over Abstinence Only Programs 
Non Partisan Report Documents Bush Administration Abstinence Only Programs

c) Failing case: the nested query fails when a term is added at the end of the 
19-term query
TI: A New Congress Should Enforce Accountability Over Abstinence Only Programs 
Non Partisan Report Documents Bush Administration Abstinence Only Programs 
Running 

d) Working case: the nested query works when a term is added at the beginning 
of the 19-term query
TI: And A New Congress Should Enforce Accountability Over Abstinence Only 
Programs Non Partisan Report Documents Bush Administration Abstinence Only 
Programs

e) Working case: the nested query works for the whole sentence when a term is 
added at the beginning of the query
TI: And A New Congress Should Enforce Accountability Over Abstinence Only 
Programs Non Partisan Report Documents Bush Administration Abstinence Only 
Programs Running Amuck With Over One Billion in Taxpayer Dollars

f) Working case: the flattened query works for the whole sentence
TI: A New Congress Should Enforce Accountability Over Abstinence Only Programs 
Non Partisan Report Documents Bush Administration Abstinence Only Programs 
Running Amuck With Over One Billion in Taxpayer Dollars


Thanks.

Jerry

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SpanTest {
	private static IndexSearcher searcher;

	public static void main(String[] args) throws IOException {
		RAMDirectory directory = new RAMDirectory();
		Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44, CharArraySet.EMPTY_SET);
		IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_44, analyzer);
		IndexWriter writer = new IndexWriter(directory, conf);
		
		addDocument(writer, "SU", "Task Force on Teaching as a Profession");
		addDocument(writer, "KE", "to be or not to be");
		addDocument(writer, "TI", "A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Abstinence Only Programs Running Amuck With Over One Billion in Taxpayer Dollars");
		addDocument(writer, "TI", "one ring to rule them all one ring to find them one ring to bring");
		addDocument(writer, "KE", "a b c d d b c e");
		addDocument(writer, "TX", "And A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Only Programs Running Amuck With Over One Billion in Taxpayer Dollars");
		
		writer.commit();
		writer.close();
			
		IndexReader r = DirectoryReader.open(directory);
		searcher = new IndexSearcher(r);

		// Case 1)
		performNestSpanSearch("KE a b c d d b c e ");
		// NOTE: the "one ring" document above is indexed in field TI, but this query searches field KE
		performNestSpanSearch("KE one ring to rule them all one ring to find them one ring to bring");
	
		// Case 2)
		performFlattenSpanSearch("SU Force Teaching Profession");
		performFlattenSpanSearch("SU Force on Teaching Profession");
		
		performNestSpanSearch("SU Force Teaching Profession");
		performNestSpanSearch("SU Force on Teaching Profession");
		
		// Case 3)
		// failing
		performNestSpanSearch("TI A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Abstinence Only Programs Running Amuck With Over One Billion in Taxpayer Dollars");
		// working
		performNestSpanSearch("TI A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Abstinence Only Programs");
		// failing
		performNestSpanSearch("TI A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Abstinence Only Programs Running");
		// Working
		performNestSpanSearch("TX And A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Only Programs");
		performNestSpanSearch("TX And A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Only Programs Running Amuck With Over One Billion in Taxpayer Dollars");
		
		performFlattenSpanSearch("TI A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Abstinence Only Programs Running Amuck With Over One Billion in Taxpayer Dollars");
			
	}

	protected static void performNestSpanSearch(String queryStr) throws IOException {
		Query query = getNestedSpanQuery(queryStr);
		TopDocs topDocs = searcher.search(query, 10);
		if (topDocs.totalHits == 0) { 
			System.out.println("Failing case : " + queryStr);
		} else {
			System.out.println("Working case : " + queryStr);
		}
	}

	protected static void performFlattenSpanSearch(String queryStr) throws IOException {
		SpanQuery[] clauses = getSpans(queryStr);
		Query query = new SpanNearQuery(clauses, 2, false);
		TopDocs topDocs = searcher.search(query, 10);
		if (topDocs.totalHits == 0) { 
			System.out.println("Failing case : " + queryStr);
		} else {
			System.out.println("Working case : " + queryStr);
		}
	}
	
	

	protected static void addDocument(IndexWriter writer, String field, String value) throws IOException {
		Document document = new Document();
		document.add(new Field(field, value, Store.YES, Index.ANALYZED));
		writer.addDocument(document);
	}

	static private SpanNearQuery getNestedSpanQuery(String queryStr) throws IOException {
		String[] splits = queryStr.split(" ");
		String field = splits[0];
		SpanQuery parentClause = null;
		SpanNearQuery spanNearQuery = null;
		for (int i = 1 ; i < splits.length; i++) {
			SpanTermQuery spanTermQuery = new SpanTermQuery(new Term(field, splits[i].toLowerCase()));
			SpanQuery[] clauses = parentClause == null ? new SpanQuery[] { spanTermQuery } : new SpanQuery[] { parentClause, spanTermQuery };
			spanNearQuery = new SpanNearQuery(clauses, 2, false);
			parentClause = spanNearQuery;
		}
		return spanNearQuery;
	}
	
	private static SpanQuery[] getSpans(String queryStr) {
		String[] splits = queryStr.split(" ");
		SpanTermQuery[] spans = new SpanTermQuery[splits.length - 1];
		String field = splits[0];
		for (int i = 1; i < splits.length; i++) {
			spans[i - 1] = new SpanTermQuery(new Term(field, splits[i].toLowerCase()));
		}
		return spans;
	}

}

