Hi,
We use SpanNearQueries intensively for proximity searching. However, we are
confused by the two different ways of building them. Could anybody explain in
detail what we should expect from nested versus flattened SpanNearQueries?
We used to build nested SpanNearQueries. However, we found that nested
SpanNearQueries don't always match. We then tried switching to flattened
SpanNearQueries, and found that those break in some other cases. Below,
we're including some test cases for both scenarios.
Another observation is that some of the failing queries include repeating
terms. Furthermore, we don't fully understand the concept of span overlaps and
how they affect searches. Can you shed some light on this?
All examples below use slop=2 and inOrder=false, and we are using Lucene 4.4.0.
Attached is a program that shows all the cases described below.
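
To be explicit about what we mean by "nested" and "flattened", here is a small sketch that only prints the shape of the two query structures. It does not use Lucene; the class name QueryShapes and the near(...) notation are our own illustration. The nested form mirrors the loop in the attached getNestedSpanQuery, which wraps the first term in a one-clause near query.

```java
import java.util.Arrays;
import java.util.List;

public class QueryShapes {
    // Flattened form: every term is a direct clause of one near query.
    static String flat(List<String> terms) {
        return "near(" + String.join(", ", terms) + ")";
    }

    // Nested form: a left-deep chain where each new term is combined
    // with the near query built from all the previous terms.
    static String nested(List<String> terms) {
        String parent = null;
        for (String term : terms) {
            parent = (parent == null)
                    ? "near(" + term + ")"
                    : "near(" + parent + ", " + term + ")";
        }
        return parent;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("force", "teaching", "profession");
        System.out.println(flat(terms));   // near(force, teaching, profession)
        System.out.println(nested(terms)); // near(near(near(force), teaching), profession)
    }
}
```

In the real program every near(...) above is a SpanNearQuery with slop=2 and inOrder=false, so in the nested form the slop applies at each level rather than once across all terms.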
-------------------------------------------
Examples:
1) Nested queries:
Context: each query below appears as an exact phrase in a document
a) Failing case: KE : a b c d d b c e
b) Failing case: KE : one ring to rule them all one ring to find them one ring
to bring
2) Flattened queries:
Context: the phrase "Task Force on Teaching as a Profession" is in a document
a) Failing case: SU: Force Teaching Profession
b) Working case: SU: Force on Teaching Profession
c) Both of the above cases work with nested SpanNearQueries
3) One specific query is interesting and a bit confusing:
Context: it is a long query for which there is an exact match in a document.
TI: A New Congress Should Enforce Accountability Over Abstinence Only Programs
Non Partisan Report Documents Bush Administration Abstinence Only Programs
Running Amuck With Over One Billion in Taxpayer Dollars
a) Failing case: as a nested query, it fails for the whole sentence
TI: A New Congress Should Enforce Accountability Over Abstinence Only Programs
Non Partisan Report Documents Bush Administration Abstinence Only Programs
Running Amuck With Over One Billion in Taxpayer Dollars
b) Working case: as a nested query, it works for up to the first 19 terms of the query
TI: A New Congress Should Enforce Accountability Over Abstinence Only Programs
Non Partisan Report Documents Bush Administration Abstinence Only Programs
c) Failing case: as a nested query, adding a term at the end of the 19-term
query makes it fail
TI: A New Congress Should Enforce Accountability Over Abstinence Only Programs
Non Partisan Report Documents Bush Administration Abstinence Only Programs
Running
d) Working case: as a nested query, adding a term at the beginning of the
19-term query, it works
TI: And A New Congress Should Enforce Accountability Over Abstinence Only
Programs Non Partisan Report Documents Bush Administration Abstinence Only
Programs
e) Working case: as a nested query, adding a term at the beginning of the
query, it works for the whole sentence
TI: And A New Congress Should Enforce Accountability Over Abstinence Only
Programs Non Partisan Report Documents Bush Administration Abstinence Only
Programs Running Amuck With Over One Billion in Taxpayer Dollars
f) Working case: with the flattened structure, it works for the whole sentence
TI: A New Congress Should Enforce Accountability Over Abstinence Only Programs
Non Partisan Report Documents Bush Administration Abstinence Only Programs
Running Amuck With Over One Billion in Taxpayer Dollars
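
To make our current understanding of case 2 concrete, here is a small sketch of the slop arithmetic as we believe it works for inOrder=false. It does not use Lucene. The rule in gap() (width of the enclosing span minus the summed lengths of the clauses, compared against the slop) is our assumption from reading the unordered near spans code, not something we have confirmed, and the names SlopSketch, Span, term, enclose and gap are ours.

```java
import java.util.Arrays;
import java.util.List;

public class SlopSketch {
    /** Covers positions [start, end); a single term at position p is [p, p + 1). */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    static Span term(int position) {
        return new Span(position, position + 1);
    }

    /** Smallest span enclosing all clauses; assumed to be what a matching near query exposes to its parent. */
    static Span enclose(List<Span> clauses) {
        int minStart = Integer.MAX_VALUE;
        int maxEnd = Integer.MIN_VALUE;
        for (Span s : clauses) {
            minStart = Math.min(minStart, s.start);
            maxEnd = Math.max(maxEnd, s.end);
        }
        return new Span(minStart, maxEnd);
    }

    /** ASSUMED unordered rule: enclosing width minus summed clause lengths; a match needs gap <= slop. */
    static int gap(List<Span> clauses) {
        Span outer = enclose(clauses);
        int total = 0;
        for (Span s : clauses) {
            total += s.end - s.start;
        }
        return (outer.end - outer.start) - total;
    }

    public static void main(String[] args) {
        // "Task Force on Teaching as a Profession" with no stop words removed:
        // task=0 force=1 on=2 teaching=3 as=4 a=5 profession=6
        Span force = term(1), on = term(2), teaching = term(3), profession = term(6);

        // Flattened queries: one near query over all terms.
        System.out.println(gap(Arrays.asList(force, teaching, profession)));     // 3
        System.out.println(gap(Arrays.asList(force, on, teaching, profession))); // 2

        // Nested query for "Force Teaching Profession": the slop is checked per level.
        Span inner = enclose(Arrays.asList(force, teaching));
        System.out.println(gap(Arrays.asList(force, teaching)));   // 1
        System.out.println(gap(Arrays.asList(inner, profession))); // 2
    }
}
```

Under this model the flattened query in 2a would need slop=3, which agrees with it failing at slop=2, while 2b and the nested form stay within slop=2 at every level. The same arithmetic does not explain the failures in cases 1 and 3, which is why we suspect that repeated terms and span overlaps are involved there.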
Thanks.
Jerry
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
public class SpanTest {
    private static boolean flatSpanNear = true;
    private static IndexSearcher searcher;

    public static void main(String[] args) throws IOException {
        RAMDirectory directory = new RAMDirectory();
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44, CharArraySet.EMPTY_SET);
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_44, analyzer);
        IndexWriter writer = new IndexWriter(directory, conf);
        addDocument(writer, "SU", "Task Force on Teaching as a Profession");
        addDocument(writer, "KE", "to be or not to be");
        addDocument(writer, "TI", "A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Abstinence Only Programs Running Amuck With Over One Billion in Taxpayer Dollars");
        addDocument(writer, "KE", "one ring to rule them all one ring to find them one ring to bring");
        addDocument(writer, "KE", "a b c d d b c e");
        addDocument(writer, "TX", "And A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Only Programs Running Amuck With Over One Billion in Taxpayer Dollars");
        writer.commit();
        writer.close();
        IndexReader r = DirectoryReader.open(directory);
        searcher = new IndexSearcher(r);

        // Case 1)
        performNestSpanSearch("KE a b c d d b c e ");
        performNestSpanSearch("KE one ring to rule them all one ring to find them one ring to bring");

        // Case 2)
        performFlattenSpanSearch("SU Force Teaching Profession");
        performFlattenSpanSearch("SU Force on Teaching Profession");
        performNestSpanSearch("SU Force Teaching Profession");
        performNestSpanSearch("SU Force on Teaching Profession");

        // Case 3)
        // failing
        performNestSpanSearch("TI A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Abstinence Only Programs Running Amuck With Over One Billion in Taxpayer Dollars");
        // working
        performNestSpanSearch("TI A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Abstinence Only Programs");
        // failing
        performNestSpanSearch("TI A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Abstinence Only Programs Running");
        // working
        performNestSpanSearch("TX And A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Only Programs");
        performNestSpanSearch("TX And A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Only Programs Running Amuck With Over One Billion in Taxpayer Dollars");
        performFlattenSpanSearch("TI A New Congress Should Enforce Accountability Over Abstinence Only Programs Non Partisan Report Documents Bush Administration Abstinence Only Programs Running Amuck With Over One Billion in Taxpayer Dollars");
    }

    // Runs the query as a left-deep chain of nested SpanNearQueries and reports whether anything matched.
    protected static void performNestSpanSearch(String queryStr) throws IOException {
        Query query = getNestedSpanQuery(queryStr);
        TopDocs topDocs = searcher.search(query, 10);
        if (topDocs.totalHits == 0) {
            System.out.println("Failing case : " + queryStr);
        } else {
            System.out.println("Working case : " + queryStr);
        }
    }

    // Runs the query as a single SpanNearQuery over all terms and reports whether anything matched.
    protected static void performFlattenSpanSearch(String queryStr) throws IOException {
        SpanQuery[] clauses = getSpans(queryStr);
        Query query = new SpanNearQuery(clauses, 2, false);
        TopDocs topDocs = searcher.search(query, 10);
        if (topDocs.totalHits == 0) {
            System.out.println("Failing case : " + queryStr);
        } else {
            System.out.println("Working case : " + queryStr);
        }
    }

    protected static void addDocument(IndexWriter writer, String field, String value) throws IOException {
        Document document = new Document();
        document.add(new Field(field, value, Store.YES, Index.ANALYZED));
        writer.addDocument(document);
    }

    // The first token of queryStr is the field name; the remaining tokens are the terms.
    // Each term is combined with the SpanNearQuery built from all previous terms.
    private static SpanNearQuery getNestedSpanQuery(String queryStr) throws IOException {
        String[] splits = queryStr.split(" ");
        String field = splits[0];
        SpanQuery parentClause = null;
        SpanNearQuery spanNearQuery = null;
        for (int i = 1; i < splits.length; i++) {
            SpanTermQuery spanTermQuery = new SpanTermQuery(new Term(field, splits[i].toLowerCase()));
            SpanQuery[] clauses = parentClause == null
                    ? new SpanQuery[] { spanTermQuery }
                    : new SpanQuery[] { parentClause, spanTermQuery };
            spanNearQuery = new SpanNearQuery(clauses, 2, false);
            parentClause = spanNearQuery;
        }
        return spanNearQuery;
    }

    // The first token of queryStr is the field name; every other token becomes one SpanTermQuery.
    private static SpanQuery[] getSpans(String queryStr) {
        String[] splits = queryStr.split(" ");
        SpanTermQuery[] spans = new SpanTermQuery[splits.length - 1];
        String field = splits[0];
        for (int i = 1; i < splits.length; i++) {
            spans[i - 1] = new SpanTermQuery(new Term(field, splits[i].toLowerCase()));
        }
        return spans;
    }
}