Lucene.NET based text triage
I have the following task that I need to implement in .NET. I get a block of text and need to assess whether this text is mostly readable or a bunch of unreadable garbage. This text is generated by processes like OCR. I am not looking to detect or correct small errors. Instead, I need to "triage" the text block and return TRUE if the whole block is more or less readable (as well as searchable etc) or FALSE if it's mostly garbage.

My current plan is to:

1. Use Lucene.NET to index a large dictionary of English words
2. Tokenize the text, throwing out stopwords, words shorter than some minimum # of chars
3. Query each token against the index using some sort of fuzzy match that would give me not only the closest match to a given token from the dict but also the distance
4. Somehow combine individual distances to come up with a cumulative measure for the whole block of text
5. Compare it against some threshold and return FALSE if the measure is above the threshold and TRUE otherwise.

Here are some questions:

1. Is there anything special I need to do during indexing of the dictionary to make the fuzzy matching work better?
2. What sort of fuzzy matching methods are available in Lucene.NET querying? Do they return distances for the closest matches? Does the choice of a matching method affect how indexing should be done?
3. Is there a way of running the whole block of text against the index at once rather than tokenizing and looping over tokens?

Thanks much,

Ilya Zavorin
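For illustration, steps 2 to 5 of the plan above can be sketched without any index at all. The sketch below is plain Java and does not use Lucene or Lucene.NET: the class name TextTriage, the tiny in-memory dictionary, the stopword list, the normalized edit distance, and the averaging rule are all assumptions made for the example. In a real setup the dictionary lookup would presumably be replaced by a fuzzy query against the index.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: none of these names come from Lucene or from the original post.
final class TextTriage {
    // Tiny stand-in for the "large dictionary of English words" (step 1).
    static final Set<String> DICTIONARY = new HashSet<>(Arrays.asList(
            "quick", "brown", "jumps", "over", "lazy", "fox", "dog"));
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "and", "of", "a"));

    // Levenshtein distance computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Step 3: distance to the closest dictionary word, scaled to the range [0, 1].
    static double closestDistance(String token) {
        double best = 1.0;
        for (String word : DICTIONARY) {
            int longer = Math.max(token.length(), word.length());
            best = Math.min(best, (double) editDistance(token, word) / longer);
        }
        return best;
    }

    // Steps 2 and 4: tokenize, drop stopwords and short tokens, average the distances.
    // Text with no usable tokens is scored as pure garbage (1.0), an arbitrary choice.
    static double garbageScore(String text, int minLength) {
        double total = 0;
        int counted = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.length() < minLength || STOPWORDS.contains(token)) continue;
            total += closestDistance(token);
            counted++;
        }
        return counted == 0 ? 1.0 : total / counted;
    }

    // Step 5: TRUE when the averaged distance stays at or below the threshold.
    static boolean isReadable(String text, int minLength, double threshold) {
        return garbageScore(text, minLength) <= threshold;
    }
}
```

A call such as TextTriage.isReadable(block, 3, 0.3) would then play the role of the TRUE/FALSE triage. The minimum length of 3 and the threshold of 0.3 are placeholders that would need tuning on real OCR output.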
Creating Span Queries from Boolean Queries
Hi Everyone,

Is there a straightforward way to take a Boolean Query created by the Lucene Query Parser and convert it to a Span Query?

Ideally I'd like to take any ANDed clauses and require them to occur within $SPAN of the other ANDs.

I can't quite wrap my head around how to solve the problem.

Thanks!

-Dave
Re: Creating Span Queries from Boolean Queries
Give us an example of what you are really trying to match.

SpanNearQuery takes a list of clauses, which can be SpanTermQuery to match a single term or SpanNearQuery to match a nested span. You can specify the maximum distance between terms/spans - use nesting if you want to change that distance. That gives you a basic BooleanQuery with AND clauses converted to spans.

-- Jack Krupansky

-----Original Message-----
From: Dave Seltzer
Sent: Tuesday, August 21, 2012 6:53 PM
To: java-user@lucene.apache.org
Subject: Creating Span Queries from Boolean Queries
Re: Creating Span Queries from Boolean Queries
Well, I was hoping that someone knew of a recursive solution to rewriting Boolean queries of arbitrary depth.

I suppose if I can rewrite

    "london olympics" AND (football OR soccer) NOT nfl

into

    "London Olympics" within_5_words_of (football or soccer) not_within_5_words_of nfl

then I should be able to use the same logic to operate on a BooleanQuery occurring within a BooleanClause.

-D

On Tue, Aug 21, 2012 at 7:26 PM, Jack Krupansky wrote:
> Give us an example of what you are really trying to match.

--
Dave Seltzer
Chief Systems Architect
TVEyes
(203) 254-3600 x222
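The recursion described above can be illustrated on a toy query tree before touching any Lucene classes. The sketch below is plain Java with made-up types (Node, Clause, Occur, and the class name SpanRewriteSketch are all hypothetical and only mimic the shape of BooleanQuery and BooleanClause); it produces the within_N_words_of notation from the example rather than real span queries.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are invented for illustration and are not Lucene classes.
final class SpanRewriteSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // A node is either a leaf (text != null) or a group of child clauses.
    static final class Node {
        final String text;
        final List<Clause> clauses = new ArrayList<>();
        Node(String text) { this.text = text; }
        Node add(Occur occur, Node child) {
            clauses.add(new Clause(occur, child));
            return this;
        }
    }

    static final class Clause {
        final Occur occur;
        final Node node;
        Clause(Occur occur, Node node) { this.occur = occur; this.node = node; }
    }

    // Rewrites a node into the proximity notation, recursing into nested groups.
    // A group with only MUST_NOT children is not handled sensibly in this sketch.
    static String rewrite(Node node, int slop) {
        if (node.text != null) return node.text;          // leaf: term or phrase

        List<String> must = new ArrayList<>();
        List<String> should = new ArrayList<>();
        List<String> mustNot = new ArrayList<>();
        for (Clause clause : node.clauses) {
            String rewritten = rewrite(clause.node, slop); // same logic one level down
            switch (clause.occur) {
                case MUST:     must.add(rewritten);    break;
                case SHOULD:   should.add(rewritten);  break;
                case MUST_NOT: mustNot.add(rewritten); break;
            }
        }

        // Required parts come first; optional parts are grouped into a single "or".
        List<String> parts = new ArrayList<>(must);
        if (should.size() == 1) {
            parts.add(should.get(0));
        } else if (should.size() > 1) {
            parts.add("(" + String.join(" or ", should) + ")");
        }

        StringBuilder result = new StringBuilder(
                String.join(" within_" + slop + "_words_of ", parts));
        for (String excluded : mustNot) {
            result.append(" not_within_").append(slop).append("_words_of ").append(excluded);
        }
        return result.toString();
    }
}
```

Because rewrite calls itself on every child before assembling the parent, nesting depth does not matter; the same structure carries over when the string building is swapped for span query construction.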
Re: Creating Span Queries from Boolean Queries
So I've taken my first shot at solving my problem using the three functions below. When I set the slop to 10 it produces the following result:

This BooleanQuery

    +content:"london olympics" +(+content:football +content:or +content:soccer) -content:nfl

becomes this SpanQuery:

    spanNot(spanNear([spanNear([content:london, content:olympics], 0, true), spanNear([content:football, content:or, content:soccer], 10, false)], 10, false), spanOr([content:nfl]))

Right now I've implemented TermQuery, PhraseQuery and BooleanQuery. Is there a list of queries that could be produced using the Lucene Query Parser? Any thoughts on how I should implement Wildcard queries?

Thanks!

-Dave

    import java.util.ArrayList;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanNotQuery;
    import org.apache.lucene.search.spans.SpanOrQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    //dispatches on the query type; returns null for unsupported query types
    public static SpanQuery ConvertQuery(Query input, int slop) {
        SpanQuery convertedQuery = null;
        if (input instanceof TermQuery) {
            //support for term query
            convertedQuery = new SpanTermQuery(((TermQuery) input).getTerm());
        } else if (input instanceof PhraseQuery) {
            //support for phrase query
            convertedQuery = ConvertPhraseQueryToSpanQuery((PhraseQuery) input);
        } else if (input instanceof BooleanQuery) {
            //support for nested boolean query
            convertedQuery = ConvertBooleanQuery((BooleanQuery) input, slop);
        }
        return convertedQuery;
    }

    //a phrase becomes an in-order SpanNearQuery with slop 0
    public static SpanQuery ConvertPhraseQueryToSpanQuery(PhraseQuery input) {
        ArrayList<SpanQuery> terms = new ArrayList<SpanQuery>();
        for (Term t : input.getTerms()) {
            terms.add(new SpanTermQuery(t));
        }
        return new SpanNearQuery(terms.toArray(new SpanQuery[terms.size()]), 0, true);
    }

    public static SpanQuery ConvertBooleanQuery(BooleanQuery input, int slop) {
        ArrayList<SpanQuery> andClauses = new ArrayList<SpanQuery>();
        ArrayList<SpanQuery> orClauses = new ArrayList<SpanQuery>();
        ArrayList<SpanQuery> notClauses = new ArrayList<SpanQuery>();

        //iterate through the child clauses, converting each one (recursively)
        for (BooleanClause clause : input.clauses()) {
            SpanQuery convertedQuery = ConvertQuery(clause.getQuery(), slop);
            if (convertedQuery != null) {
                if (clause.getOccur() == BooleanClause.Occur.MUST) {
                    andClauses.add(convertedQuery);
                } else if (clause.getOccur() == BooleanClause.Occur.SHOULD) {
                    orClauses.add(convertedQuery);
                } else if (clause.getOccur() == BooleanClause.Occur.MUST_NOT) {
                    notClauses.add(convertedQuery);
                }
            }
        }

        //alright, now let's assemble the clauses that we've collected for this query
        SpanQuery andSpans = null;
        SpanQuery orSpans = null;
        SpanQuery notSpans = null;

        //if there are no ANDs and no ORs then we'll return null
        if (andClauses.size() + orClauses.size() == 0) return null;

        if (andClauses.size() > 1) {
            andSpans = new SpanNearQuery(andClauses.toArray(new SpanQuery[andClauses.size()]), slop, false);
        } else if (andClauses.size() == 1) {
            andSpans = andClauses.get(0);
        }
        if (orClauses.size() > 0) {
            orSpans = new SpanOrQuery(orClauses.toArray(new SpanQuery[orClauses.size()]));
        }
        if (notClauses.size() > 0) {
            notSpans = new SpanOrQuery(notClauses.toArray(new SpanQuery[notClauses.size()]));
        }

        //build an intermediate query using the above clauses
        SpanQuery intermediateQuery;
        if (orSpans == null) {
            intermediateQuery = andSpans;
        } else if (andSpans == null) {
            intermediateQuery = orSpans;
        } else {
            //both present: note that this makes the ORed group required near the ANDs
            intermediateQuery = new SpanNearQuery(new SpanQuery[]{andSpans, orSpans}, slop, false);
        }

        //if we have any NOT queries append them to the end
        if (notSpans != null) {
            return new SpanNotQuery(intermediateQuery, notSpans);
        }
        return intermediateQuery;
    }