Lucene.NET based text triage

2012-08-21 Thread Ilya Zavorin
I have the following task that I need to implement in .NET. I get a block of 
text and need to assess whether this text is mostly readable or a bunch of 
unreadable garbage. This text is generated by processes like OCR. I am not 
looking to detect or correct small errors. Instead, I need to "triage" the text 
block and return TRUE if the whole block is more or less readable (as well as 
searchable etc) or FALSE if it's mostly garbage.

My current plan is to:

1.   Use Lucene.NET to index a large dictionary of English words

2.   Tokenize the text, throwing out stopwords, words shorter than some 
minimum # of chars

3.   Query each token against the index using some sort of fuzzy match that 
would give me not only the closest match to a given token from the dict but 
also the distance

4.   Somehow combine individual distances to come up with a cumulative 
measure for the whole block of text

5.   Compare it against some threshold and return FALSE if the measure is 
above the threshold and TRUE otherwise.
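
Steps 2 through 5 of the plan can be sketched without Lucene at all, which may help when choosing a scoring rule before deciding how to index. The sketch below is plain Java and swaps the index lookup for a brute-force Levenshtein scan over an in-memory dictionary, so it shows the scoring idea only, not Lucene.NET's fuzzy matching. The class name `TextTriage`, the length-normalized distance, the averaging rule and the 0.3 threshold are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TextTriage {
    // Classic dynamic-programming Levenshtein distance, keeping two rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Distance to the closest dictionary word divided by token length, in [0, 1].
    static double nearestNormalizedDistance(String token, Set<String> dictionary) {
        if (dictionary.contains(token)) return 0.0;
        int best = token.length(); // cap: never worse than replacing every character
        for (String word : dictionary) best = Math.min(best, editDistance(token, word));
        return (double) best / token.length();
    }

    // Average normalized distance over the tokens that survive filtering (assumed rule).
    static double garbageScore(String text, Set<String> dictionary,
                               Set<String> stopwords, int minLength) {
        double total = 0.0;
        int counted = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.length() < Math.max(1, minLength) || stopwords.contains(token)) continue;
            total += nearestNormalizedDistance(token, dictionary);
            counted++;
        }
        // A block with no usable tokens is treated as garbage.
        return counted == 0 ? 1.0 : total / counted;
    }

    // TRUE when the block looks readable, FALSE when the score exceeds the threshold.
    static boolean isReadable(String text, Set<String> dictionary, Set<String> stopwords,
                              int minLength, double threshold) {
        return garbageScore(text, dictionary, stopwords, minLength) <= threshold;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
                "quick", "brown", "fox", "jumps", "over", "lazy", "dog"));
        Set<String> stop = new HashSet<>(Arrays.asList("the"));
        System.out.println(isReadable("The quick brown fox jumps over the lazy dog",
                dict, stop, 3, 0.3));
        System.out.println(isReadable("xq zzkv jjw qqpx", dict, stop, 3, 0.3));
    }
}
```

Normalizing by token length keeps long words from dominating the average. With a real dictionary the linear scan would be far too slow, which is where an index with fuzzy lookup would come in.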

Here are some questions:

1.   Is there anything special I need to do during indexing of the 
dictionary to make the fuzzy matching work better?

2.   What sort of fuzzy matching methods are available in Lucene.NET 
querying? Do they return distances for the closest matches? Does the choice of 
a matching method affect how indexing should be done?

3.   Is there a way of running the whole block of text against the index at 
once rather than tokenizing and looping over tokens?

Thanks much,

Ilya Zavorin


Creating Span Queries from Boolean Queries

2012-08-21 Thread Dave Seltzer
Hi Everyone,

Is there a straightforward way to take a Boolean Query created by the
Lucene Query Parser and convert it to a Span Query?

Ideally I'd like to take any ANDed clauses and require them to occur
within $SPAN of the other ANDs.

I can't quite wrap my head around how to solve the problem.

Thanks!

-Dave


Re: Creating Span Queries from Boolean Queries

2012-08-21 Thread Jack Krupansky

Give us an example of what you are really trying to match.

SpanNearQuery takes a list of clauses, which can be SpanTermQuery to match a 
single term or SpanNearQuery to match a nested span. You can specify the 
maximum distance between terms/spans - use nesting if you want to change 
that distance. That gives you a basic BooleanQuery with AND clauses 
converted to spans.


-- Jack Krupansky




Re: Creating Span Queries from Boolean Queries

2012-08-21 Thread Dave Seltzer
Well I was hoping that someone knew of a recursive solution to
rewriting Boolean queries of arbitrary depth.

I suppose if I can rewrite

"london olympics" AND (football OR soccer) NOT nfl

into

"London Olympics" within_5_words_of (football or soccer)
not_within_5_words_of nfl

Then I should be able to use the same logic to operate on a
BooleanQuery occurring within a BooleanClause.
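
The recursion can be sketched on a toy query tree before touching the real Lucene classes. Everything in the block below is hypothetical: `Node`, `Term`, `Phrase` and `Bool` stand in for Lucene's query types, and the output is only a readable string, not a SpanQuery. It also assumes that a group of OR clauses counts as one required member of the enclosing proximity group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ProximityRewrite {
    // Hypothetical mini query tree standing in for Lucene's Query classes.
    interface Node {}

    static final class Term implements Node {
        final String text;
        Term(String text) { this.text = text; }
    }

    static final class Phrase implements Node {
        final List<String> words;
        Phrase(String... words) { this.words = Arrays.asList(words); }
    }

    static final class Bool implements Node {
        final List<Node> must = new ArrayList<>();
        final List<Node> should = new ArrayList<>();
        final List<Node> mustNot = new ArrayList<>();
        Bool must(Node n) { must.add(n); return this; }
        Bool should(Node n) { should.add(n); return this; }
        Bool mustNot(Node n) { mustNot.add(n); return this; }
    }

    static String list(List<String> parts) {
        return "[" + String.join(", ", parts) + "]";
    }

    // Rewrites every child first, then combines the results for this level.
    static String rewrite(Node node, int slop) {
        if (node instanceof Term) return ((Term) node).text;
        if (node instanceof Phrase) return "near(" + list(((Phrase) node).words) + ", 0, ordered)";
        Bool b = (Bool) node;

        List<String> required = new ArrayList<>();
        for (Node n : b.must) required.add(rewrite(n, slop));

        // The OR group, if any, becomes a single member of the proximity group.
        List<String> alternatives = new ArrayList<>();
        for (Node n : b.should) alternatives.add(rewrite(n, slop));
        if (alternatives.size() == 1) required.add(alternatives.get(0));
        else if (alternatives.size() > 1) required.add("or(" + list(alternatives) + ")");

        if (required.isEmpty()) throw new IllegalArgumentException("no positive clause");
        String positive = required.size() == 1
                ? required.get(0)
                : "near(" + list(required) + ", " + slop + ", unordered)";

        if (b.mustNot.isEmpty()) return positive;
        List<String> excluded = new ArrayList<>();
        for (Node n : b.mustNot) excluded.add(rewrite(n, slop));
        String exclusion = excluded.size() == 1 ? excluded.get(0) : "or(" + list(excluded) + ")";
        return "not(" + positive + ", " + exclusion + ")";
    }

    public static void main(String[] args) {
        Node query = new Bool()
                .must(new Phrase("london", "olympics"))
                .must(new Bool().should(new Term("football")).should(new Term("soccer")))
                .mustNot(new Term("nfl"));
        System.out.println(rewrite(query, 5));
    }
}
```

Because `rewrite` calls itself on each child before combining, a Bool nested at any depth is handled by the same three cases.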

-D

On Tue, Aug 21, 2012 at 7:26 PM, Jack Krupansky wrote:

> Give us an example of what you are really trying to match.
>
> SpanNearQuery takes a list of clauses, which can be SpanTermQuery to match
> a single term or SpanNearQuery to match a nested span. You can specify the
> maximum distance between terms/spans - use nesting if you want to change
> that distance. That gives you a basic BooleanQuery with AND clauses
> converted to spans.
>
> -- Jack Krupansky


-- 
Dave Seltzer 
Chief Systems Architect
TVEyes
(203) 254-3600 x222


Re: Creating Span Queries from Boolean Queries

2012-08-21 Thread Dave Seltzer
So I've taken my first shot at solving my problem using the three functions
below.

When I set the slop to 10 it produces the following result:
This BooleanQuery +content:"london olympics" +(+content:football
+content:or +content:soccer) -content:nfl

becomes this SpanQuery: spanNot(spanNear([spanNear([content:london,
content:olympics], 0, true), spanNear([content:football, content:or,
content:soccer], 10, false)], 10, false), spanOr([content:nfl]))

Right now I've implemented TermQuery, PhraseQuery and BooleanQuery.

Is there a list of the query types that the Lucene Query Parser can
produce? Any thoughts on how I should implement Wildcard queries?

Thanks!

-Dave


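import java.util.ArrayList;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

//returns null for any query type that is not supported yet
public static SpanQuery ConvertQuery(Query input, int slop) {
    SpanQuery convertedQuery = null;
    if (input instanceof TermQuery) {
        //support for term query
        convertedQuery = new SpanTermQuery(((TermQuery) input).getTerm());
    } else if (input instanceof PhraseQuery) {
        //support for phrase query
        convertedQuery = ConvertPhraseQueryToSpanQuery((PhraseQuery) input);
    } else if (input instanceof BooleanQuery) {
        //support for nested boolean query
        convertedQuery = ConvertBooleanQuery((BooleanQuery) input, slop);
    }
    return convertedQuery;
}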


public static SpanQuery ConvertPhraseQueryToSpanQuery(PhraseQuery input) {
    //wrap each term of the phrase in a SpanTermQuery
    ArrayList<SpanQuery> terms = new ArrayList<SpanQuery>();
    for (Term t : input.getTerms()) {
        terms.add(new SpanTermQuery(t));
    }
    //slop 0 with inOrder=true requires the terms to be adjacent and in order
    return new SpanNearQuery(terms.toArray(new SpanQuery[terms.size()]), 0, true);
}

public static SpanQuery ConvertBooleanQuery(BooleanQuery input, int slop) {
    ArrayList<SpanQuery> andClauses = new ArrayList<SpanQuery>();
    ArrayList<SpanQuery> orClauses = new ArrayList<SpanQuery>();
    ArrayList<SpanQuery> notClauses = new ArrayList<SpanQuery>();
    SpanQuery retval = null;

    //iterate through the child clauses, converting each one before assembling them
    for (BooleanClause clause : input.clauses()) {
        SpanQuery convertedQuery = ConvertQuery(clause.getQuery(), slop);
        //unsupported clause types convert to null and are skipped
        if (convertedQuery != null) {
            if (clause.getOccur() == BooleanClause.Occur.MUST) {
                andClauses.add(convertedQuery);
            } else if (clause.getOccur() == BooleanClause.Occur.SHOULD) {
                orClauses.add(convertedQuery);
            } else if (clause.getOccur() == BooleanClause.Occur.MUST_NOT) {
                notClauses.add(convertedQuery);
            }
        }
    }

    //alright, now let's assemble the clauses that we've collected for this query
    SpanQuery andSpans = null;
    SpanQuery orSpans = null;
    SpanQuery notSpans = null;

    //if there are no ANDs and no ORs then we'll return null
    if (andClauses.size() + orClauses.size() == 0) {
        return null;
    }

    if (andClauses.size() > 0) {
        if (andClauses.size() > 1) {
            andSpans = new SpanNearQuery(
                    andClauses.toArray(new SpanQuery[andClauses.size()]), slop, false);
        } else {
            andSpans = andClauses.get(0);
        }
    }
    if (orClauses.size() > 0) {
        orSpans = new SpanOrQuery(orClauses.toArray(new SpanQuery[orClauses.size()]));
    }
    if (notClauses.size() > 0) {
        notSpans = new SpanOrQuery(notClauses.toArray(new SpanQuery[notClauses.size()]));
    }

    //build an intermediate query using the above clauses
    SpanQuery intermediateQuery = null;
    if (andClauses.size() > 0 && orClauses.size() == 0) {
        intermediateQuery = andSpans;
    } else if (orClauses.size() > 0 && andClauses.size() == 0) {
        intermediateQuery = orSpans;
    } else {
        //both ANDs and ORs are present: require them within slop of each other
        intermediateQuery = new SpanNearQuery(
                new SpanQuery[]{andSpans, orSpans}, slop, false);
    }

    //if we have any NOT queries append them to the end
    if (notClauses.size() > 0) {
        retval = new SpanNotQuery(intermediateQuery, notSpans);
    } else {
        retval = intermediateQuery;
    }
    return retval;
}