Lucene.NET based text triage

2012-08-21 Thread Ilya Zavorin
I have the following task that I need to implement in .NET. I get a block of text and need to assess whether this text is mostly readable or a bunch of unreadable garbage. This text is generated by processes like OCR. I am not looking to detect or correct small errors. Instead, I need to "triage
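One rough way to frame this kind of triage is to score a block of text by the fraction of its tokens that appear in a known word list, then compare that score to a cutoff. The sketch below is only an illustration under stated assumptions: `KNOWN_WORDS`, `readable_fraction`, `is_mostly_readable`, and the 0.5 threshold are all hypothetical, none of it is a Lucene.NET API, and a real system would load a full dictionary for the language in question.

```python
import re

# Hypothetical, tiny word list for illustration only; a real triage step
# would load a full dictionary for the expected language.
KNOWN_WORDS = {
    "the", "a", "of", "and", "to", "in", "is", "this", "text",
    "was", "by", "quick", "brown", "fox",
}

TOKEN_RE = re.compile(r"\S+")  # whitespace-delimited tokens


def readable_fraction(text, known_words=KNOWN_WORDS):
    """Return the fraction of tokens that are known words (0.0 for empty text)."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        return 0.0
    hits = 0
    for tok in tokens:
        # Drop surrounding punctuation and compare case-insensitively.
        word = tok.strip(".,;:!?\"'()[]").lower()
        if word in known_words:
            hits += 1
    return hits / len(tokens)


def is_mostly_readable(text, threshold=0.5, known_words=KNOWN_WORDS):
    """Coarse triage: True when at least `threshold` of the tokens are known words."""
    return readable_fraction(text, known_words) >= threshold
```

The threshold is a tuning knob rather than a recommendation; it would have to be calibrated on samples of good and bad OCR output.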

Creating Span Queries from Boolean Queries

2012-08-21 Thread Dave Seltzer
Hi Everyone, Is there a straightforward way to take a Boolean Query created by the Lucene Query Parser and convert it to a Span Query? Ideally I'd like to take any ANDed clauses and require them to occur within $SPAN of the other ANDs. I can't quite wrap my head around how to solve the prob

Re: Creating Span Queries from Boolean Queries

2012-08-21 Thread Jack Krupansky
Give us an example of what you are really trying to match. SpanNearQuery takes a list of clauses, which can be SpanTermQuery to match a single term or SpanNearQuery to match a nested span. You can specify the maximum distance between terms/spans - use nesting if you want to change that distanc
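To make the "maximum distance" idea concrete, here is a simplified model of unordered near matching over single terms: a set of terms matches when one occurrence of each can be chosen so that the window covering them contains at most `slop` positions not occupied by the chosen terms. This is a plain Python sketch, not Lucene's implementation; `positions` and `near_match` are hypothetical helper names, and nested spans and ordered matching are left out.

```python
from itertools import product


def positions(tokens, term):
    """All token positions at which `term` occurs."""
    return [i for i, t in enumerate(tokens) if t == term]


def near_match(tokens, terms, slop):
    """True if one occurrence of each term fits in a window whose number of
    positions not taken by the chosen terms is at most `slop` (unordered)."""
    pos_lists = [positions(tokens, t) for t in terms]
    if any(not p for p in pos_lists):
        return False  # some term never occurs
    for combo in product(*pos_lists):
        if len(set(combo)) < len(combo):
            continue  # the same position cannot serve two terms
        width = max(combo) - min(combo) + 1
        if width - len(terms) <= slop:
            return True
    return False
```

For the token list `the london olympics had great football`, the terms `london` and `football` sit at positions 1 and 5, so three other positions lie between them: the pair matches with a slop of 3 and fails with a slop of 2.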

Re: Creating Span Queries from Boolean Queries

2012-08-21 Thread Dave Seltzer
Well, I was hoping that someone knew of a recursive solution for rewriting Boolean queries of arbitrary depth. I suppose if I can rewrite "london olympics" AND (football OR soccer) NOT nfl into "London Olympics" within_5_words_of (football or soccer) not_within_5_words_of nfl then I should be ab
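A recursive rewrite of this kind can be sketched on a toy query tree. The classes `Term`, `Phrase`, and `Bool` below are hypothetical stand-ins, not Lucene types, and `to_span` only builds a string loosely modeled on the `spanNear(...)` / `spanNot(...)` notation. The mapping it assumes is: required clauses become an unordered near with the given slop, optional-only groups become an or, and prohibited clauses wrap the result in a not. Whether a plain span "not" really expresses "not within N words" is a separate question that this sketch does not settle.

```python
from dataclasses import dataclass


@dataclass
class Term:            # hypothetical stand-in for a single-term query
    field: str
    text: str


@dataclass
class Phrase:          # hypothetical stand-in for an exact phrase
    field: str
    words: list


@dataclass
class Bool:            # clauses: list of (occur, query) pairs,
    clauses: list      # occur is "MUST", "SHOULD" or "MUST_NOT"


def _join(op, parts):
    """A single part stays as-is; several parts are wrapped in op([...])."""
    return parts[0] if len(parts) == 1 else f"{op}([{', '.join(parts)}])"


def to_span(q, slop):
    """Recursively rewrite the toy query tree into span-style notation."""
    if isinstance(q, Term):
        return f"{q.field}:{q.text}"
    if isinstance(q, Phrase):
        inner = ", ".join(f"{q.field}:{w}" for w in q.words)
        return f"spanNear([{inner}], 0, true)"   # exact, in-order phrase
    must = [to_span(c, slop) for occ, c in q.clauses if occ == "MUST"]
    should = [to_span(c, slop) for occ, c in q.clauses if occ == "SHOULD"]
    must_not = [to_span(c, slop) for occ, c in q.clauses if occ == "MUST_NOT"]
    if must:
        # Assumption: optional clauses next to required ones are dropped,
        # since they do not affect whether a document matches.
        if len(must) == 1:
            include = must[0]
        else:
            include = f"spanNear([{', '.join(must)}], {slop}, false)"
    elif should:
        include = _join("spanOr", should)
    else:
        raise ValueError("a purely negative query has nothing to match")
    if must_not:
        return f"spanNot({include}, {_join('spanOr', must_not)})"
    return include
```

Applied to the example query with a slop of 5, the tree `Bool([MUST phrase "london olympics", MUST Bool([SHOULD football, SHOULD soccer]), MUST_NOT nfl])` comes out as a `spanNot` whose include side is a `spanNear` over the phrase and a `spanOr` of the two alternatives.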

Re: Creating Span Queries from Boolean Queries

2012-08-21 Thread Dave Seltzer
So I've taken my first shot at solving my problem using the three functions below. When I set the slop to 10 it produces the following result: This BooleanQuery +content:"london olympics" +(+content:football +content:or +content:soccer) -content:nfl becomes this SpanQuery: spanNot(spanNear([spanN