[ 
https://issues.apache.org/jira/browse/LUCENE-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated LUCENE-5205:
--------------------------------

    Attachment: LUCENE-5205-cleanup-tests.patch

This patch focuses on reducing code duplication in the test cases.  Reducing 
duplication in the main code will be a separate patch.

1) TestSpanQPBasedOnQPTestBase (renamed to TestQPTestBaseSpanQuery) now 
subclasses QueryParserTestBase.
  There were a few handfuls of tests that I couldn't easily modify; most of 
these were string-equality tests for complex queries. The CJK examples, where 
complex truth queries were built programmatically, also had to be rewritten 
for SpanQueries. Those tests now live in testParserSpecificQuery(). The 
solution is not elegant, and I don't like the name.  It would be slightly 
cleaner to move those handfuls of tests up into TestQueryParser, but I want 
them to be available to the other subclasses of QueryParserTestBase.

2) TestComplexPhraseSpanQuery now subclasses TestComplexPhraseQuery.  Again, 
I had to add a testParserSpecificSyntax.  In this case, however, the syntax 
differs slightly between the two parsers.
 
3) TestMultiAnalyzer (renamed to TestMultiAnalyzerSpanQuery) now subclasses 
TestMultiAnalyzer.  These tests were mostly string equality tests on complex 
queries, and I couldn't easily keep any of the tests. 

For the above, I'm sure there are more elegant solutions, but this is where the 
code is for now.
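
As an illustration of the arrangement described in 1) through 3), here is a minimal, self-contained sketch of a shared test base with a parser-specific hook. It is not code from the patch: the class names, the parse() method, and the string forms it returns are all invented for illustration, and no Lucene or JUnit classes are used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a shared base such as QueryParserTestBase.
abstract class ParserTestBaseSketch {
    // Each subclass supplies its own "parser"; here it just maps a query
    // string to an invented string form of the parsed query.
    abstract String parse(String query);

    // Hook for checks whose expected output depends on the concrete parser
    // (the role played by testParserSpecificQuery() in the text above).
    abstract void checkParserSpecificQuery(List<String> failures);

    // Shared checks run for every subclass, followed by the hook.
    final List<String> runAll() {
        List<String> failures = new ArrayList<>();
        if (!parse("").isEmpty()) {
            failures.add("empty query should parse to an empty string");
        }
        checkParserSpecificQuery(failures);
        return failures;
    }
}

// Invented "classic" flavor: its string-equality check lives in the hook.
class ClassicSketchTest extends ParserTestBaseSketch {
    @Override
    String parse(String query) {
        return query.isEmpty() ? "" : "field:" + query;
    }

    @Override
    void checkParserSpecificQuery(List<String> failures) {
        if (!parse("test").equals("field:test")) {
            failures.add("classic: unexpected string form");
        }
    }
}

// Invented "span" flavor: same shared checks, different expected string.
class SpanSketchTest extends ParserTestBaseSketch {
    @Override
    String parse(String query) {
        return query.isEmpty() ? "" : "spanOr([field:" + query + "])";
    }

    @Override
    void checkParserSpecificQuery(List<String> failures) {
        if (!parse("test").equals("spanOr([field:test])")) {
            failures.add("span: unexpected string form");
        }
    }
}
```

Calling new SpanSketchTest().runAll() runs the shared checks plus the span-specific one and returns the list of failure messages. The trade-off is the one noted above: checks that compare string forms cannot be shared, so they end up in the hook.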



Small cleanup in SpanQueryParserBase:
1) Got rid of the special option to lowercase regexes.  They are now treated 
like any other multiterm; eventually, get rid of lowercasing multiterms 
altogether!

2) Got rid of the special rounding correction for fuzzy minSims.  Behavior is 
now the same as in the classic parser.
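
For reference, here is a minimal sketch of how a fuzzy minSim can be turned into an edit distance with no rounding correction. It is a reconstruction from memory of the conversion used on the classic side in Lucene 4.x (FuzzyQuery.floatToEdits), not code from this patch; the class name and the MAX_EDITS constant are placeholders, and the details should be checked against the real source.

```java
final class FuzzySlopSketch {
    // Assumption: stands in for the maximum edit distance supported by the
    // Levenshtein automata (2 in Lucene 4.x).
    static final int MAX_EDITS = 2;

    private FuzzySlopSketch() {}

    // minSim >= 1 is read as an edit distance; 0 < minSim < 1 is read as a
    // similarity and scaled by the term length. The result is truncated,
    // never rounded.
    static int floatToEdits(float minSim, int termLen) {
        if (minSim >= 1f) {
            return (int) Math.min(minSim, MAX_EDITS);
        }
        if (minSim == 0.0f) {
            return 0;
        }
        return Math.min((int) ((1D - minSim) * termLen), MAX_EDITS);
    }
}
```

Because 0.8f is slightly larger than 0.8, (1 - 0.8f) * 10 comes out just under 2 and truncates to 1 edit for a ten-character term. That is the kind of float edge case a rounding correction would smooth over.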


Bugs found:
1) failed to set boost on MatchAllDocsQuery
2) fixed analysis of range terms (the irony :( )

Capabilities not currently covered by SpanQueryParser (SQP):
1) Known: && ! syntax
2) Unknown: "term phrase term" ->
        "+term +(+phrase1 +phrase2) +term"

> [PATCH] SpanQueryParser with recursion, analysis and syntax very similar to 
> classic QueryParser
> -----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5205
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/queryparser
>            Reporter: Tim Allison
>              Labels: patch
>             Fix For: 4.7
>
>         Attachments: LUCENE-5205-cleanup-tests.patch, 
> LUCENE-5205-date-pkg-prvt.patch, LUCENE-5205.patch.gz, LUCENE-5205.patch.gz, 
> LUCENE-5205_dateTestReInitPkgPrvt.patch, LUCENE-5205_smallTestMods.patch, 
> LUCENE_5205.patch, SpanQueryParser_v1.patch.gz, patch.txt
>
>
> This parser extends QueryParserBase and includes functionality from:
> * Classic QueryParser: most of its syntax
> * SurroundQueryParser: recursive parsing for "near" and "not" clauses.
> * ComplexPhraseQueryParser: can handle "near" queries that include multiterms 
> (wildcard, fuzzy, regex, prefix),
> * AnalyzingQueryParser: has an option to analyze multiterms.
> At a high level, there's a first pass BooleanQuery/field parser and then a 
> span query parser handles all terminal nodes and phrases.
> Same as classic syntax:
> * term: test 
> * fuzzy: roam~0.8, roam~2
> * wildcard: te?t, test*, t*st
> * regex: /[mb]oat/
> * phrase: "jakarta apache"
> * phrase with slop: "jakarta apache"~3
> * default "or" clause: jakarta apache
> * grouping "or" clause: (jakarta apache)
> * boolean and +/-: (lucene OR apache) NOT jakarta; +lucene +apache -jakarta
> * multiple fields: title:lucene author:hatcher
>  
> Main additions in SpanQueryParser syntax vs. classic syntax:
> * Can require "in order" for phrases with slop with the ~> operator:
> "jakarta apache"~>3
> * Can specify "not near": "fever bieber"!~3,10 ::
>     find "fever" but not if "bieber" appears within 3 words before or 10
> words after it.
> * Fully recursive phrasal queries with [ and ]; as in:
> [[jakarta apache]~3 lucene]~>4 ::
>     find "jakarta" within 3 words of "apache", and that hit has to be within
> four words before "lucene"
> * Can also use [] for single level phrasal queries instead of " as in:
> [jakarta apache]
> * Can use "or grouping" clauses in phrasal queries: "apache (lucene solr)"~3
> :: find "apache" and then either "lucene" or "solr" within three words.
> * Can use multiterms in phrasal queries: "jakarta~1 ap*che"~2
> * Did I mention full recursion:
> [[jakarta~1 ap*che]~2 (solr~ /l[ou]+[cs][en]+/)]~10 ::
> Find something like "jakarta" within two words of "ap*che" and that hit has
> to be within ten words of something like "solr" or that "lucene" regex.
> * Can require at least x number of hits at boolean level:
> apache AND (lucene solr tika)~2
> * Can use negative only query: -jakarta :: Find all docs that don't contain 
> "jakarta"
> * Can use an edit distance > 2 for fuzzy query via SlowFuzzyQuery (beware of 
> potential performance issues!).
> Trivial additions:
> * Can specify prefix length in fuzzy queries: jakarta~1,2 (edit distance =1, 
> prefix =2)
> * Can specify Optimal String Alignment (OSA) vs. Levenshtein for distance 
> <=2: jakarta~1 (OSA) vs. jakarta~>1 (Levenshtein)
> This parser can be very useful for concordance tasks (see also LUCENE-5317 
> and LUCENE-5318) and for analytical search.  
> Until LUCENE-2878 is closed, this might have a use for fans of SpanQuery.
> Most of the documentation is in the javadoc for SpanQueryParser.
> Any and all feedback is welcome.  Thank you.
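
To make the "not near" semantics in the description concrete, here is a small sketch over a plain array of tokens. It is an illustration only and does not use the parser or Lucene's span classes: the class and method names are invented, and it assumes that "within N words" includes a distance of exactly N.

```java
import java.util.ArrayList;
import java.util.List;

final class NotNearSketch {
    private NotNearSketch() {}

    // Returns the positions of `include` that have no `exclude` token within
    // `pre` positions before or `post` positions after them.
    static List<Integer> notNear(String[] tokens, String include,
                                 String exclude, int pre, int post) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(include)) {
                continue;
            }
            // Clamp the exclusion window to the bounds of the token array.
            int start = Math.max(0, i - pre);
            int end = Math.min(tokens.length - 1, i + post);
            boolean blocked = false;
            for (int j = start; j <= end; j++) {
                if (j != i && tokens[j].equals(exclude)) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

With pre = 3 and post = 10, as in "fever bieber"!~3,10, the tokens of "bieber has a bad fever today" yield a hit at position 4 because "bieber" is four positions before "fever". The tokens of "bieber has bad fever" yield no hit. How the real parser counts positions at the window edges may differ from this sketch.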


