Background: I've been interested in some specific 3-word shingles. The idea is that, although we throw out stop words like "the", "how", "it", etc., some 3-word runs that contain those words are actually potentially useful. This is related to my "Power Law" email from a few days back. BTW, there's a paper that talks about this, and how phrases can act somewhat like unusual words from an IDF perspective: http://ciir-publications.cs.umass.edu/getpdf.php?id=184
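To make the shingle idea concrete, here is a minimal Python sketch of 3-word shingling that keeps stop words instead of dropping them. It is only an illustration: the function name word_shingles and the underscore joiner are assumptions for this example, not the actual Solr analyzer chain or field configuration.

```python
import re


def word_shingles(text, size=3, joiner="_"):
    """Return every run of `size` consecutive words, lowercased and joined."""
    # Tokenize on word characters only, so punctuation such as "?" is dropped.
    tokens = re.findall(r"\w+", text.lower())
    # Slide a window of `size` tokens across the text; stop words are kept.
    return [joiner.join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)]


print(word_shingles("How does this work?"))
# prints ['how_does_this', 'does_this_work']
```

Text shorter than the shingle size simply yields an empty list, so a one- or two-word query would produce no shingle terms under this sketch.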
The issue: I really like the DisMax query parser, but of course its main design is a bit at odds with shingles and phrases. I'd seen folks talk about using the local parameters syntax, though. For example, Chris chimed in a while back suggesting this approach:
http://www.lucidimagination.com/search/document/ea7b0b27b1b17b1c/re_replacing_fast_functionality_atsesam_no_shinglefilter_exactmatching

I've also done some other reading on the web, at Lucid, etc. about the curly brace syntax. But this doesn't seem to be working the way I thought, with respect to protecting text from the first-pass Lucene parser.

I have a custom field type and field defined (shingle_type / shingle_text), along with a few classes. If I run this through the analyzer:

    How does this work?

I get:

    1: how_does_this
    2: does_this_work

with the numbers being the offsets.

Now I combine that with dismax and my regular fields, which have aggressive stop words:

Input:
    {!dismax qf="title^1.2 summary shingle_text^3.0" v="How does this work?"}

Output:
    +((DisjunctionMaxQuery((title:work^1.2 | summary:work)))~1) ()

It SHOULD also have shingle_text:how_does_this and shingle_text:does_this_work.

From the various threads about shingles, phrases and local parameters, I thought having the v="stuff" would bypass the Lucene parser?

Thanks for any ideas y'all might have,
Mark

PS: I realize that adding "pf" would be similar to what I'm doing, but I wouldn't have as much control over the run of the phrases, and I've got some pretty specific stats in my index on the shingles. Also, I really want to understand the parsing process.

PPS: I also looked at the XML query parser stuff, but it's not clear (to me) when that will be in a mainline release (vs. a patch), and for various reasons a patch is not desirable on this project.

--
Mark Bennett / New Idea Engineering, Inc. / [email protected]
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
