> #2 & #3 is the same requirement; you elaborate on #2 with more detail in #3.
> The UH can't currently do this; but with the OH (original Highlighter) you
> can but it appears somewhat awkward. See SimpleSpanFragmenter. I had said
> it was easy but I was mistaken; I'm getting rustier on the OH.

Well, the requirement here is that we do want the context of a hit and
"broaden" it to roughly X characters (total). I do see that OH does have
something like this with regexp fragmenter (slop factor), but I hoped this
should be somewhat easier. I just spent an hour or so trying to tune it, but
without much success.

> With the original Highlighter, you get TextFragment instances which contain
> the textStartPos & textEndPos. You can use that info to conditionally add
> ellipsis.

Yup, I realize that. I just wanted something that'd do it out of the box in
Solr because I didn't want to add custom code to the distribution/core. Sigh.

> With the original Highlighter, you can easily do this by providing the
> stored text. When you create the QueryScorer, use "null" for field name to
> highlight all query fields. The UH can do this as well by highlighting the
> fields that are stored, and call setFieldMatcher to provide a Predicate that
> always return true.

Wouldn't this be equivalent to requireFieldMatch=false? It's not exactly what
I had in mind -- I don't want to highlight across all fields, I want to
highlight those that actually contributed to the document being selected.
Imagine the following:

  {
    a: "foo bar",
    b: "foo baz",
    c: "foo bat"
  }

Let's say "a" and "b" are copied to the sink field (default search field),
but "c" is not. The highlighter is asked to highlight all fields. For a
query:

  "foo"

it should return a highlight on "a" and "b", but not "c". On the other hand,
a query

  "c:foo"

should only highlight "c". In other words -- the user should clearly see
which fields actually contributed to the document being part of the search
result. requireFieldMatch=false is a really crude cannon to solve this.

> Yeah I already explained why your snippet-centering requirement simply can't
> be met with the UH.

Thanks, I thought so.

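Going back to the a/b/c example above: here is a rough sketch of the field
selection I have in mind, written as a Predicate<String>, which is the shape
setFieldMatcher is said to accept in the quote above. Everything else is an
assumption made for illustration: the sink field name "text", the
COPY_SOURCES map (a stand-in for the schema's copyField rules) and the
matcherFor helper are all hypothetical, and the sketch handles only a single
query field. It accepts the queried field itself plus, when that field is the
sink, the sources copied into it.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

final class ContributingFields {
    // Hypothetical copyField configuration: sink field -> source fields copied into it.
    static final Map<String, Set<String>> COPY_SOURCES = Map.of("text", Set.of("a", "b"));

    /** Accepts the fields whose content could have produced a match on queryField. */
    static Predicate<String> matcherFor(String queryField) {
        Set<String> sources = COPY_SOURCES.getOrDefault(queryField, Set.of());
        return field -> field.equals(queryField) || sources.contains(field);
    }

    public static void main(String[] args) {
        // Query "foo" against the default (sink) field: highlight a and b, not c.
        Predicate<String> dflt = matcherFor("text");
        System.out.println(dflt.test("a") + " " + dflt.test("b") + " " + dflt.test("c"));
        // Query "c:foo": highlight c only.
        Predicate<String> onlyC = matcherFor("c");
        System.out.println(onlyC.test("a") + " " + onlyC.test("c"));
    }
}
```

This only narrows down the candidate fields; whether a given source field
really contained the matching term is still up to the highlighter itself.
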
We actually have a custom highlighter (unrelated to Solr) in our commercial
product that works on a slightly different basis than what can be found in
Lucene (I think). The pipeline there is as follows:

1) determine "highlight" offset ranges (from, to, type). Highlight "types"
   can be different so that, for example, one can highlight two queries at
   once (and they can overlap in all kinds of ways).

2) process highlight ranges so that they're hierarchically nested (split
   non-tree-like overlaps into hierarchical descents). This makes it easier
   to emit HTML markup later on.

3) expand each highlight range to fit certain criteria (typically the
   desired length of the snippet). This expansion uses a break iterator (on
   words) and respects certain hard limits (like value boundaries for
   multivalue fields).

4) score each such expanded range. The scoring formula checks whether any
   other highlights fall within the same window; if so, the window receives
   a higher score. This results in multi-term matches typically ending up at
   the top of the scoring list.

5) emit the best scoring ranges, marking highlights properly.

We actually use UnifiedHighlighter for the first step above, the rest is
custom. It can be used to highlight pretty much anything, since the inputs
are the text itself and the ranges to highlight (offsets + type). Note it
doesn't solve the problem of the default field highlighting -- this is
something that'd have to be addressed separately, but it's been working for
us fairly well in practice.

I'd be glad to contribute this code back to Lucene, but it's kind of detached
from the infrastructure and it'd require some work to integrate. :(

Dawid
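
P.S. To give a rough idea of steps 3-5 of the pipeline above, here is a
minimal sketch in plain Java. It is not the code from the product: the names
(SnippetSketch, Range, expand, score, bestSnippet) are made up for
illustration, highlight "types" and the nesting of step 2 are left out, hits
are assumed to be sorted and non-overlapping, and the only hard limit
respected is the text itself. The scoring is simply the number of hits that
fall inside the expanded window.

```java
import java.text.BreakIterator;
import java.util.List;
import java.util.Locale;

final class SnippetSketch {
    /** A highlight range [from, to) in the text; step 1 would produce these. */
    record Range(int from, int to) {}

    /** Step 3: grow a hit to roughly targetLen characters, snapping to word boundaries. */
    static Range expand(String text, Range hit, int targetLen) {
        int extra = Math.max(0, targetLen - (hit.to() - hit.from()));
        int from = Math.max(0, hit.from() - extra / 2);
        int to = Math.min(text.length(), hit.to() + (extra - extra / 2));
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        // Snap inwards so no word is cut in half, but never shrink past the hit itself.
        if (!words.isBoundary(from)) from = Math.min(words.following(from), hit.from());
        if (!words.isBoundary(to)) to = Math.max(words.preceding(to), hit.to());
        return new Range(from, to);
    }

    /** Step 4: a window scores one point per hit that lies fully inside it. */
    static int score(Range window, List<Range> hits) {
        int n = 0;
        for (Range h : hits) {
            if (h.from() >= window.from() && h.to() <= window.to()) n++;
        }
        return n;
    }

    /** Step 5: emit the best-scoring window with <em> markers and conditional ellipses. */
    static String bestSnippet(String text, List<Range> hits, int targetLen) {
        Range best = null;
        int bestScore = -1;
        for (Range h : hits) {
            Range w = expand(text, h, targetLen);
            int s = score(w, hits);
            if (s > bestScore) { best = w; bestScore = s; }
        }
        if (best == null) return "";
        StringBuilder sb = new StringBuilder();
        if (best.from() > 0) sb.append("...");
        int pos = best.from();
        for (Range h : hits) { // assumes hits are sorted and do not overlap
            if (h.from() < best.from() || h.to() > best.to()) continue;
            sb.append(text, pos, h.from())
              .append("<em>").append(text, h.from(), h.to()).append("</em>");
            pos = h.to();
        }
        sb.append(text, pos, best.to());
        if (best.to() < text.length()) sb.append("...");
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "alpha beta gamma delta epsilon zeta eta theta";
        // Hits on "gamma", "delta" and "theta".
        List<Range> hits = List.of(new Range(11, 16), new Range(17, 22), new Range(40, 45));
        System.out.println(bestSnippet(text, hits, 20));
    }
}
```

With the sample input in main, the window holding the two adjacent hits
("gamma" and "delta") outscores the one around the isolated "theta", which is
the multi-term preference described in step 4.
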