On Fri, Oct 10, 2014 at 10:46 AM, [email protected] <[email protected]> wrote:
> On Fri, Oct 10, 2014 at 7:13 AM, Robert Muir <[email protected]> wrote:
>>
>> On Fri, Oct 10, 2014 at 12:38 AM, [email protected]
>> <[email protected]> wrote:
>> > The fastest highlighter we've got in Lucene is the PostingsHighlighter,
>> > but it throws out any positional nature in the query and can highlight
>> > more inaccurately than the other two highlighters. Accuracy is part of
>> > the mission from the sponsor commissioning this effort.
>> >
>> That's because it tries to summarize the document contents wrt the query,
>> so the user can decide if it's relevant (versus being a debugger for span
>> queries, or whatever). The algorithms used to do this don't really get
>> benefits from positions, because they are the same ones used for regular IR.
>>
>> In short, the "inaccuracy" is important, because this highlighter is
>> trying to do something different than the other highlighters.
>
> I'm confused how inaccuracy is a feature, but nevertheless I appreciate
> that the postings highlighter as-is is good enough for most users. Thanks
> for your awesome work on this highlighter, by the way!
>
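For context, basic use of the PostingsHighlighter looks roughly like the
sketch below. This is a from-memory, Lucene 5.x-era example; the field name
and text are invented, and the key requirement is that the field is indexed
with offsets in the postings:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.*;
    import org.apache.lucene.search.postingshighlight.PostingsHighlighter;
    import org.apache.lucene.store.RAMDirectory;

    public class PostingsHighlighterDemo {
      public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        // The field must carry offsets in the postings for this highlighter to work.
        FieldType ft = new FieldType(TextField.TYPE_STORED);
        ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

        Document doc = new Document();
        doc.add(new Field("body",
            "The postings highlighter picks the best sentences. "
                + "It summarizes the document with respect to the query.", ft));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        Query query = new TermQuery(new Term("body", "highlighter"));
        TopDocs topDocs = searcher.search(query, 10);

        PostingsHighlighter highlighter = new PostingsHighlighter();
        // One summary string per hit, assembled from the top-scoring passages
        // (sentence-bounded by default), not from exact query match positions.
        String[] summaries = highlighter.highlight("body", query, searcher, topDocs);
        System.out.println(summaries[0]);
      }
    }

The summaries it returns are per-hit digests built from the best-scoring
sentences, which is the "summarize the document wrt the query" behavior
described above.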
Well, it's not intended to be "better", just "different". It's for the
full-text use case, so it tries to apply some simple IR stuff in a
"miniature" world, efficiently. So maybe it's good if they are working with
totally unstructured text.

For data that is more structured, I would probably approach the problem
completely differently. I think this is a key thing: I don't think there
should/will be "one way", but rather choices for the user. Someone who cares
more about matching complex structured data will not be happy with this
choice of highlighter. In that case you can probably bail out on scoring
sentences and other such concepts (assume docs are short, maybe even just
always show all matches to keep it super simple).

>> The reason it might be faster in comparison has less to do with the
>> fact it reads offsets from the postings lists and more to do with the
>> fact it does not have bad O(n^2) etc. algorithms that the other
>> highlighters do. It's not faster: it just does not blow up.
>
> Well, it isn't cheap to re-analyze the document text (what the default
> highlighter does) nor to read term vectors and sort the tokens (what the
> default highlighter does when term vectors are available). At least not
> with big docs (lots of text to analyze or large term vectors to read and
> sort). My first steps were to try and make the default highlighter faster,
> but it still isn't fast enough and it isn't accurate enough either (for me).

It depends on the query what is cheap. If the query is a wildcard query, then
it's cheaper to either do nothing at all with it (what PostingsHighlighter
does by default), or to expand the wildcard against only the per-document
contents rather than the entire index (what PostingsHighlighter optionally
does, though it uses re-analysis for such queries).

> I looked at the FVH a little but thought I'd skip the heft of term vectors
> and use PostingsHighlighter, now that I'm willing to break open these
> complex beasts and build what's needed to meet my accuracy requirements.

But if you care more about matching what the query did, term vectors might be
useful, because they contain a per-document term dictionary to expand
wildcards and other MultiTermQueries against, without re-analysis and without
blowing up.

> Do you foresee any O(n^2) algorithms in what I've said?
>
> I plan to make simple approximations to score one passage relative to
> another. The passage with the most diversity in query terms wins, or at
> least that is the highest-weighted factor. Then, low within-doc freq (on a
> per-term basis). Then, high freq in the passage. Then, shortness of the
> passage and closeness to the beginning. In short, something fast to compute
> and pretty reasonable; my principal requirement is highlighting accuracy,
> and it needs to support a lot of query types (incl. custom span queries).

Well, in this case you bail on any notion of term importance. But if you
simplify relevance to just be a coordinate match, then it's probably not so
bad to have no "IDF", depending on what queries look like. For the
PostingsHighlighter, since it's targeted at unstructured text, I don't think
it would be the right tradeoff, because sentences with lots of stopwords
could get ranked above ones that have the "important term" only once. For
reasons like this, it just adapts BM25 at a microscopic level. And that's why
it uses a simple extractTerms approach (how do you assign a useful IDF to a
phrase or a span without doing something expensive across potentially tons of
data, etc.).
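To make the term-vector point concrete, here is a rough sketch (Lucene
5.x-era API from memory, names invented) of expanding a wildcard-like pattern
against a single document's term vector, so only terms that document actually
contains are considered; a plain regex stands in for the real automaton
machinery:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    class PerDocWildcardExpansion {
      // Expand a wildcard-like pattern against one document's term vector, so only
      // terms that actually occur in that document are candidates for highlighting.
      static List<String> expand(IndexReader reader, int docId, String field, Pattern pattern)
          throws IOException {
        List<String> expanded = new ArrayList<>();
        Terms tv = reader.getTermVector(docId, field);   // the per-document term dictionary
        if (tv == null) {
          return expanded;                               // field was not indexed with term vectors
        }
        TermsEnum termsEnum = tv.iterator();
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
          String text = term.utf8ToString();
          if (pattern.matcher(text).matches()) {         // regex stands in for the automaton machinery
            expanded.add(text);
          }
        }
        return expanded;
      }
    }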
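And to make the proposed passage ranking concrete, a hypothetical sketch of
the kind of scorer being described; none of these names are Lucene API, and
the weights merely encode the "first this, then that" ordering from the
description:

    import java.util.HashMap;
    import java.util.Map;

    // A candidate passage: its character span plus how often each query term occurs in it.
    class Passage {
      int startOffset, endOffset;
      Map<String, Integer> termFreqs = new HashMap<>();
    }

    class SimplePassageScorer {
      // Higher is better. Factors, strongest first:
      //  1. number of distinct query terms present (coordinate match)
      //  2. rarity of those terms within the document (1 / within-doc freq)
      //  3. raw frequency of query terms inside the passage
      //  4. shortness of the passage and closeness to the start of the document
      float score(Passage p, Map<String, Integer> withinDocFreq, int docLength) {
        float distinct = p.termFreqs.size();
        float rarity = 0f;
        float freq = 0f;
        for (Map.Entry<String, Integer> e : p.termFreqs.entrySet()) {
          rarity += 1f / Math.max(1, withinDocFreq.getOrDefault(e.getKey(), 1));
          freq += e.getValue();
        }
        int length = p.endOffset - p.startOffset;
        float brevity = 1f / (1f + length / 100f);
        float earliness = 1f / (1f + p.startOffset / (float) Math.max(1, docLength));
        // Large weight ratios make each factor mostly a tie-breaker for the one above it.
        return distinct * 1000f + rarity * 100f + freq * 10f + brevity + earliness;
      }
    }

A real implementation would normalize or tune these weights, but the ordering
of factors is the point.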
