On Fri, Oct 10, 2014 at 10:46 AM, [email protected]
<[email protected]> wrote:
> On Fri, Oct 10, 2014 at 7:13 AM, Robert Muir <[email protected]> wrote:
>>
>> On Fri, Oct 10, 2014 at 12:38 AM, [email protected]
>> <[email protected]> wrote:
>> > The fastest
>> > highlighter we’ve got in Lucene is the PostingsHighlighter, but it throws
>> > out any positional nature in the query and can highlight more inaccurately
>> > than the other two highlighters. Accurate highlighting is part of my
>> > mission from the sponsor commissioning this effort.
>> >
>>
>> That's because it tries to summarize the document contents with respect
>> to the query, so the user can decide if it's relevant (versus being a
>> debugger for span queries, or whatever). The algorithms used to do this
>> don't really benefit from positions, because they are the same ones
>> used for regular IR.
>>
>>
>> In short, the "inaccuracy" is important, because this highlighter is
>> trying to do something different than the other highlighters.
>
>
> I’m confused how inaccuracy is a feature, but nevertheless I appreciate that
> the postings highlighter as-is is good enough for most users.  Thanks for
> your awesome work on this highlighter, by the way!
>

Well, it's not intended to be "better", just "different". It's for the
full-text use case, so it tries to apply some simple IR techniques
efficiently in a "miniature" world. It's probably a good fit for users
working with totally unstructured text.

For data that is more structured, i would probably approach the
problem completely differently.

I think this is a key point: I don't think there should or will be "one
way", but rather choices for the user. Someone who cares more about
'matching complex structured data' will not be happy with this choice
of highlighter. In that case you can probably skip scoring sentences
and other such concepts (assume docs are short, and maybe even just
always show all matches to keep it super simple).

>>
>> The reason it might seem faster in comparison has less to do with the
>> fact that it reads offsets from the postings lists and more to do with
>> the fact that it does not have the bad O(n^2) algorithms that the other
>> highlighters do. It's not faster: it just does not blow up.
>
>
> Well, it isn’t cheap to re-analyze the document text (what the default
> highlighter does) nor to read term-vectors and sort the tokens (what the
> default highlighter does when term vectors are available).  At least not
> with big docs (lots of text to analyze or large term vectors to read and
> sort).  My first steps were to try and make the default highlighter faster
> but it still isn’t fast enough and it isn’t accurate enough either (for me).

What is cheap depends on the query. If the query is a wildcard query,
then it's cheaper to either do nothing at all with it (what
PostingsHighlighter does by default), or to only expand the wildcard
against the per-document contents rather than the entire index (what
PostingsHighlighter optionally does, though it uses re-analysis for
such queries).
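
For reference, the opt-in looks roughly like the sketch below in the 4.x-era
API (exact signatures may differ across versions; "body" and the helper name
are just placeholders). By default a wildcard contributes nothing to the
highlight; returning the index-time analyzer from getIndexAnalyzer() is what
turns on per-document expansion of multi-term queries via re-analysis:

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.postingshighlight.PostingsHighlighter;

    // Rough sketch: opt PostingsHighlighter into highlighting wildcards and
    // other MultiTermQueries by exposing the analyzer used at index time.
    static String[] highlightWithMultiTermQueries(final Analyzer indexAnalyzer,
        Query query, IndexSearcher searcher, TopDocs topDocs) throws IOException {
      PostingsHighlighter highlighter = new PostingsHighlighter() {
        @Override
        protected Analyzer getIndexAnalyzer(String field) {
          return indexAnalyzer; // same analyzer that was used at index time
        }
      };
      return highlighter.highlight("body", query, searcher, topDocs);
    }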

>
> I looked at the FVH a little but thought I’d skip the heft of term vectors
> and use PostingsHighlighter, now that I’m willing to break open these
> complex beasts and build what’s needed to meet my accuracy requirements.

But if you care more about matching what the query did, term vectors
might be useful, because they contain a per-document term dictionary
to expand wildcards and other MultiTermQueries against, without
re-analysis and without blowing up.
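
Something like this made-up helper (4.x-era API) is enough to expand, say, a
prefix against a single document's term vector instead of the whole index's
term dictionary:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    // Rough sketch: walk one document's term vector and collect the terms
    // that a prefix query like "foo*" would actually match in that document.
    static List<String> expandPrefixForDoc(IndexReader reader, int docID,
        String field, String prefix) throws IOException {
      List<String> expanded = new ArrayList<>();
      Terms vector = reader.getTermVector(docID, field); // null if no term vector was indexed
      if (vector == null) {
        return expanded;
      }
      TermsEnum termsEnum = vector.iterator(null);
      BytesRef term;
      while ((term = termsEnum.next()) != null) {
        String text = term.utf8ToString();
        if (text.startsWith(prefix)) { // only terms this document actually contains
          expanded.add(text);
        }
      }
      return expanded;
    }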

>
> Do you foresee any O(n^2) algorithms in what I’ve said?
>
> I plan to make simple approximations to score one passage relative to
> another.  The passage with the most diversity in query terms wins, or at
> least diversity is the highest-weighted factor.  Then low within-doc freq (on
> a per-term basis), then high freq within the passage, then shortness of the
> passage and closeness to the beginning.  In short, something fast to compute
> and pretty reasonable; my principal requirement is highlighting accuracy,
> and it needs to support a lot of query types (incl. custom span queries).

Well, in this case you bail on any notion of term importance. But if
you simplify relevance to just be a coordinate match, then it's
probably not so bad to have no "IDF", depending on what queries look
like.
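
To make that concrete, here's a quick sketch of scoring along the lines
described above (all names and constants are made up for illustration, not
from any Lucene class):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Rough sketch of the proposed ranking order: term diversity first, then
    // rare-in-doc terms, then in-passage frequency, then shortness/earliness.
    class PassageCandidate {
      Set<String> distinctQueryTerms = new HashSet<>();
      Map<String, Integer> termFreqInPassage = new HashMap<>();
      int startOffset;
      int endOffset;

      // withinDocFreq maps each query term to its frequency in the whole document.
      float score(Map<String, Integer> withinDocFreq) {
        float score = distinctQueryTerms.size();          // one point per distinct query term
        for (String term : distinctQueryTerms) {
          int docFreq = withinDocFreq.get(term);           // low within-doc freq => more weight
          int passageFreq = termFreqInPassage.get(term);   // repeated hits in the passage help a bit
          score += (1.0f / docFreq) * (1.0f + 0.1f * (passageFreq - 1));
        }
        score -= 0.001f * (endOffset - startOffset);       // mild preference for short passages
        score -= 0.0001f * startOffset;                    // mild preference for early passages
        return score;
      }
    }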

For PH, since it's targeted at unstructured text, I don't think it
would be the right tradeoff, because sentences with lots of stopwords
could get ranked above ones that have the "important term" only once.
For reasons like this, it just adapts BM25 at a microscopic level. And
that's why it uses a simple extractTerms approach (how do you assign a
useful IDF to a phrase or a span without doing something expensive
across potentially tons of data, etc.).
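
The shape of that "BM25 at a microscopic level" idea, very roughly (this is
not the actual PassageScorer code; the constants are just the usual BM25-ish
defaults and the names are made up):

    // The single document is treated as a tiny corpus: contentLength / PIVOT
    // plays the role of numDocs, and the term's total frequency in the doc
    // plays the role of docFreq.
    static final float K1 = 1.2f;
    static final float B = 0.75f;
    static final float PIVOT = 87f; // assumed "average passage length" in characters

    // IDF-like weight for one term, computed from the document alone.
    static float weight(int contentLength, int totalTermFreq) {
      float numDocs = 1 + contentLength / PIVOT;
      return (K1 + 1) * (float) Math.log(1 + (numDocs + 0.5) / (totalTermFreq + 0.5));
    }

    // BM25-style term-frequency saturation within one passage.
    static float tfNorm(int freqInPassage, int passageLength) {
      float norm = K1 * ((1 - B) + B * (passageLength / PIVOT));
      return freqInPassage / (freqInPassage + norm);
    }

    // A passage's score is then roughly the sum, over the query terms that
    // occur in it, of weight(...) * tfNorm(...).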
