I'm working on building a custom highlighter for a client, which may
eventually be generalizable. In my work, I've come across some
issues I'd like to discuss. One issue is of appended fields allowing
querying across boundaries. For example, if I index two fields with
the same name:
doc.add(new Field("repeated", "this is a repeated field -
first instance", Field.Store.YES,
Field.Index.TOKENIZED));
doc.add(new Field("repeated", "this is a repeated field -
second instance", Field.Store.YES,
Field.Index.TOKENIZED));
A query, of course by design, can span across those two field
instances. This query, against an index with only the above single
document in it, shows the effect:
SpanTermQuery first = new SpanTermQuery(new Term("repeated",
"first"));
SpanTermQuery second = new SpanTermQuery(new Term("repeated",
"second"));
SpanNearQuery wrapped = new SpanNearQuery(new SpanQuery[]
{ first, second}, 7, true);
IndexSearcher searcher = new IndexSearcher(reader);
Hits hits = searcher.search(wrapped);
assertEquals(1, hits.length());
So my first question is how could this match be prevented?
Technically if the second "this" has a large position increment then
there would be no match. But how could I achieve that large position
increment? Does it make sense to add an IndexWriter setting to
specify a default position increment gap to use when multiple fields
are added in this way? And likely a setting on a per-field basis to
specify an increment offset to use for that individual field. There
isn't a way an Analyzer itself could address this situation, is there?
Highlighting is quite a challenging endeavor! Spans certainly
provides a lot of help, but in the appended field scenario, the
Spans.start() and .end() goes across the field boundary, so it
requires, in my case with the text coming from stored field values,
cleverness in how to highlight in order to keep field instances
separate.
So, should we make some changes in allowing offsets to be controllable?
Thanks,
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]