I'm working on building a custom highlighter for a client, which may eventually be generalizable. In my work, I've come across some issues I'd like to discuss. One issue is of appended fields allowing querying across boundaries. For example, if I index two fields with the same name:

doc.add(new Field("repeated", "this is a repeated field - first instance", Field.Store.YES,
                Field.Index.TOKENIZED));
doc.add(new Field("repeated", "this is a repeated field - second instance", Field.Store.YES,
                Field.Index.TOKENIZED));

A query, of course by design, can span across those two field instances. This query, against an index with only the above single document in it, shows the effect:

SpanTermQuery first = new SpanTermQuery(new Term("repeated", "first")); SpanTermQuery second = new SpanTermQuery(new Term("repeated", "second")); SpanNearQuery wrapped = new SpanNearQuery(new SpanQuery[] { first, second}, 7, true);

    IndexSearcher searcher = new IndexSearcher(reader);
    Hits hits = searcher.search(wrapped);
    assertEquals(1, hits.length());

So my first question is how could this match be prevented? Technically if the second "this" has a large position increment then there would be no match. But how could I achieve that large position increment? Does it make sense to add an IndexWriter setting to specify a default position increment gap to use when multiple fields are added in this way? And likely a setting on a per-field basis to specify an increment offset to use for that individual field. There isn't a way an Analyzer itself could address this situation, is there?

Highlighting is quite a challenging endeavor! Spans certainly provides a lot of help, but in the appended field scenario, the Spans.start() and .end() goes across the field boundary, so it requires, in my case with the text coming from stored field values, cleverness in how to highlight in order to keep field instances separate.

So, should we make some changes in allowing offsets to be controllable?

Thanks,
    Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to