The BreakIterator implementations in the JDK (and likely IBM ICU) seem slow
and can sometimes dominate the performance of this highlighter.  I worked on
a large search project (which led to the creation of the UnifiedHighlighter)
where we encoded the breaks directly into the text ahead of time, as a
single special character.  Perhaps use a "vertical tab"?  On the Solr side,
the BreakIterator then becomes a trivial char-based iterator, which already
exists in Lucene/Solr.  You might do this as well.  You could add a custom
Solr UpdateRequestProcessor (URP) that inserts these characters.
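
To make the idea concrete, here is a rough sketch in plain Java, not the code
from that project.  The class and method names (BreakMarker, insertBreaks,
passages) are made up for illustration, and it assumes the vertical tab never
occurs in your source text.  insertBreaks is roughly what a custom URP would
do to the field value before indexing; passages only shows how cheap boundary
detection becomes afterwards.  In Lucene, the char-based iterator I mean
should be CustomSeparatorBreakIterator, but check it against your version.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakMarker {
    // Assumption: the vertical tab never appears in the original text.
    static final char SEPARATOR = '\u000B';

    /** Index-time step: run the slow sentence BreakIterator once and mark each boundary. */
    static String insertBreaks(String text) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(text);
        StringBuilder out = new StringBuilder(text.length() + 16);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            out.append(text, start, end);
            // No marker after the last sentence; the end of text is already a boundary.
            if (end < text.length()) {
                out.append(SEPARATOR);
            }
        }
        return out.toString();
    }

    /** Query-time step: finding passage boundaries is now a plain character scan. */
    static List<String> passages(String marked) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int i = marked.indexOf(SEPARATOR); i >= 0; i = marked.indexOf(SEPARATOR, start)) {
            result.add(marked.substring(start, i));
            start = i + 1;
        }
        if (start < marked.length()) {
            result.add(marked.substring(start));
        }
        return result;
    }
}
```

The expensive boundary analysis runs once per document at index time, so the
highlighter only pays for an indexOf scan per passage.  Your own rules
(sentence start, word end, the edge cases) would replace the sentence
iterator inside insertBreaks.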

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Mar 8, 2021 at 5:06 AM df2832368_...@amberoad.de
df2832368_...@amberoad.de <j...@amberoad.de> wrote:

> And of course the link broke:
> https://drive.google.com/file/d/1wfZFQD6loTeA9_-eGrdwi9YGtJcNjKli/view?usp=sharing
>
> >     df2832368_...@amberoad.de df2832368_...@amberoad.de <j...@amberoad.de>
> wrote on 08.03.2021 11:05:
> >
> >
> >     Hello,
> >
> >     I am currently working on getting a custom BreakIterator for the
> Unified Highlighter to work, and am struggling a bit with performance.
> >
> >     I need a BreakIterator for getting nice highlights of passages. For
> this I want the start of the highlight to be a sentence-start and the end
> to be a word-end. There are also some weird edge cases.
> >
> >     I already coded the BreakIterator and integrated it into our custom
> UnifiedHighlighter class, but when I use this iterator the qTime of all
> requests rises from ~1000 to 12000+, which is not acceptable for this
> application.
> >
> >     Here is a link to my implementation. I can't really find where I am
> being horribly inefficient. (I know that these functions get called very
> often.)
> >
> >     Any suggestions are welcome, also other approaches.
> >
> >     Are there any good resources for learning more about BreakIterators
> and related topics? Digging into the code is really hard here.
> >
> >     Another approach I am considering next is to do this highlight
> "trimming" once the final highlights are found. This would reduce the
> amount of logic called, but I guess Solr's scoring system wouldn't be
> taken into account the right way.
> >
> >     As I said, all suggestions are welcome, and thanks in advance.
> >
> >     Jan Ulrich Robens
> >
>
