The BreakIterator implementations in the JDK (and likely IBM ICU) seem slow and can sometimes dominate the performance of this highlighter. I worked on a large search project (which led to the creation of the UnifiedHighlighter), and there we encoded the breaks directly into the text ahead of time, at index time. Each break was just a special character; perhaps use a "vertical tab"? On the Solr side, the BreakIterator then became a trivial char-based iterator, which already exists in Lucene/Solr. You might do this as well: you could add a custom Solr UpdateRequestProcessor (URP) that inserts these characters.
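
To make the idea concrete, here is a minimal JDK-only sketch, not the code from that project. The class name BreakEncoder, both method names, and the choice of U+000B as separator are assumptions made for illustration. encodeBreaks stands in for what a custom URP would do once per document at index time, and following stands in for the cheap char scan that replaces the expensive BreakIterator at query time.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of Lucene or Solr.
final class BreakEncoder {
    // Assumed separator: vertical tab (U+000B), as suggested above.
    static final char SEP = '\u000B';

    // Index-time step: run the expensive sentence BreakIterator once
    // and insert SEP at every interior sentence boundary.
    static String encodeBreaks(String text) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(text);
        StringBuilder out = new StringBuilder(text.length() + 16);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            out.append(text, start, end);
            if (end < text.length()) {
                out.append(SEP); // no separator after the last sentence
            }
        }
        return out.toString();
    }

    // Query-time step: the first break strictly after offset is the
    // position just past the next SEP, or the end of the text.
    static int following(String encoded, int offset) {
        int idx = encoded.indexOf(SEP, offset);
        return idx < 0 ? encoded.length() : idx + 1;
    }
}
```

With this layout the separators stay in the stored text, so snippets may need them stripped before display. Custom rules such as "start at a sentence start, end at a word end" would go into encodeBreaks, where they cost nothing at query time.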
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Mon, Mar 8, 2021 at 5:06 AM df2832368_...@amberoad.de df2832368_...@amberoad.de <j...@amberoad.de> wrote:

> And of course the link broke:
> https://drive.google.com/file/d/1wfZFQD6loTeA9_-eGrdwi9YGtJcNjKli/view?usp=sharing
>
> > df2832368_...@amberoad.de df2832368_...@amberoad.de <j...@amberoad.de>
> > wrote on 08.03.2021 11:05:
> >
> > Hello,
> >
> > I am currently working on getting a custom BreakIterator for the
> > Unified Highlighter to work, and I am struggling a bit performance-wise.
> >
> > I need a BreakIterator for getting nice highlights of passages. For
> > this I want the start of the highlight to be a sentence start and the
> > end to be a word end. There are also some weird edge cases.
> >
> > I already coded the BreakIterator and integrated it into our custom
> > UnifiedHighlighter class, but when I use this iterator the QTime of all
> > requests rises from ~1000 to 12000+, which is not acceptable for this
> > application.
> >
> > Here is a link to my implementation. I can't really find where I am
> > horribly inefficient. (I know that these functions get called very often.)
> >
> > Any suggestions are welcome, including other approaches.
> >
> > Also, are there some nice resources to learn more about BreakIterators
> > and related topics? Digging into the code is really hard here.
> >
> > Another approach I am considering next is to do this highlight
> > "trimming" once the final highlights are found. This would reduce the
> > amount of logic called, but I guess the scoring system of Solr wouldn't
> > be taken into account the right way.
> >
> > As I said, all suggestions are welcome, and thanks in advance.
> >
> > Jan Ulrich Robens