Thank you, Yonik, for this hint. (Again, I obviously wasn't aware that
Solr offers useful extensions for the Lucene indexing process, and I
wonder why they haven't been added to Lucene itself.)
Anyway, since the HyphenatedWordsFilter needs newlines in the input, I
will have to take another Toke
Solr has a hyphenated word filter you could copy.
http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenatedWordsFilterFactory.html
On trunk, this has been folded into the analysis module.
-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco
Hi,
For indexing PDF files we have to undo word hyphenation. The basic idea
is simply to remove the hyphen when it is followed by a line break and
then a lowercase letter. Of course this approach isn't 100% foolproof,
but checking against a dictionary wouldn't be either...
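
To make the rule concrete, here is a minimal sketch in plain Java of what I mean. It works on the extracted text before it reaches any analyzer, and it is not the Solr filter. The class name Dehyphenator and the method dehyphenate are made up for illustration, and the choice to also swallow whitespace around the line break is my own assumption:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Solr.
public class Dehyphenator {

    // Matches a hyphen, optional trailing blanks, a line break and any
    // leading whitespace on the next line, but only when a lowercase
    // letter follows (checked with a lookahead, so the letter is kept).
    private static final Pattern LINE_END_HYPHEN =
            Pattern.compile("-[ \\t]*\\r?\\n\\s*(?=\\p{Ll})");

    // Joins words that were split across lines by removing the hyphen
    // and the line break between the two halves.
    public static String dehyphenate(String text) {
        return LINE_END_HYPHEN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // "hyphen-\nation" is joined; "PDF-\nText" is left alone because
        // the next line starts with an uppercase letter.
        System.out.println(dehyphenate("hyphen-\nation of PDF-\nText"));
    }
}
```

As said, this will wrongly join genuine compounds such as "well-\nknown" into "wellknown", so it is only a heuristic.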
Since we face this problem too when hig