Re: Undo hyphenation when indexing

2011-04-04 Thread Wulf Berschin
Thank you, Yonik, for this hint. (Again, I wasn't aware that Solr evidently offers useful extensions for the Lucene indexing process, and I wonder why they haven't been added to Lucene itself.) Anyway, since the HyphenatedWordsFilter needs newlines in the input, I will have to take another Toke

Re: Undo hyphenation when indexing

2011-04-01 Thread Yonik Seeley
Solr has a hyphenated word filter you could copy. http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenatedWordsFilterFactory.html On trunk, this has been folded into the analysis module. -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco

Undo hyphenation when indexing

2011-04-01 Thread Wulf Berschin
Hi, for indexing PDF files we have to undo word hyphenation. The basic idea is simply to remove the hyphen when a newline and a lowercase letter follow. Of course this approach isn't 100% foolproof, but checking against a dictionary wouldn't be either... Since we face this problem too when hig
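
The rule described in this message (drop the hyphen when a newline and a lowercase letter follow) could be sketched roughly as below. This is only an illustration of the heuristic on raw extracted text, written in Python for brevity; it is not Lucene or Solr code, and the function name undo_hyphenation is made up for this example. It assumes the PDF text extraction keeps the line breaks.

```python
import re

# A hyphen directly after a word character, then optional spaces/tabs,
# a line break, optional indentation, and a word character on the next line.
# Lookbehind/lookahead keep the surrounding letters out of the match,
# so consecutive line-break hyphens are all handled.
_LINE_BREAK_HYPHEN = re.compile(r"(?<=\w)-[ \t]*\r?\n[ \t]*(?=(\w))")


def undo_hyphenation(text: str) -> str:
    """Join words split by a hyphen at a line end (hypothetical helper).

    The hyphen and the line break are removed only when the next line
    starts with a lowercase letter; otherwise the text is left unchanged.
    """
    def join(match: re.Match) -> str:
        next_char = match.group(1)
        if next_char.islower():
            return ""          # drop hyphen, line break and indentation
        return match.group(0)  # e.g. uppercase or digit: keep as written

    return _LINE_BREAK_HYPHEN.sub(join, text)
```

As noted above, the heuristic is not foolproof: a genuinely hyphenated compound that happens to break at the hyphen before a lowercase part would be joined as well, while a break before an uppercase letter is left alone.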