I see, the list of exceptions makes this a lot more complicated than I thought... Thanks a lot, Erik!
Markus

On 6/13/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
> On Jun 13, 2005, at 7:08 AM, Markus Wiederkehr wrote:
> > I work on an application that has to index OCR texts of scanned books.
> > Naturally there occur many words that are hyphenated across lines.
> >
> > I wonder if there is already an Analyzer or maybe a TokenFilter that
> > can merge those syllables back into whole words? It looks like Erik
> > Hatcher uses something like that at http://www.lucenebook.com/.
>
> Markus - you're right, I did develop something to handle hyphenated
> words for lucenebook.com. It was sort of a hack in that I had to
> build in a static list of exceptions in how I handled this, so you'll
> likely have to use caution as well. The LiaAnalyzer is this:
>
>    public TokenStream tokenStream(String fieldName, Reader reader) {
>      TokenFilter filter = new DashSplitterFilter(
>          new HyphenatedFilter(
>              new DashDashFilter(
>                  new LiaTokenizer(reader))));
>
>      filter = new LengthFilter(3, filter);
>      filter = new StopFilter(filter, stopSet);
>
>      if (stem) {
>        filter = new SnowballFilter(filter, "English");
>      }
>
>      return filter;
>    }
>
>
> And my HyphenatedFilter is this:
>
>    public class HyphenatedFilter extends TokenFilter {
>      private HashMap exceptions = new HashMap();
>
>      private static final String[] EXCEPTION_LIST = {
>        "full-text", "information-retrieval", "license-code", "old-fashioned",
>        "well-designed", "free-form", "file-based", "ramdirectory-based", "ram-based",
>        "index-modifying", "read-only",
>        "top-scoring", "most-recently-used", "queryparser-parsed",
>        "in-order", "per-document", "lower-caser", "domain-specific", "high-level",
>        "utf-encoding", "non-english", "phraseprefix-it", "all-inclusive",
>        "date-range", "computation-intensive", "hits-returning", "lower-level",
>        "number-padding", "utf-address-book", "third-party", "plain-text", "google-like",
>        "re-add", "english-specific", "file-handling", "already-created", "d-add", "d-add",
>        "hits-length", "hits-doc", "hits-score", "d-get", "writer-new", "porteranalyzer-new",
>        "writer-set", "document-new", "doc-add", "field-keyword", "field-unstored", "writer-add",
>        "writer-optimize", "queryparser-new", "porteranalyzer-new", "parser-parse", "indexsearcher-new",
>        "hitcollector-new", "searcher-doc", "searcher-search", "jakarta-lucene", "www-ibm", "java-specific",
>        "non-java", "vis--vis", "medium-sized", "browser-based", "utf-before", "concept-based",
>        "natural-language", "queue-based", "high-likelihood", "slp-or", "noisy-channel", "al-rasheed",
>        "hands-free", "top-notch", "google-esque", "search-config", "java-related",
>        "lucene-so", "lucene-tar", "lucene-jar", "lucene-demos-jar", "lucene-web", "lucene-webindex",
>        "command-line", "lucene-version", "issue-tracking"
>      };
>
>      protected HyphenatedFilter(TokenStream tokenStream) {
>        super(tokenStream);
>
>        for (int i = 0; i < EXCEPTION_LIST.length; i++) {
>          exceptions.put(EXCEPTION_LIST[i], "");
>        }
>      }
>
>      private Token savedToken;
>
>      public Token next() throws IOException {
>
>        if (savedToken != null) {
>          Token token = savedToken;
>          savedToken = null;
>          return token;
>        }
>
>        Token firstToken = input.next();
>
>        if (firstToken == null)
>          return firstToken;
>
>
>        if (firstToken.termText().endsWith("-")) {
>          String firstPart;
>          firstPart = firstToken.termText();
>
>          // consume next token
>          Token secondToken = input.next();
>          if (secondToken == null)
>            return firstToken;
>
>          String termText = firstPart.substring(0, firstPart.length() - 1)
>              + secondToken.termText();
>
>          if (exceptions.containsKey(firstPart + secondToken.termText())) {
>            savedToken = secondToken;
>            return firstToken;
>          }
>
>          return new Token(termText, firstToken.startOffset(),
>              firstToken.endOffset() + secondToken.termText().length() + 1);
>        }
>
>        return firstToken;
>      }
>    }
>
> Not all that pretty, I'm afraid, but by all means use it if it's useful.
>
>     Erik
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

--
Always remember you're unique. Just like everyone else.
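
For anyone who wants to try the merge-with-exceptions idea from the quoted HyphenatedFilter without a Lucene dependency, here is a minimal sketch on plain strings. The class name DehyphenationSketch and the method merge are hypothetical names chosen for illustration, not part of Lucene or of Erik's code. It assumes the tokenizer has already lowercased the text and left the trailing "-" on the first half of a broken word. It does not track token offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; not the Lucene TokenFilter from the quoted mail.
class DehyphenationSketch {

    /**
     * Joins a token ending in "-" with the token that follows it, unless the
     * concatenation (hyphen included) is listed in the exception set. In that
     * case both tokens are passed through unchanged.
     */
    static List<String> merge(List<String> tokens, Set<String> exceptions) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String first = tokens.get(i);
            boolean hasNext = i + 1 < tokens.size();

            // Nothing to merge: no trailing hyphen, or no following token.
            if (!first.endsWith("-") || !hasNext) {
                out.add(first);
                i++;
                continue;
            }

            String second = tokens.get(i + 1);
            if (exceptions.contains(first + second)) {
                // A genuine compound such as "read-only": keep both parts as they are.
                out.add(first);
                out.add(second);
            } else {
                // A line-break hyphen: drop it and glue the halves together.
                out.add(first.substring(0, first.length() - 1) + second);
            }
            i += 2; // both tokens have been consumed
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> exceptions = new HashSet<>(Arrays.asList("read-only", "full-text"));
        List<String> tokens = Arrays.asList("hyphen-", "ated", "read-", "only", "index");
        System.out.println(merge(tokens, exceptions)); // prints [hyphenated, read-, only, index]
    }
}
```

As in the quoted filter, the second token of an exception pair is emitted as-is and is not itself checked for a trailing hyphen. The exception set is the part that needs care: for OCR text of arbitrary books it would probably have to come from a dictionary rather than a hand-built list.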