Aspell has some support for compound words that might be useful to look at:
http://aspell.sourceforge.net/man-html/Compound-Words.html#Compound-Words

Peter

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [EMAIL PROTECTED]

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 19, 2006 10:22 AM
To: java-user@lucene.apache.org
Subject: Analysis/tokenization of compound words

Hi,

How do people typically analyze/tokenize text with compounds (e.g. German)? I took a look at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer doesn't treat compounds in any special way at all.

One way to go about this is to have a word dictionary and a tokenizer that processes input one character at a time, looking for a word match in the dictionary after each processed character. Then, CompoundWordLikeThis could be broken down into multiple tokens/words and returned as a set of tokens at the same position. However, somehow this doesn't strike me as a very smart and fast approach. What are some better approaches? If anyone has implemented anything that deals with this problem, I'd love to hear about it.

Thanks,
Otis

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
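
For reference, here is a minimal sketch of the dictionary-based splitting described in the original message. It is plain Java with no Lucene dependency. The class name CompoundSplitter, the minPartLength parameter, and the tiny word list are all assumptions made for illustration, not anything from Lucene or Aspell. Instead of matching greedily one character at a time, it records for every prefix of the word whether that prefix can be split into dictionary words, so an early wrong match does not block a valid split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a Lucene class): splits a compound into dictionary words. */
final class CompoundSplitter {
    private final Set<String> dictionary; // known words, lower-cased
    private final int minPartLength;      // assumed knob to suppress very short parts

    public CompoundSplitter(Set<String> dictionary, int minPartLength) {
        this.dictionary = dictionary;
        this.minPartLength = Math.max(1, minPartLength);
    }

    /** Returns the parts in order, or an empty list if the word cannot be fully split. */
    public List<String> split(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        int n = w.length();
        // prev[i] >= 0 means w[0..i) can be split into dictionary words,
        // and prev[i] is the start index of the last part of that split.
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        prev[0] = 0; // the empty prefix is trivially splittable
        for (int end = minPartLength; end <= n; end++) {
            for (int start = 0; start + minPartLength <= end; start++) {
                if (prev[start] >= 0 && dictionary.contains(w.substring(start, end))) {
                    prev[end] = start; // smallest start wins, so the last part is as long as possible
                    break;
                }
            }
        }
        if (n == 0 || prev[n] < 0) {
            return Collections.emptyList();
        }
        // Walk the recorded start indexes backwards, then restore reading order.
        List<String> parts = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            parts.add(w.substring(prev[end], end));
        }
        Collections.reverse(parts);
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("donau", "dampf", "schiff", "fahrt"));
        CompoundSplitter splitter = new CompoundSplitter(dict, 3);
        System.out.println(splitter.split("Donaudampfschifffahrt")); // [donau, dampf, schiff, fahrt]
    }
}
```

Inside an analyzer, the parts would presumably be emitted alongside the original token at the same position, as the question suggests; that wiring is left out here. Note that the sketch returns only one split, so ambiguous compounds and linking elements (such as the German Fugen-s) would need extra handling.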