Hi,

How do people typically analyze/tokenize text with compounds (e.g. German)?  I 
took a look at GermanAnalyzer hoping to see how one can deal with that, but it 
turns out GermanAnalyzer doesn't treat compounds in any special way at all.

One way to go about this is to have a word dictionary and a tokenizer 
that processes input one character at a time, looking for a word match 
in the dictionary after each processed character.  CompoundWordLikeThis 
could then be broken down into multiple tokens/words and returned as a 
set of tokens at the same position.  However, this doesn't strike me as 
a particularly smart or fast approach.
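
For concreteness, here is a rough sketch of that dictionary approach in 
plain Java (the class name, the greedy longest-match strategy, and the 
toy dictionary are all just for illustration):

    import java.util.*;

    // Sketch of dictionary-based compound splitting.  The dictionary
    // is a hypothetical Set of known words; a real implementation
    // would load one from a word-list file.
    public class CompoundSplitter {
        private final Set<String> dictionary;

        public CompoundSplitter(Set<String> dictionary) {
            this.dictionary = dictionary;
        }

        // Greedy longest-match decomposition: at each offset, take the
        // longest dictionary word starting there, then continue after
        // it.  Returns the input unsplit if no decomposition is found.
        public List<String> split(String compound) {
            List<String> parts = new ArrayList<String>();
            int pos = 0;
            while (pos < compound.length()) {
                int end = -1;
                for (int i = compound.length(); i > pos; i--) {
                    if (dictionary.contains(compound.substring(pos, i))) {
                        end = i;
                        break;
                    }
                }
                if (end == -1) {                    // no match: give up
                    return Collections.singletonList(compound);
                }
                parts.add(compound.substring(pos, end));
                pos = end;
            }
            return parts;
        }

        public static void main(String[] args) {
            Set<String> dict = new HashSet<String>(
                Arrays.asList("compound", "word", "like", "this"));
            // prints [compound, word, like, this]
            System.out.println(new CompoundSplitter(dict)
                .split("compoundwordlikethis"));
        }
    }

In Lucene this would presumably live in a TokenFilter that emits the 
parts with a position increment of 0, so they all land at the position 
of the original compound.  Greedy matching can also split the wrong way 
in cases where backtracking would find a valid decomposition, which is 
part of why it feels clumsy.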
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd love to 
hear about it.

Thanks,
Otis


