Thanks for the pointers, Pasquale! Otis
----- Original Message ---- From: Pasquale Imbemba <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Saturday, September 23, 2006 4:24:16 AM Subject: Re: Analysis/tokenization of compound words Otis, I forgot to mention that I make use of Lucene for noun retrieval from the lexicon. Pasquale Pasquale Imbemba ha scritto: > Hi Otis, > > I am completing my bachelor thesis at the Free University of Bolzano > (www.unibz.it). My project is exactly about what you need: a word > splitter for German compound words. Raffaella Bernardi who is reading > in CC is my supervisor. > As some from the lucene mailing list has already suggested, I have > used the lexicon of German nouns extracted from Morphy > (http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). As for > the splitting algorithm, I have used the one Maaten De Rijke and > Christof Monz have published in /Shallow Morphological Analysis in > Monolingual > Information Retrieval for Dutch, German and Italian /(website here > <http://www.dcs.qmul.ac.uk/%7Echristof/>, document here > <http://www.dcs.qmul.ac.uk/%7Echristof/publications/clef-2001-post.pdf>). > I did some testing and minor improvement on it (as I needed to > "adjust" it for the solution I was working on) and could send you my > thesis paper (actually still in draft state), which contains > statistical data on correctness. > > Let me know > Pasquale > > Otis Gospodnetic ha scritto: >> Hi, >> >> How do people typically analyze/tokenize text with compounds (e.g. >> German)? I took a look at GermanAnalyzer hoping to see how one can >> deal with that, but it turns out GermanAnalyzer doesn't treat >> compounds in any special way at all. >> >> One way to go about this is to have a word dictionary and a tokenizer >> that processes input one character at a time, looking for a word >> match in the dictionary after each processed characters. Then, >> CompoundWordLikeThis could be broken down into multiple tokens/words >> and returned at a set of tokens at the same position. However, >> somehow this doesn't strike me as a very smart and fast approach. >> What are some better approaches? >> If anyone has implemented anything that deals with this problem, I'd >> love to hear about it. >> >> Thanks, >> Otis >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> >> > -- "As far as the laws of mathematics refer to reality, they are not certain, as far as they are certain, they do not refer to reality." (Albert Einstein) --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]