Re: Analysis/tokenization of compound words (German, Chinese, etc.)

2006-11-21 Thread Bob Carpenter
eks dev wrote: Depends what you need to do with it; if you need this to be only used as "kind of stemming" for searching documents, the solution is not all that complex. If you need linguistically correct splitting, then it gets complicated. This is a very good point. Stemming for high recall is mu

Re: Analysis/tokenization of compound words

2006-09-23 Thread Otis Gospodnetic
Thanks for the pointers, Pasquale! Otis - Original Message From: Pasquale Imbemba <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Saturday, September 23, 2006 4:24:16 AM Subject: Re: Analysis/tokenization of compound words Otis, I forgot to mention that I make use of

Re: Analysis/tokenization of compound words

2006-09-23 Thread Otis Gospodnetic
ation or just n-gram the input). Guess who their biggest customer is? Hint: starts with the letter G. Otis - Original Message From: Marvin Humphrey <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Saturday, September 23, 2006 11:14:49 AM Subject: Re: Analysis/toke
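
The n-gram fallback mentioned in the snippet above (index overlapping character n-grams rather than trying to split compounds linguistically) can be sketched in a few lines. This is a minimal illustration, not Lucene's own tokenizer; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramDemo {
    // Emit all overlapping character n-grams of the given size.
    // If both documents and queries are indexed this way, a query
    // term such as "blume" can match inside the compound
    // "blumenstrauss" with no dictionary or language-specific rules.
    public static List<String> ngrams(String term, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            out.add(term.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("blumenstrauss", 4));
    }
}
```

The trade-off is index size and some precision loss: every term explodes into many grams, and spurious substring matches become possible.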

Re: Analysis/tokenization of compound words

2006-09-23 Thread Marvin Humphrey
On Sep 20, 2006, at 12:07 AM, Daniel Naber wrote: Writing a decomposer is difficult as you need both a large dictionary *without* compounds and a set of rules to avoid splitting at too many positions. Conceptually, how different is the problem of decompounding German from tokenizing languag

Re: Analysis/tokenization of compound words

2006-09-23 Thread Pasquale Imbemba
Otis, I forgot to mention that I make use of Lucene for noun retrieval from the lexicon. Pasquale. Pasquale Imbemba wrote: Hi Otis, I am completing my bachelor thesis at the Free University of Bolzano (www.unibz.it). My project is exactly about what you need: a word splitter for Germa

Re: Analysis/tokenization of compound words

2006-09-23 Thread Pasquale Imbemba
Hi Otis, I am completing my bachelor thesis at the Free University of Bolzano (www.unibz.it). My project is exactly about what you need: a word splitter for German compound words. Raffaella Bernardi, who is in CC, is my supervisor. As someone from the lucene mailing list has already suggest

RE: Analysis/tokenization of compound words

2006-09-21 Thread Binkley, Peter
Aspell has some support for compound words that might be useful to look at: http://aspell.sourceforge.net/man-html/Compound-Words.html#Compound-Words Peter Peter Binkley Digital Initiatives Technology Librarian Information Technology Services 4-30 Cameron Library University of Alberta Libraries

Re: Analysis/tokenization of compound words

2006-09-20 Thread karl wettin
On Tue, 2006-09-19 at 09:21 -0700, Otis Gospodnetic wrote: > > How do people typically analyze/tokenize text with compounds (e.g. > German)? I took a look at GermanAnalyzer hoping to see how one can > deal with that, but it turns out GermanAnalyzer doesn't treat > compounds in any special way at

Re: Analysis/tokenization of compound words

2006-09-20 Thread Daniel Naber
On Tuesday 19 September 2006 22:15, eks dev wrote: > Daniel Naber did some work with German dictionaries as well, if I > recall correctly; maybe he has something that helps The company I work for offers a commercial Java component for decomposing and lemmatizing German words, see http://demo.intrafi

Re: Analysis/tokenization of compound words

2006-09-19 Thread Daniel Naber
On Tuesday 19 September 2006 22:41, eks dev wrote: > ahh, another one: when you strip a suffix, check if the last char on the remaining > "stem" is "s" (a magic thing in German), and delete it if it is not the only > letter. Do not ask why; a long unexplained mystery of the German language. This is called "Fugenelement" a
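
The linking-"s" rule eks dev describes (after peeling a suffix off a compound, drop a trailing "s" from what remains unless it is the only letter) is simple to state in code. A sketch with hypothetical names:

```java
public class FugenElement {
    // After splitting a known right-hand part off a German compound,
    // the remaining stem may end in a linking "s" (Fugenelement),
    // e.g. "arbeitsamt" -> "arbeits" + "amt". Drop that "s" before
    // looking the stem up in the dictionary, unless it is the only
    // letter left.
    public static String stripLinkingS(String stem) {
        if (stem.length() > 1 && stem.endsWith("s")) {
            return stem.substring(0, stem.length() - 1);
        }
        return stem;
    }
}
```

In practice you would try the dictionary lookup first and strip the "s" only when the lookup fails, since stems like "haus" legitimately end in "s"; applied blindly, the rule is recall-oriented stemming rather than correct splitting, exactly the distinction drawn earlier in the thread.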

Re: Analysis/tokenization of compound words

2006-09-19 Thread eks dev
OTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, 19 September, 2006 10:15:04 PM Subject: Re: Analysis/tokenization of compound words Hi Otis, Depends what you need to do with it; if you need this to be only used as "kind of stemming" for searching documents, the solution is not all that comple

Re: Analysis/tokenization of compound words

2006-09-19 Thread eks dev
Hi Otis, Depends what you need to do with it; if you need this to be only used as "kind of stemming" for searching documents, the solution is not all that complex. If you need linguistically correct splitting, then it gets complicated. For the first case: build a SuffixTree with your dictionary (hope you
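
eks dev's recipe (match dictionary entries against the end of the word, peel them off, and handle the linking "s") can be approximated without a suffix tree by a greedy longest-suffix loop over a plain hash set; the suffix tree only makes the candidate lookup faster. A sketch under that simplification, with made-up names and a toy lexicon:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

public class Decompounder {
    // Greedy right-to-left decompounding: repeatedly peel the longest
    // dictionary word off the end of the compound; a leftover linking
    // "s" (Fugenelement) is dropped before the next round. If some
    // remainder cannot be matched, give up and keep the whole word,
    // which is the safe choice for a recall-oriented index.
    public static List<String> split(String word, Set<String> lexicon, int minPart) {
        Deque<String> parts = new ArrayDeque<>();
        String rest = word.toLowerCase();
        while (!rest.isEmpty()) {
            String match = null;
            // the smallest i gives the longest suffix candidate
            for (int i = 0; i <= rest.length() - minPart; i++) {
                String cand = rest.substring(i);
                if (lexicon.contains(cand)) {
                    match = cand;
                    break;
                }
            }
            if (match == null) {
                return List.of(word); // unsplittable: keep the compound intact
            }
            parts.addFirst(match);
            rest = rest.substring(0, rest.length() - match.length());
            if (rest.length() > 1 && rest.endsWith("s")) {
                rest = rest.substring(0, rest.length() - 1); // drop Fugen-"s"
            }
        }
        return new ArrayList<>(parts);
    }

    public static void main(String[] args) {
        Set<String> lex = Set.of("arbeit", "amt", "wohn", "haus");
        System.out.println(split("Arbeitsamt", lex, 3)); // [arbeit, amt]
    }
}
```

A real implementation would need the large compound-free dictionary mentioned elsewhere in the thread, a tuned minimum-part length, and some policy for ambiguous splits; greedy longest-suffix is only one heuristic among several.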

Re: Analysis/tokenization of compound words

2006-09-19 Thread Marvin Humphrey
On Sep 19, 2006, at 9:21 AM, Otis Gospodnetic wrote: How do people typically analyze/tokenize text with compounds (e.g. German)? I took a look at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer doesn't treat compounds in any special way at all. O

Re: Analysis/tokenization of compound words

2006-09-19 Thread Jonathan O'Connor
Otis, I can't offer you any practical advice, but as a student of German, I can tell you that beginners find it difficult to read German words and split them properly. The larger your vocabulary, the easier it is. The whole topic sounds like an AI problem: a possible algorithm for German (no ide