eks dev wrote:
Depends on what you need to do with it. If you need this only as a "kind of
stemming" for searching documents, the solution is not all that complex. If you
need linguistically correct splitting, then it gets complicated.
This is a very good point. Stemming for
high recall is mu
Thanks for the pointers, Pasquale!
Otis
- Original Message
From: Pasquale Imbemba <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, September 23, 2006 4:24:16 AM
Subject: Re: Analysis/tokenization of compound words
Otis,
I forgot to mention that I make use of
ation or just n-gram the input). Guess who their
biggest customer is? Hint: starts with the letter G.
Otis
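The n-gram fallback mentioned above can be sketched without a dictionary at all. This is a minimal illustration (the class name NGramSketch is invented for this example, not a Lucene class): index overlapping n-letter slices of each term, so a compound and its parts share grams and match each other without any decompounding.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal character n-gram sketch: "fussball" and "ball" both yield the
// trigrams "bal" and "all", so indexing grams lets a search for one find
// the other with no dictionary or splitting rules.
public class NGramSketch {
    public static List<String> ngrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }
}
```

The price is index size and some false matches; the gain is robustness on compounds the dictionary has never seen.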
- Original Message
From: Marvin Humphrey <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, September 23, 2006 11:14:49 AM
Subject: Re: Analysis/tokenization of compound words
On Sep 20, 2006, at 12:07 AM, Daniel Naber wrote:
Writing a decomposer is difficult as you need both a large dictionary
*without* compounds and a set of rules to avoid splitting at too many
positions.
Conceptually, how different is the problem of decompounding German
from tokenizing languag
Otis,
I forgot to mention that I make use of Lucene for noun retrieval from
the lexicon.
Pasquale
Pasquale Imbemba wrote:
Hi Otis,
I am completing my bachelor thesis at the Free University of Bolzano
(www.unibz.it). My project is exactly about what you need: a word
splitter for German compound words.
Hi Otis,
I am completing my bachelor thesis at the Free University of Bolzano
(www.unibz.it). My project is exactly about what you need: a word
splitter for German compound words. Raffaella Bernardi who is reading in
CC is my supervisor.
As some from the lucene mailing list has already suggest
Aspell has some support for compound words that might be useful to look
at:
http://aspell.sourceforge.net/man-html/Compound-Words.html#Compound-Words
Peter
Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
On Tue, 2006-09-19 at 09:21 -0700, Otis Gospodnetic wrote:
>
> How do people typically analyze/tokenize text with compounds (e.g.
> German)? I took a look at GermanAnalyzer hoping to see how one can
> deal with that, but it turns out GermanAnalyzer doesn't treat
> compounds in any special way at all.
On Tuesday 19 September 2006 22:15, eks dev wrote:
> Daniel Naber did some work with German dictionaries as well; if I
> recall correctly, he may have something that helps
The company I work for offers a commercial Java component for decomposing
and lemmatizing German words, see http://demo.intrafi
On Tuesday 19 September 2006 22:41, eks dev wrote:
> ahh, another one: when you strip a suffix, check if the last char on the
> remaining "stem" is "s" (magic thing in German), and delete it if it is not
> the only letter. Do not ask why; it's a long-unexplained mystery of the German language.
This is called "Fugenelement" a
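The "magic s" (Fugenelement) rule described above fits in a few lines. This is a hedged sketch with invented names, and it is deliberately naive: it strips any trailing "s" from a multi-letter stem, including ones that genuinely belong to the word.

```java
// Sketch of the linking-s rule from eks dev's post: after splitting off the
// right-hand part of a compound (e.g. "Arbeits|zimmer"), drop a trailing "s"
// from the remaining stem unless it is the stem's only letter. Naive on
// purpose: it cannot tell a Fugen-s from an "s" that is part of the word.
public class FugenS {
    public static String stripLinkingS(String stem) {
        if (stem.length() > 1 && stem.endsWith("s")) {
            return stem.substring(0, stem.length() - 1);
        }
        return stem;
    }
}
```

For example, stripLinkingS("Arbeits") gives "Arbeit", which recovers a dictionary noun from the head of "Arbeitszimmer".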
From: eks dev <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, 19 September, 2006 10:15:04 PM
Subject: Re: Analysis/tokenization of compound words
Hi Otis,
Depends on what you need to do with it. If you need this only as a "kind of
stemming" for searching documents, the solution is not all that complex. If you
need linguistically correct splitting, then it gets complicated.
for the first case:
Build SuffixTree with your dictionary (hope you
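For the first (search-oriented) case, the dictionary-driven split sketched here might look like the following. A plain Set lookup stands in for the suffix tree for brevity, and all names are invented; it tries the longest known prefix first and keeps a split only if every remaining piece is also a dictionary word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hedged sketch of dictionary-based decompounding. A Set<String> of known
// lowercased nouns replaces the suffix tree; a real implementation would
// also apply the Fugenelement rules discussed in this thread before
// looking parts up.
public class Decompounder {
    // Returns the parts of `word`, or null if no full decomposition exists.
    public static List<String> split(String word, Set<String> dict) {
        if (dict.contains(word)) {
            return List.of(word); // the whole word is already a known term
        }
        // Try split points right-to-left so the longest known prefix wins;
        // backtrack to shorter prefixes if the remainder cannot be split.
        for (int i = word.length() - 1; i > 0; i--) {
            String head = word.substring(0, i);
            if (dict.contains(head)) {
                List<String> rest = split(word.substring(i), dict);
                if (rest != null) {
                    List<String> parts = new ArrayList<>();
                    parts.add(head);
                    parts.addAll(rest);
                    return parts;
                }
            }
        }
        return null; // no decomposition over this dictionary
    }
}
```

With a dictionary containing "fussball" and "spieler", splitting "fussballspieler" yields those two parts; unknown words come back as null rather than being split at arbitrary positions, which addresses the "too many positions" concern raised earlier in the thread.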
On Sep 19, 2006, at 9:21 AM, Otis Gospodnetic wrote:
How do people typically analyze/tokenize text with compounds (e.g.
German)? I took a look at GermanAnalyzer hoping to see how one can
deal with that, but it turns out GermanAnalyzer doesn't treat
compounds in any special way at all.
O
Otis,
I can't offer you any practical advice, but as a student of German, I can tell you that beginners find it difficult to read German words and split them properly. The larger your vocabulary, the easier it is. The whole topic sounds like an AI problem:
A possible algorithm for German (no ide