Caveat to the below is that I am very new to Lucene. (That said though,
following the below strategy, after a couple of days' work I have a set of
per-field analyzers for various languages, using various custom filters,
caching of initial analysis; and capable of outputting stemmed, reversed,
diacri
http://snowball.tartarus.org/ for stemming
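For illustration only, a minimal sketch of that kind of per-field setup, assuming Lucene 3.x-era classes (StandardTokenizer, LowerCaseFilter, the contrib SnowballFilter, PerFieldAnalyzerWrapper); the field names and filter choices here are examples, not the actual setup described above:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// One analyzer per language: StandardTokenizer -> lowercase -> Snowball stemmer.
class StemmingAnalyzer extends Analyzer {
    private final String stemmerName;   // e.g. "English", "German"

    StemmingAnalyzer(String stemmerName) { this.stemmerName = stemmerName; }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_33, reader);
        ts = new LowerCaseFilter(Version.LUCENE_33, ts);
        return new SnowballFilter(ts, stemmerName);
    }
}

// Per-field wiring, e.g. when building the IndexWriter:
//   PerFieldAnalyzerWrapper analyzer =
//       new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_33));
//   analyzer.addAnalyzer("title_en", new StemmingAnalyzer("English"));
//   analyzer.addAnalyzer("title_de", new StemmingAnalyzer("German"));

Other filters (e.g. ASCIIFoldingFilter for diacritics, or the contrib ReverseStringFilter for reversed forms) chain in the same way.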
2011/8/22 Saar Carmi
> Hi
> Where can I find a guide for building analyzers, filters and tokenizers?
>
> Saar
>
You could write your own analyzer that worked out a boost as it
analyzed the document fields and had a getBoost() method that you
would call to get the value to add to the document as a separate
field. If you write your own you can pass it what you like and it can
do whatever you want.
--
Ian.
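A rough sketch of the approach Ian describes, assuming Lucene 3.x-style APIs; the class name, the preferred-term set, and the boost formula are invented for illustration:

import java.io.IOException;
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Counts how many analyzed tokens fall in a "preferred" term set; the caller reads
// getBoost() after the document's fields have been fed through the analyzer.
class BoostingAnalyzer extends Analyzer {
    private final Set<String> preferredTerms;
    private int hits = 0;

    BoostingAnalyzer(Set<String> preferredTerms) { this.preferredTerms = preferredTerms; }

    public float getBoost() { return 1.0f + 0.1f * hits; }   // made-up formula

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new CountingFilter(new StandardTokenizer(Version.LUCENE_33, reader));
    }

    private class CountingFilter extends TokenFilter {
        private final CharTermAttribute term = addAttribute(CharTermAttribute.class);
        CountingFilter(TokenStream in) { super(in); }
        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) return false;
            if (preferredTerms.contains(term.toString())) hits++;
            return true;
        }
    }
}

After analysis you would call getBoost() and store the value as a separate field on the document, as Ian suggests, or use it however else suits you.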
Hi Anshum-
> You might want to look at writing a custom analyzer or something and
> add a
> document boost (while indexing) for documents containing those terms.
Do you know how to access the document from an analyzer? It seems to only have
access to the field...
Thanks,
-Chris
---
Hi Cristopher,
You might want to look at writing a custom analyzer or something and add a
document boost (while indexing) for documents containing those terms.
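A rough sketch of the index-time document boost idea, assuming pre-4.0 Lucene (where Document.setBoost is available); the class name, field name, boost value, and the naive substring check are all invented for illustration:

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Documents containing one of the preferred phrases get a higher document boost.
class PreferredPhraseIndexer {
    static void addBoosted(IndexWriter writer, String text, Set<String> preferredPhrases)
            throws IOException {
        Document doc = new Document();
        doc.add(new Field("body", text, Field.Store.NO, Field.Index.ANALYZED));
        for (String phrase : preferredPhrases) {
            if (text.contains(phrase)) {      // naive substring check, illustration only
                doc.setBoost(2.0f);           // made-up boost value
                break;
            }
        }
        writer.addDocument(doc);
    }
}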
--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com
The facts expressed here belong to everybody, the opinions to me. The
distinctio
Perhaps a better question: let's say I have a few thousand terms or phrases. I
want to prefer documents with these phrases in my search results over documents
that do not have these terms or phrases. What's the best way to accomplish this?
Thanks,
-Chris
> -Original Message-
> From: Chri
You can use your Analyzer to get a token stream from any text you give
it, just like Lucene does.
Something like:
String text = "your list of words to analyze and tokenize";
TokenStream ts = YOUR_ANALYZER.tokenStream(null, new StringReader(text));
Token token = new Token();
while ((token = ts.next(token)) != null) {
    // each non-null result is the next token the analyzer produced
}
eks dev wrote:
Depends what you need to do with it. If you need this to be used only as a "kind of
stemming" for searching documents, the solution is not all that complex. If you need
linguistically correct splitting, then it gets complicated.
This is a very good point. Stemming for
high recall is mu
Thanks for the pointers, Pasquale!
Otis
- Original Message
From: Pasquale Imbemba <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, September 23, 2006 4:24:16 AM
Subject: Re: Analysis/tokenization of compound words
Otis,
I forgot to mention that I make use of
ation or just n-gram the input). Guess who their
biggest customer is? Hint: starts with the letter G.
Otis
- Original Message
From: Marvin Humphrey <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, September 23, 2006 11:14:49 AM
Subject: Re: Analysis/toke
On Sep 20, 2006, at 12:07 AM, Daniel Naber wrote:
Writing a decomposer is difficult as you need both a large dictionary
*without* compounds and a set of rules to avoid splitting at too many
positions.
Conceptually, how different is the problem of decompounding German
from tokenizing languag
Otis,
I forgot to mention that I make use of Lucene for noun retrieval from
the lexicon.
Pasquale
Pasquale Imbemba wrote:
Hi Otis,
I am completing my bachelor thesis at the Free University of Bolzano
(www.unibz.it). My project is exactly about what you need: a word
splitter for Germa
Hi Otis,
I am completing my bachelor thesis at the Free University of Bolzano
(www.unibz.it). My project is exactly about what you need: a word
splitter for German compound words. Raffaella Bernardi who is reading in
CC is my supervisor.
As some from the lucene mailing list has already suggest
Aspell has some support for compound words that might be useful to look
at:
http://aspell.sourceforge.net/man-html/Compound-Words.html#Compound-Words
Peter
Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
On Tue, 2006-09-19 at 09:21 -0700, Otis Gospodnetic wrote:
>
> How do people typically analyze/tokenize text with compounds (e.g.
> German)? I took a look at GermanAnalyzer hoping to see how one can
> deal with that, but it turns out GermanAnalyzer doesn't treat
> compounds in any special way at
On Tuesday 19 September 2006 22:15, eks dev wrote:
> Daniel Naber has done some work with German dictionaries as well, if I
> recall correctly; maybe he has something that helps
The company I work for offers a commercial Java component for decomposing
and lemmatizing German words, see http://demo.intrafi
On Tuesday 19 September 2006 22:41, eks dev wrote:
> Ahh, another one: when you strip a suffix, check whether the last char on the
> remaining "stem" is "s" (a magic thing in German), and delete it if it is not the
> only letter. Do not ask why; it is a long unexplained mystery of the German language.
This is called "Fugenelement" a
From: eks dev <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, 19 September, 2006 10:15:04 PM
Subject: Re: Analysis/tokenization of compound words
Hi Otis,
Depends what you need to do with it. If you need this to be used only as a "kind
of stemming" for searching documents, the solution is not all that comple
Hi Otis,
Depends what you need to do with it. If you need this to be used only as a "kind
of stemming" for searching documents, the solution is not all that complex. If you
need linguistically correct splitting, then it gets complicated.
for the first case:
Build SuffixTree with your dictionary (hope you
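A much-simplified sketch of that dictionary-splitting idea (a plain word set standing in for the suffix tree), with the trailing-"s" Fugenelement handling suggested above; all names and heuristics are illustrative only:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Recursive dictionary splitter: greedily tries the longest known prefix first and
// backtracks; also retries a prefix after dropping a trailing "s" (the Fugenelement).
class CompoundSplitter {
    private final Set<String> lexicon;   // simple (non-compound) words, lowercased

    CompoundSplitter(Set<String> lexicon) { this.lexicon = lexicon; }

    List<String> split(String word) {
        List<String> parts = new ArrayList<String>();
        if (splitInto(word.toLowerCase(), parts)) {
            return parts;
        }
        return Collections.singletonList(word);   // no full decomposition found: keep whole
    }

    private boolean splitInto(String rest, List<String> parts) {
        if (rest.length() == 0) return true;
        for (int end = rest.length(); end >= 2; end--) {
            String head = rest.substring(0, end);
            if (!lexicon.contains(head) && head.endsWith("s") && head.length() > 2) {
                head = head.substring(0, head.length() - 1);   // drop Fugen-s and retry
            }
            if (lexicon.contains(head)) {
                parts.add(head);
                if (splitInto(rest.substring(end), parts)) return true;
                parts.remove(parts.size() - 1);                // backtrack
            }
        }
        return false;
    }
}

For example, with a lexicon containing "staat" and "feind", split("Staatsfeind") comes back as [staat, feind]. Inside an analyzer the parts would typically be emitted as extra tokens at the same position as the original compound, so both forms remain searchable.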
On Sep 19, 2006, at 9:21 AM, Otis Gospodnetic wrote:
How do people typically analyze/tokenize text with compounds (e.g.
German)? I took a look at GermanAnalyzer hoping to see how one can
deal with that, but it turns out GermanAnalyzer doesn't treat
compounds in any special way at all.
O
Otis,
I can't offer you any practical advice, but as a student of German, I can tell you that beginners find it difficult to read German words and split them properly. The larger your vocabulary, the easier it is. The whole topic sounds like an AI problem:
A possible algorithm for German (no ide
On 1 Nov 2005, at 11:02, Malcolm wrote:
Hi,
I've been reading my new project bible 'Lucene in Action'
Amen! ;)
about Analysis in Chapter 4 and wondered what others are doing for
indexing XML (if anyone else is, that is!).
Are you folks just writing your own or utilising the current Lucene
I'm currently indexing the INEX collection and then performing queries on
the format features within the text. I am using a wide range of the XML
features. The reason I asked about the XML analysis is that I am interested in
opinions and reasons for adding a wide range of discussion to my
dissertation.
Not sure I am understanding your question correctly, but I think you
want to pick your Analyzer based on what is in your content (i.e.
language, usage of special symbols, etc.), not based on what the format
of your content is (i.e. XML).
Malcolm wrote:
Hi,
I'm just asking for opinions on Ana
so I will end up using my own custom analyzer class primarily
based on the StandardAnalyzer but modifying the stop word list.
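For example, something along these lines (constructor form from the Lucene 1.x era of this thread; the stop list below is made up):

// StandardAnalyzer with a custom stop word list instead of the built-in English defaults.
String[] myStopWords = { "the", "a", "an", "and", "or" };
Analyzer analyzer = new StandardAnalyzer(myStopWords);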
> -Original Message-
> From: Malcolm [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 01, 2005 11:19 AM
> To: java-user@lucene.apache.org
&
Hi,
I'm just asking for opinions on Analyzers for the indexing. For example,
Otis in his article uses the WhitespaceAnalyzer and the Sandbox program uses
the StandardAnalyzer. I am just gauging opinions on the subject with regard
to XML.
I'm using a mix of the Sandbox XMLDocumentHandlerSAX and a
Not exactly sure what you're asking with regards to Analyzers and
parsing XML...
But for parsing and indexing XML documents with Lucene, you can find a
lot of material out there by searching the list archives and using
google. However, the document I found most helpful was this piece
written by Otis.