Re: Analysis

2011-08-22 Thread Graham Sugden
A caveat to the below: I am very new to Lucene. (That said, following the below strategy, after a couple of days' work I have a set of per-field analyzers for various languages, using various custom filters, with caching of the initial analysis, and capable of outputting stemmed, reversed, diacri

Re: Analysis

2011-08-22 Thread Mihai Caraman
http://snowball.tartarus.org/ for stemming 2011/8/22 Saar Carmi > Hi > Where can I find a guide for building analyzers, filters and tokenizers? > > Saar >
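
For reference, a minimal custom analyzer of the kind Saar asks about might look like the sketch below: it chains a StandardTokenizer, a LowerCaseFilter, and the Snowball stemmer Mihai points to. This is only a sketch against the Lucene 3.x-era API and the contrib snowball module; the class name, Version constant, and stemmer language are illustrative, not from the thread.

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class StemmingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // tokenize, lower-case, then stem with the Snowball "English" stemmer
        TokenStream ts = new StandardTokenizer(Version.LUCENE_33, reader);
        ts = new LowerCaseFilter(Version.LUCENE_33, ts);
        return new SnowballFilter(ts, "English");
    }
}

The filters wrap each other in the order listed, so text is tokenized, lower-cased, and then stemmed.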

Re: Analysis Question

2009-08-07 Thread Ian Lea
You could write your own analyzer that worked out a boost as it analyzed the document fields and had a getBoost() method that you would call to get the value to add to the document as a separate field. If you write your own you can pass it what you like and it can do whatever you want. -- Ian.
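
A rough sketch of what Ian describes, written against the Lucene 2.4-era TokenStream API (all names hypothetical; the delegate analyzer, the preferred-term set, and the boost formula are assumptions, not anything from the thread): a TokenFilter counts preferred terms as each field is analyzed, and getBoost() is read back afterwards.

import java.io.IOException;
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class BoostTrackingAnalyzer extends Analyzer {
    private final Analyzer delegate = new StandardAnalyzer();
    private final Set<String> preferred; // lower-cased preferred terms
    private int hits = 0;

    public BoostTrackingAnalyzer(Set<String> preferred) {
        this.preferred = preferred;
    }

    // read after analyzing one document; the formula is an arbitrary assumption
    public float getBoost() {
        return 1.0f + 0.1f * hits;
    }

    // stateful, so not thread-safe: reset before each document
    public void resetCount() {
        hits = 0;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new TokenFilter(delegate.tokenStream(fieldName, reader)) {
            public Token next(Token reusableToken) throws IOException {
                Token t = input.next(reusableToken);
                if (t != null && preferred.contains(t.term())) {
                    hits++;
                }
                return t;
            }
        };
    }
}

Per Ian's description, you would first run the document's fields through the analyzer yourself, read getBoost(), add the value to the document as a separate field (or via doc.setBoost()), and call resetCount() before the next document.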

RE: Analysis Question

2009-08-06 Thread Christopher Condit
Hi Anshum- > You might want to look at writing a custom analyzer or something and add a document boost (while indexing) for documents containing those terms. Do you know how to access the document from an analyzer? It seems to only have access to the field... Thanks, -Chris

Re: Analysis Question

2009-08-06 Thread Anshum
Hi Christopher, You might want to look at writing a custom analyzer or something and add a document boost (while indexing) for documents containing those terms. -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to me. The distinctio

RE: Analysis Question

2009-08-05 Thread Christopher Condit
Perhaps a better question: let's say I have a few thousand terms or phrases. I want to prefer documents with these phrases in my search results over documents that do not have these terms or phrases. What's the best way to accomplish this? Thanks, -Chris

Re: analysis filter wrapper

2009-05-14 Thread Joel Halbert
You can use your Analyzer to get a token stream from any text you give it, just like Lucene does. Something like: String text = "your list of words to analyze and tokenize"; TokenStream ts = YOUR_ANALYZER.tokenStream(null, new StringReader(text)); Token token = new Token(); while ((token = ts.next(token)) != null) { /* inspect token.term() here */ }

Re: Analysis/tokenization of compound words (German, Chinese, etc.)

2006-11-21 Thread Bob Carpenter
eks dev wrote: Depends what you need to do with it; if you need this to be only used as "kind of stemming" for searching documents, the solution is not all that complex. If you need linguistically correct splitting then it gets complicated. This is a very good point. Stemming for high recall is mu

Re: Analysis/tokenization of compound words

2006-09-23 Thread Otis Gospodnetic
Thanks for the pointers, Pasquale! Otis

Re: Analysis/tokenization of compound words

2006-09-23 Thread Otis Gospodnetic
ation or just n-gram the input). Guess who their biggest customer is? Hint: starts with the letter G. Otis

Re: Analysis/tokenization of compound words

2006-09-23 Thread Marvin Humphrey
On Sep 20, 2006, at 12:07 AM, Daniel Naber wrote: Writing a decomposer is difficult as you need both a large dictionary *without* compounds and a set of rules to avoid splitting at too many positions. Conceptually, how different is the problem of decompounding German from tokenizing languag

Re: Analysis/tokenization of compound words

2006-09-23 Thread Pasquale Imbemba
Otis, I forgot to mention that I make use of Lucene for noun retrieval from the lexicon. Pasquale

Re: Analysis/tokenization of compound words

2006-09-23 Thread Pasquale Imbemba
Hi Otis, I am completing my bachelor thesis at the Free University of Bolzano (www.unibz.it). My project is exactly about what you need: a word splitter for German compound words. Raffaella Bernardi, who is reading in CC, is my supervisor. As some on the Lucene mailing list have already suggest

RE: Analysis/tokenization of compound words

2006-09-21 Thread Binkley, Peter
Aspell has some support for compound words that might be useful to look at: http://aspell.sourceforge.net/man-html/Compound-Words.html#Compound-Words Peter Peter Binkley Digital Initiatives Technology Librarian Information Technology Services 4-30 Cameron Library University of Alberta Libraries

Re: Analysis/tokenization of compound words

2006-09-20 Thread karl wettin
On Tue, 2006-09-19 at 09:21 -0700, Otis Gospodnetic wrote: > > How do people typically analyze/tokenize text with compounds (e.g. > German)? I took a look at GermanAnalyzer hoping to see how one can > deal with that, but it turns out GermanAnalyzer doesn't treat > compounds in any special way at

Re: Analysis/tokenization of compound words

2006-09-20 Thread Daniel Naber
On Tuesday 19 September 2006 22:15, eks dev wrote: > Daniel Naber made some work with German dictionaries as well, if I > recall well, maybe he has something that helps The company I work for offers a commercial Java component for decomposing and lemmatizing German words, see http://demo.intrafi

Re: Analysis/tokenization of compound words

2006-09-19 Thread Daniel Naber
On Tuesday 19 September 2006 22:41, eks dev wrote: > ahh, another one: when you strip a suffix, check whether the last char on the remaining > "stem" is "s" (magic thing in German), and delete it if it is not the only > letter. Do not ask why; a long unexplained mystery of the German language. This is called "Fugenelement" a

Re: Analysis/tokenization of compound words

2006-09-19 Thread eks dev
Hi Otis, Depends what you need to do with it: if you need this to be only used as "kind of stemming" for searching documents, the solution is not all that complex. If you need linguistically correct splitting then it gets complicated. For the first case: build a SuffixTree with your dictionary (hope you
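
A hedged sketch of that first case, with a plain HashSet standing in for the SuffixTree (same splitting logic, none of the lookup speed), including the linking "s" that Daniel Naber names the Fugenelement above. All names are illustrative:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class Decompounder {
    private static final int MIN_PART = 3; // shortest part we accept
    private final Set<String> dict;        // compound-free dictionary, lower-cased

    public Decompounder(Set<String> dict) {
        this.dict = dict;
    }

    // returns the parts of the compound, or null if no full split exists
    public List<String> split(String word) {
        if (dict.contains(word)) {
            return Collections.singletonList(word);
        }
        for (int i = word.length() - MIN_PART; i >= MIN_PART; i--) {
            String head = word.substring(0, i);
            if (!dict.contains(head)) {
                // "Fugen-s": retry with a trailing linking 's' stripped
                if (head.endsWith("s") && dict.contains(head.substring(0, head.length() - 1))) {
                    head = head.substring(0, head.length() - 1);
                } else {
                    continue;
                }
            }
            List<String> rest = split(word.substring(i));
            if (rest != null) {
                List<String> parts = new ArrayList<String>();
                parts.add(head);
                parts.addAll(rest);
                return parts;
            }
        }
        return null;
    }
}

Given a dictionary containing "arbeit" and "amt", split("arbeitsamt") returns [arbeit, amt]; when no full split exists it returns null, so the caller can fall back to indexing the unsplit token.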

Re: Analysis/tokenization of compound words

2006-09-19 Thread Marvin Humphrey
On Sep 19, 2006, at 9:21 AM, Otis Gospodnetic wrote: How do people typically analyze/tokenize text with compounds (e.g. German)? I took a look at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer doesn't treat compounds in any special way at all. O

Re: Analysis/tokenization of compound words

2006-09-19 Thread Jonathan O'Connor
Otis, I can't offer you any practical advice, but as a student of German, I can tell you that beginners find it difficult to read German words and split them properly. The larger your vocabulary the easier it is. The whole topic sounds like an AI problem: A possible algorithm for German (no ide

Re: Analysis

2005-11-01 Thread Erik Hatcher
On 1 Nov 2005, at 11:02, Malcolm wrote: Hi, I've been reading my new project bible 'Lucene in Action' Amen! ;) about Analysis in Chapter 4 and wondered what others are doing for indexing XML (if anyone else is, that is!). Are you folks just writing your own or utilising the current Lucene

Re: Analysis

2005-11-01 Thread Malcolm
I'm currently indexing the INEX collection and then performing queries on the format features within the text. I am using a wide range of the XML features. The reason I asked about the XML analysis is that I am interested in opinions and reasons for adding a wide range of discussion to my dissertat

Re: Analysis

2005-11-01 Thread Grant Ingersoll
Not sure I am understanding your question correctly, but I think you want to pick your Analyzer based on what is in your content (i.e. language, usage of special symbols, etc.), not based on what the format of your content is (i.e. XML). Malcolm wrote: Hi, I'm just asking for opinions on Ana

RE: Analysis

2005-11-01 Thread Peter Kim
so I will end up using my own custom analyzer class primarily based on the StandardAnalyzer but modifying the stop word list.
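
Something along those lines might look like this sketch (Lucene 1.x/2.x-era API; the extra stop words are made up for illustration). StandardAnalyzer has a constructor that takes a custom stop list, so the built-in English list can be extended rather than replaced:

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class CustomStopAnalyzer {
    // illustrative extras appended to Lucene's built-in English stop list
    private static final String[] EXTRA = { "figure", "table", "section" };

    public static StandardAnalyzer create() {
        String[] base = StopAnalyzer.ENGLISH_STOP_WORDS;
        String[] stop = new String[base.length + EXTRA.length];
        System.arraycopy(base, 0, stop, 0, base.length);
        System.arraycopy(EXTRA, 0, stop, base.length, EXTRA.length);
        return new StandardAnalyzer(stop); // the custom list replaces the default
    }
}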

Re: Analysis

2005-11-01 Thread Malcolm
Hi, I'm just asking for opinions on Analyzers for the indexing. For example, Otis in his article uses the WhitespaceAnalyzer and the Sandbox program uses the StandardAnalyzer. I am just gauging opinions on the subject with regard to XML. I'm using a mix of the Sandbox XMLDocumentHandlerSAX and a
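
A bare-bones sketch of the SAX route under discussion (hypothetical class, Lucene 1.4-era Field.Text API as used in Lucene in Action): flatten all element text into a single searchable field, leaving out the per-element fields a handler like the Sandbox XMLDocumentHandlerSAX builds.

import java.io.Reader;
import javax.xml.parsers.SAXParserFactory;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class FlatXmlHandler extends DefaultHandler {
    private final StringBuffer text = new StringBuffer();

    // collect the character data of every element into one buffer
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length).append(' ');
    }

    public Document toDocument() {
        Document doc = new Document();
        doc.add(Field.Text("contents", text.toString())); // tokenized and stored
        return doc;
    }

    public static Document parse(Reader xml) throws Exception {
        FlatXmlHandler handler = new FlatXmlHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(xml), handler);
        return handler.toDocument();
    }
}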

RE: Analysis

2005-11-01 Thread Peter Kim
Not exactly sure what you're asking with regard to Analyzers and parsing XML... But for parsing and indexing XML documents with Lucene, you can find a lot of material out there by searching the list archives and using Google. However, the document I found most helpful was this piece written by Ot