Re: Stop words in index

2006-09-03 Thread Chris Hostetter
: In the default StandardAnalyzer, the stop word list contains the word "on". : If I have a document which contains the phrase "Disney on Ice", the index : will show only "Disney" and "Ice", but not "on". : "Disney on Ice" : : With the quotations indicating the desire for an "exact match", the abs
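
The behaviour described in this message can be illustrated with a minimal sketch. It is plain Python, not Lucene code: the `STOP_WORDS` set and the `analyze` function are hypothetical names, and the stop list is a small illustrative subset rather than the analyzer's actual default list. It only shows why "on" never reaches the index.

```python
# Illustrative stop list only (an assumption); not Lucene's actual default list.
STOP_WORDS = {"a", "an", "the", "on", "of"}


def analyze(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


# "on" is filtered out before indexing, so only two terms remain.
print(analyze("Disney on Ice"))
```

Because the stop word is removed at analysis time, nothing in the index records that "on" ever appeared between the two remaining terms.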

Re: Stop words in index

2006-09-03 Thread Jason Polites
Hey, Just a quick addendum to this original issue. My first need was to ensure that stop words were not stored in the index (which your helpful suggestion led me to confirm); however this has raised a second more scary issue. It seems that because stop words are excluded from the index, quoted s

Re: Indexing bigrams and trigrams in Lucene

2006-09-03 Thread Chris Hostetter
: This is a text document written by someone. Read this and post your comments : : words that must be indexed: : text : document ... : text document : document written typically when people talk about indexing n-grams -- they mean character wise (so they can find words with simple spellin
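
The distinction drawn here between word-wise and character-wise n-grams can be sketched as follows. This is plain Python for illustration only, not a Lucene tokenizer; `word_ngrams` and `char_ngrams` are hypothetical helper names, and the input is assumed to be already tokenized with stop words removed.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(word, n):
    """Return every run of n consecutive characters within one word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


tokens = ["text", "document", "written"]
print(word_ngrams(tokens, 2))   # word-wise bigrams, as asked for in the question
print(char_ngrams("text", 2))   # character-wise bigrams, the more common meaning
```

Word-wise n-grams serve the use case in the original question, where multi-word terms such as "text document" must be searchable as single units; character-wise n-grams are the variant usually meant when tolerance to spelling variation is the goal.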

RE: word frequency list?

2006-09-03 Thread Dejan Nenov
Unfortunately the term search at the site is down - gives 500 internal server error. -----Original Message----- From: Dave Kor [mailto:[EMAIL PROTECTED] Sent: Sunday, September 03, 2006 9:22 PM To: java-user@lucene.apache.org Subject: Re: word frequency list? There is the Berkeley Web Term Frequ

Re: word frequency list?

2006-09-03 Thread Dave Kor
There is the Berkeley Web Term Frequency database which contains over 30 million unique terms extracted from 50 million webpages. http://elib.cs.berkeley.edu/docfreq/index.html On 8/31/06, Jason Pump <[EMAIL PROTECTED]> wrote: Is there a large list of words and their frequency in the English la

Indexing bigrams and trigrams in Lucene

2006-09-03 Thread Venkateshprasanna
I need to index bigrams and trigrams in a document. Here is an example: Text: This is a text document written by someone. Read this and post your comments words that must be indexed: text document written someone read post your comments text document document written post your your comments text

Re: How to combine multiple fields to a single field for indexing

2006-09-03 Thread KEGan
Thanks. I think I grasp the concept now :) On 8/27/06, Erik Hatcher <[EMAIL PROTECTED]> wrote: On Aug 26, 2006, at 5:11 AM, KEGan wrote: > Erik, > > "Given the position increment gap between instances of same-named > fields that is now part of Lucene, I recommend using multiple field > instanc
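
The idea of a position increment gap between instances of a same-named field can be sketched roughly like this. It is a plain Python illustration under stated assumptions, not Lucene's implementation: `assign_positions` is a hypothetical name, and the gap of 100 is an arbitrary example value.

```python
def assign_positions(values, gap=100):
    """Give each token a position, leaving `gap` unused positions between values."""
    positions = []
    next_pos = 0
    for value in values:
        for tok in value.lower().split():
            positions.append((tok, next_pos))
            next_pos += 1
        # Skip ahead so the next value's tokens are not adjacent to this one's.
        next_pos += gap
    return positions


# Two instances of the same field; "car" and "blue" end up far apart.
for tok, pos in assign_positions(["red car", "blue bike"]):
    print(tok, pos)
```

With a gap like this, a phrase or proximity match would not accidentally span the end of one field instance and the start of the next, which is what makes adding several instances of a field workable in place of concatenating the values into one string.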

Re: Phrase search using quotes -- special Tokenizer

2006-09-03 Thread Chris Hostetter
: Thanks for your input. I'm sure I could do as you suggest (and maybe that : will end up being my best option), but I had hoped to use a string for : creating the query object, particularly as some of my queries are a bit : complex. you have to clarify what you mean by "use a string for creatin

Re: Phrase search using quotes -- special Tokenizer

2006-09-03 Thread Philip Brown
Thanks for your input. I'm sure I could do as you suggest (and maybe that will end up being my best option), but I had hoped to use a string for creating the query object, particularly as some of my queries are a bit complex. Thanks. Chris Hostetter wrote: > > > I haven't really been followi

Re: Phrase search using quotes -- special Tokenizer

2006-09-03 Thread Erick Erickson
Yeah, what he said On 9/3/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: I haven't really been following this thread, but it's gotten so long i got interested. from what i can tell skimming the discussion so far, it seems like the biggest confusion is about the definition of a "phrase" a

Re: Phrase search using quotes -- special Tokenizer

2006-09-03 Thread Chris Hostetter
I haven't really been following this thread, but it's gotten so long i got interested. from what i can tell skimming the discussion so far, it seems like the biggest confusion is about the definition of a "phrase" and what analyzers do with "quote" characters and what the QueryParser does with "q

Re: Phrase search using quotes -- special Tokenizer

2006-09-03 Thread Philip Brown
Just as you, I would PREFER not to change any of the base Lucene code -- and I imagine there is still some way to do what I want (possibly by extending some other existing class) with what is already available. Regarding point 0) -- You are right in that if I add "test phrase" to index as UN_TO

Re: Phrase search using quotes -- special Tokenizer

2006-09-03 Thread Erick Erickson
Disclaimer: Of course I'm not as familiar with your problem space as you are, so I may be way out in left field, but... I *still* think you're making waay too much work for yourself and need to examine your assumptions. 0> But when you index something UN_TOKENIZED as in your example, I don't