: In the default StandardAnalyzer, the stop word list contains the word "on".
: If I have a document which contains the phrase "Disney on Ice", the index
: will show only "Disney" and "Ice", but not "on".
: "Disney on Ice"
:
: With the quotations indicating the desire for an "exact match", the abs
Hey,
Just a quick addendum to this original issue.
My first need was to ensure that stop words were not stored in the index
(which your helpful suggestion led me to confirm); however, this has raised a
second, scarier issue.
It seems that because stop words are excluded from the index, quoted
s
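The behaviour described in the two messages above can be illustrated with a small sketch. This is plain Python, not Lucene's API: the `analyze` function, its regex tokenizer, and the abbreviated `STOP_WORDS` set are all assumptions made for illustration, standing in for whatever the real analyzer does.

```python
import re

# Hypothetical, abbreviated stop list; NOT Lucene's actual default set.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the"}

def analyze(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]

# Terms that would reach the index for the example document.
indexed_terms = analyze("Disney on Ice")
print(indexed_terms)  # ['disney', 'ice']

# If a quoted query string goes through the same analysis, it is
# reduced to the same two terms; "on" is gone on both sides.
query_terms = analyze('"Disney on Ice"')
print(query_terms == indexed_terms)  # True
```

In this sketch the word "on" never reaches the index, so nothing stored there can tell "Disney on Ice" apart from other text that reduces to the same two terms.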
: This is a text document written by someone. Read this and post your comments
:
: words that must be indexed:
: text
: document
...
: text document
: document written
typically when people talk about indexing n-grams -- they mean
character-wise (so they can find words with simple spellin
Unfortunately the term search at the site is down - gives 500 internal
server error.
-Original Message-
From: Dave Kor [mailto:[EMAIL PROTECTED]
Sent: Sunday, September 03, 2006 9:22 PM
To: java-user@lucene.apache.org
Subject: Re: word frequency list?
There is the Berkeley Web Term Frequ
There is the Berkeley Web Term Frequency database which contains over
30 million unique terms extracted from 50 million webpages.
http://elib.cs.berkeley.edu/docfreq/index.html
On 8/31/06, Jason Pump <[EMAIL PROTECTED]> wrote:
Is there a large list of words and their frequency in the English
la
I need to index bigrams and trigrams in a document. Here is an example:
Text:
This is a text document written by someone. Read this and post your comments
words that must be indexed:
text
document
written
someone
read
post
your
comments
text document
document written
post your
your comments
text
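One way to read the example above is that n-grams are built only from adjacent words that survive stop-word removal, and never across a sentence boundary. A minimal sketch of that reading follows, in plain Python rather than Lucene. The function name `word_ngrams` is hypothetical, and the `STOP_WORDS` set is an assumption inferred only from which words the example leaves out.

```python
import re

# Assumed stop list, inferred from the words the example omits.
STOP_WORDS = {"this", "is", "a", "by", "and"}

def word_ngrams(text, n, stop_words=STOP_WORDS):
    """Return word n-grams built from runs of adjacent non-stop words.

    A stop word or a sentence end closes the current run, so no
    n-gram spans either of them.
    """
    grams = []
    for sentence in re.split(r"[.!?]+", text):
        run = []
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        # The trailing None is a sentinel that flushes the last run.
        for word in words + [None]:
            if word is None or word in stop_words:
                grams.extend(
                    " ".join(run[i:i + n])
                    for i in range(len(run) - n + 1)
                )
                run = []
            else:
                run.append(word)
    return grams

text = ("This is a text document written by someone. "
        "Read this and post your comments")
print(word_ngrams(text, 1))
print(word_ngrams(text, 2))
# ['text document', 'document written', 'post your', 'your comments']
```

With these assumptions the unigram and bigram output matches the lists in the message; calling `word_ngrams(text, 3)` extends the same idea to trigrams.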
Thanks. I think I grasp the concept now :)
On 8/27/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:
On Aug 26, 2006, at 5:11 AM, KEGan wrote:
> Erik,
>
> "Given the position increment gap between instances of same-named
> fields that is now part of Lucene, I recommend using multiple field
> instanc
: Thanks for your input. I'm sure I could do as you suggest (and maybe that
: will end up being my best option), but I had hoped to use a string for
: creating the query object, particularly as some of my queries are a bit
: complex.
you have to clarify what you mean by "use a string for creatin
Thanks for your input. I'm sure I could do as you suggest (and maybe that
will end up being my best option), but I had hoped to use a string for
creating the query object, particularly as some of my queries are a bit
complex.
Thanks.
Chris Hostetter wrote:
>
>
> I haven't really been followi
Yeah, what he said
On 9/3/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
I haven't really been following this thread, but it's gotten so long
i got interested.
from what i can tell skimming the discussion so far, it seems like the
biggest confusion is about the definition of a "phrase" a
I haven't really been following this thread, but it's gotten so long
i got interested.
from what i can tell skimming the discussion so far, it seems like the
biggest confusion is about the definition of a "phrase" and what analyzers
do with "quote" characters and what the QueryParser does with "q
Just as you, I would PREFER not to change any of the base Lucene code -- and
I imagine there is still some way to do what I want (possibly by extending
some other existing class) with what is already available.
Regarding point 0) -- You are right in that if I add "test phrase" to index
as UN_TO
Disclaimer: Of course I'm not as familiar with your problem space as you
are, so I may be way out in left field, but...
I *still* think you're making waay too much work for yourself and need
to examine your assumptions.
0> But when you index something UN_TOKENIZED as in your example, I don't
13 matches