: Sorry for the confusion and thanks for taking the time to educate me. So, if
: I am just indexing literal values, what is the best way to do that (what
: analyzer)? Sounds like this approach, even though it works, is not the
: preferred method.
if you truly want just the literal values then
: From: Philip Brown <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Re: Phrase search using quotes -- special Tokenizer
:
:
: Here's a little sample program (borrowed some code from Erick Erickson :)).
: Whether I add as TOKENIZED or UN_
Here's a little sample program (borrowed some code from Erick Erickson :)).
Whether I add as TOKENIZED or UN_TOKENIZED seems to make no difference in
the output. Is this what you'd expect?
- Philip
package com.test;
import java.io.IOException;
import java.util.HashSet;
import java.util.regex.
Some info to help you on your journey :)
1. If you add a field as untokenized then it will not be analyzed when added
to the index. However, QueryParser will not know that this happened and will
tokenize queries on that field.
2. The solution that Hoss has explained to you is to leave the defa
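The mismatch described in point 1 can be sketched in plain Java. This is only an illustration under stated assumptions: no Lucene classes are used, and `UntokenizedMismatchSketch`, `analyze`, and `matches` are hypothetical names standing in for the indexing and query-parsing steps.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; these are not Lucene classes.
final class UntokenizedMismatchSketch {

    // Stand-in for an analyzer: lower-cases the text and splits on whitespace.
    static List<String> analyze(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // A query matches only if every query term exists verbatim in the index.
    static boolean matches(Set<String> indexedTerms, List<String> queryTerms) {
        return indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // "Untokenized" indexing: the whole value is stored as a single term.
        Set<String> index = new HashSet<>();
        index.add("test phrase");

        // Query side analyzes the text into two terms, so nothing lines up.
        System.out.println(matches(index, analyze("test phrase")));
        // A literal, unanalyzed term does line up with the stored value.
        System.out.println(matches(index, Arrays.asList("test phrase")));
    }
}
```

The point of the sketch is that the stored term and the query terms must be produced the same way; if only one side is analyzed, the lookup can miss even though the text looks identical.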
: So, if I do as you suggest below (using PerFieldAnalyzerWrapper with
: StandardAnalyzer) then I still need to enclose in quotes the phrases
: (keywords with spaces) when I issue the search, and they are only returned
Yes, quotes will be necessary to tell the QueryParser "this
is one chunk of t
So, if I do as you suggest below (using PerFieldAnalyzerWrapper with
StandardAnalyzer) then I still need to enclose in quotes the phrases
(keywords with spaces) when I issue the search, and they are only returned
in the results if the case is identical to how it was added? (This seems to
be what
: Yeah, they are more complex than the "exactish" match -- basically, there are
: more fields involved -- combined sometimes with AND and sometimes with OR,
: and sometimes negated field values, sometimes groupings, etc. These other
: field values are all single words (no spaces), and a search mi
More to consider:
perhaps there is some way to get what you want by overriding
getFieldQuery(String, String) instead. I have not been able to come up
with anything simple off the top of my head, but overriding
getFieldQuery would free you from having to make that line change on
every Lucene up
Yeah, they are more complex than the "exactish" match -- basically, there are
more fields involved -- combined sometimes with AND and sometimes with OR,
and sometimes negated field values, sometimes groupings, etc. These other
field values are all single words (no spaces), and a search might invo
Keeping in mind that Hoss's input is much more valuable than mine...
It sounds like you want what I originally gave you. You want to be able
to perform complex queries with the QueryParser and you want '-' and '_'
to not break words, and you want quoted words to be tokenized as one
token with
: Thanks for your input. I'm sure I could do as you suggest (and maybe that
: will end up being my best option), but I had hoped to use a string for
: creating the query object, particularly as some of my queries are a bit
: complex.
you have to clarify what you mean by "use a string for creatin
Thanks for your input. I'm sure I could do as you suggest (and maybe that
will end up being my best option), but I had hoped to use a string for
creating the query object, particularly as some of my queries are a bit
complex.
Thanks.
Chris Hostetter wrote:
>
>
> I haven't really been followi
Yeah, what he said
On 9/3/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
I haven't really been following this thread, but it's gotten so long
i got interested.
from what i can tell skimming the discussion so far, it seems like the
biggest confusion is about the definition of a "phrase" a
I haven't really been following this thread, but it's gotten so long
i got interested.
from what i can tell skimming the discussion so far, it seems like the
biggest confusion is about the definition of a "phrase" and what analyzers
do with "quote" characters and what the QueryParser does with "q
Just as you, I would PREFER not to change any of the base Lucene code -- and
I imagine there is still some way to do what I want (possibly by extending
some other existing class) with what is already available.
Regarding point 0) -- You are right in that if I add "test phrase" to index
as UN_TO
Disclaimer: Of course I'm not as familiar with your problem space as you
are, so I may be way out in left field, but...
I *still* think you're making waay too much work for yourself and need
to examine your assumptions.
0> But when you index something UN_TOKENIZED as in your example, I don't
I tend to agree with Mark. I tried a query as so...
TermQuery query = new TermQuery(new Term("keywordField", "phrase test"));
IndexSearcher searcher = new IndexSearcher(activeIdx);
Hits hits = searcher.search(query);
And this produced the expected results. Whe
I think if he wants to use the queryparser to parse his search strings
that he has no choice but to modify it. It will eat any pair of quotes
going through it no matter what analyzer is used.
- Mark
Well, you're flying blind. Is the behavior rooted in the indexing or
querying? Since you can't
Well, you're flying blind. Is the behavior rooted in the indexing or
querying? Since you can't answer that, you're reduced to trying random
things hoping that one of them works. A little like voodoo. I've wasted
far too much time trying to solve what I was *sure* was the problem
only to f
Well Philip...bad news. I should have thought of this before...I think
the query parser is the problem. You are tokenizing "all in the quotes" to
one token...but when QueryParser sees that, it doesn't matter what
analyzer you use, it's going to see the quotes and strip them right off.
Then it pas
I am out of ideas. If I'm feeling perky I'll build you one in the morning.
No, I've never used Luke. Is there an easy way to examine my RAMDirectory
index? I can create the index with no quoted keywords, and when I search
for a keyword, I get back the expected results (just can't search for a
p
No, I've never used Luke. Is there an easy way to examine my RAMDirectory
index? I can create the index with no quoted keywords, and when I search
for a keyword, I get back the expected results (just can't search for a
phrase that has whitespace in it). If I create the index with phrases in
quo
OK, I've gotta ask. Have you examined your index with Luke to see if what
you *think* is in the index actually *is*???
Erick
On 9/1/06, Philip Brown <[EMAIL PROTECTED]> wrote:
Interesting...just ran a test where I put double quotes around everything
(including single keywords) of source text
Interesting...just ran a test where I put double quotes around everything
(including single keywords) of source text and then ran searches for a known
keyword with and without double quotes -- doesn't find either time.
Mark Miller-5 wrote:
>
> Sorry to hear you're having trouble. You indeed nee
Added the to the other section and reran the javacc and imported the
new files...but, I still get the same result -- no results. (Quotes are in
the source text and query string.) Anything else I might be missing?
Philip
Mark Miller-5 wrote:
>
> Sorry to hear you're having trouble. You indee
Sorry to hear you're having trouble. You indeed need the double quotes in
the source text. You will also need them in the query string. Make sure they
are in both places. My machine is hosed right now or I would do it for you
real quick. My guess is that I forgot to mention...not only do you need t
Well, I tried that, and it doesn't seem to work still. I would be happy to
zip up the new files, so you can see what I'm using -- maybe you can get it
to work. The first time, I tried building the documents without quotes
surrounding each phrase. Then, I retried by enclosing every phrase within
That is a good point. I was just thinking that it would be a pain for
searchers to have to include the quotes when searching, but I guess
there is little way around it. The best you could do is have an option
that specified a quoted search...and you might as well make that option
be to put the
Thanks, but I don't "think" I need that. But curious, how will it know it's
a phrase if it's not enclosed in quotes? Won't all its terms be treated
separately then?
Philip
Mark Miller-5 wrote:
>
> One more tip...if you would like to be able to search phrases without
> putting in the quotes
One more tip...if you would like to be able to search phrases without
putting in the quotes you must strip them with the analyzer. In
standardfilter (in the standard analyzer code) add this:
private static final String QUOTED_TYPE = tokenImage[QUOTED];
- you'll see where to put that
and you'll s
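The quote-stripping idea in this tip can be sketched in plain Java. This is a hedged illustration, not the StandardFilter change itself: `QuoteStripSketch` and `strip` are hypothetical, and the `"<QUOTED>"` label is an assumed stand-in for whatever `tokenImage[QUOTED]` evaluates to in the modified grammar.

```java
// Plain-Java illustration only; this is not the Lucene StandardFilter.
final class QuoteStripSketch {

    // Assumed type label; in the thread the real value comes from tokenImage[QUOTED].
    static final String QUOTED_TYPE = "<QUOTED>";

    // Removes one pair of surrounding double quotes, but only from tokens
    // whose type marks them as quoted; every other token passes through.
    static String strip(String text, String type) {
        boolean quoted = QUOTED_TYPE.equals(type)
                && text.length() >= 2
                && text.startsWith("\"")
                && text.endsWith("\"");
        return quoted ? text.substring(1, text.length() - 1) : text;
    }

    public static void main(String[] args) {
        System.out.println(strip("\"test phrase\"", QUOTED_TYPE));
        System.out.println(strip("keyword", "<ALPHANUM>"));
    }
}
```

If the quotes are stripped at index time in this way, the stored term is the bare phrase, so the query side has to produce the same bare phrase as a single term for a match.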
So this will recognize anything in quotes as a single token and '_' and
'-' will not break up words. There may be some repercussions for the NUM
token but nothing I'd worry about. Maybe you want to use Unicode for '-'
and '_' as well...I wouldn't worry about it myself.
- Mark
TOKEN : {
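The tokenizing behavior described in this message (anything in quotes is one token, and '-' and '_' do not break words) can be sketched in plain Java. This is not the JavaCC grammar from the thread; `QuotedTokenizerSketch` is a hypothetical class, and details such as skipping an unmatched quote are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is not the StandardTokenizer grammar.
final class QuotedTokenizerSketch {

    // Letters, digits, '-' and '_' all count as part of a word.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '_';
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = input.length();
        while (i < n) {
            char c = input.charAt(i);
            if (c == '"') {
                int close = input.indexOf('"', i + 1);
                if (close < 0) {
                    i++; // assumption: an unmatched quote is simply skipped
                    continue;
                }
                // Keep the quoted text, quotes included, as a single token.
                tokens.add(input.substring(i, close + 1));
                i = close + 1;
            } else if (isWordChar(c)) {
                int start = i;
                while (i < n && isWordChar(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            } else {
                i++; // whitespace and other punctuation separate tokens
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo-bar \"test phrase\" baz_qux"));
    }
}
```

Here the quotes are kept on the token, which is why the thread says they must appear in both the source text and the query string unless a later filter strips them.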
Philip Brown wrote:
Do you mean StandardTokenizer.jj (org.apache.lucene.analysis.standard)? I'm
not seeing StandardAnalyzer.jj in the Lucene source download.
Mark Miller-5 wrote:
Philip Brow
Do you mean StandardTokenizer.jj (org.apache.lucene.analysis.standard)? I'm
not seeing StandardAnalyzer.jj in the Lucene source download.
Mark Miller-5 wrote:
>
> Philip Brown wrote:
>> Hi,
>>
>
Philip Brown wrote:
Hi,
After running some tests using the StandardAnalyzer, and getting 0 results
from the search, I believe I need a special Tokenizer/Analyzer. Does
anybody have something that parses like the following:
- doesn't parse apart phrases (in quotes)
- doesn't parse/separate hyph