Re: Searching Tokenized x Un_tokenized

Andre Rubin Wed, 13 Aug 2008 12:54:23 -0700

Thanks Otis,

I created a custom analyzer and it's working fine.


Here's my analyzer, for reference:

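import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

// Imports assume the Lucene 2.3 package layout.
public class KeywordLowerAnalyzer extends Analyzer {

    public KeywordLowerAnalyzer() {
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Emit the whole input as a single token...
        TokenStream result = new KeywordTokenizer(reader);
        // ...and lowercase that one token.
        result = new LowerCaseFilter(result);
        return result;
    }

}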
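
To illustrate why lowercasing on the index side matters for a search such as
us*, here is a minimal plain-Java sketch. It does not use Lucene at all: the
class and method names are made up for illustration, and the startsWith check
is only an assumption standing in for Lucene's case-sensitive term matching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java simulation, not Lucene code. All names are hypothetical.
class KeywordLowerSketch {

    // Stands in for KeywordTokenizer + LowerCaseFilter: the whole value
    // stays one term and is lowercased.
    static String analyze(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Stands in for a prefix query: a case-sensitive comparison against
    // the terms exactly as they were indexed.
    static List<String> prefixSearch(List<String> indexedTerms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("USA", "usa", "United States");

        // Untokenized field: values are indexed verbatim while the query
        // prefix is lowercased, so 'USA' is missed.
        System.out.println(prefixSearch(values, analyze("US")));   // [usa]

        // Lowercasing on the index side as well: both values match.
        List<String> lowered = new ArrayList<>();
        for (String v : values) {
            lowered.add(analyze(v));
        }
        System.out.println(prefixSearch(lowered, analyze("US")));  // [usa, usa]
    }
}
```

The point of the sketch is only that the indexed term and the query term have
to be normalized the same way; the real matching is of course done by Lucene.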

Cheers


Andre

On Tue, Aug 12, 2008 at 9:22 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> Perhaps you can lowercase the text prior to passing it to Lucene?
> Or perhaps you can have a custom Analyzer that treats the whole input as 1 
> Token (see KeywordAnalyzer -- 
> http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/KeywordAnalyzer.html
>  ), but also includes LowerCaseFilter that's applied to that 1 Token.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: Andre Rubin <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, August 13, 2008 12:15:25 AM
>> Subject: Re: Searching Tokenized x Un_tokenized
>>
>> Thanks Otis, that was exactly what was happening.
>>
>> 1) According to here:
>> http://wiki.apache.org/lucene-java/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a
>> wildcard queries are not passed through the Analyzer, but they are
>> set to lower case by default.
>>
>> 2) And according to here:
>> http://wiki.apache.org/lucene-java/LuceneFAQ#head-0f374b0fe1483c90fe7d6f2c44472d10961ba63c
>> un_tokenized fields are not passed through the Analyzer either.
>>
>> So by creating an untokenized field and setting
>> parser.setLowercaseExpandedTerms(false), I managed to make my use case
>> work in a case-sensitive manner. That is, 'u*' returns 'usa' and 'U*'
>> returns 'USA'.
>>
>> The thing is, how do I make this case-insensitive? I can make #1 work by
>> setting it to lowercase: parser.setLowercaseExpandedTerms(true). But
>> how do I make #2 work, that is, apply a LowerCaseFilter to an untokenized
>> field?
>>
>> Thanks,
>>
>>
>> Andre
>>
>> On Tue, Aug 12, 2008 at 7:57 PM, Otis Gospodnetic
>> wrote:
>> > Andre,
>> >
>> > Check the Lucene FAQ; there is an entry about wildcards and analysis (which
>> > doesn't take place for wildcard queries).  Could that be it?
>> >
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> >
>> >
>> > ----- Original Message ----
>> >> From: Andre Rubin
>> >> To: java-user@lucene.apache.org
>> >> Sent: Tuesday, August 12, 2008 5:30:47 PM
>> >> Subject: Re: Searching Tokenized x Un_tokenized
>> >>
>> >> Searches on my tokenized String field were working properly. I
>> >> switched the field to un_tokenized, rebuilt the index, and now my
>> >> searches only return strings that match the query string in lower
>> >> case.
>> >>
>> >> For example, searching for 'us*':
>> >>
>> >> The tokenized field version would find 'USA' and 'usa'
>> >>
>> >> The untokenized field version only finds 'usa'
>> >>
>> >> I'm using the StandardAnalyzer in both cases.
>> >>
>> >> Thanks
>> >>
>> >>
>> >> Andre
>> >>
>> >> On Thu, Aug 7, 2008 at 8:16 PM, Otis Gospodnetic
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > Perhaps you can give some examples.  Yes, untokenized means "full
>> >> > string" - it requires an "exact match".
>> >> >
>> >> > Otis
>> >> > --
>> >> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >> >
>> >> >
>> >> >
>> >> > ----- Original Message ----
>> >> >> From: Andre Rubin
>> >> >> To: java-user@lucene.apache.org
>> >> >> Sent: Thursday, August 7, 2008 8:04:04 PM
>> >> >> Subject: Searching Tokenized x Un_tokenized
>> >> >>
>> >> >> Hi all,
>> >> >>
>> >> >> When I switched a String field from tokenized to untokenized, some
>> >> >> searches stopped returning some obvious values. Am I missing
>> >> >> something about querying untokenized fields? Also, do I need an
>> >> >> Analyzer if my search is on an untokenized field? Wouldn't the
>> >> >> search be based on the full String rather than its tokens?
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >>
>> >> >> Andre
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >> >> For additional commands, e-mail: [EMAIL PROTECTED]
