Re: Searching Tokenized x Un_tokenized

Otis Gospodnetic Tue, 12 Aug 2008 21:24:38 -0700

Perhaps you can lowercase the text before passing it to Lucene?
Or you could use a custom Analyzer that treats the whole input as one Token
(see KeywordAnalyzer,
http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/KeywordAnalyzer.html
) and also applies a LowerCaseFilter to that one Token.
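
To illustrate the idea, here is a minimal plain-Java sketch. It is a model,
not Lucene code: the class SingleTermFieldModel and its methods are
hypothetical names made up for this example. It keeps each field value as one
whole term (roughly what an un_tokenized field or KeywordAnalyzer gives you)
and shows how lowercasing at index time, at query time, or both changes the
result of a prefix search such as 'us*'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java model (NOT Lucene code): each field value is
// indexed as one whole term, as with an un_tokenized field.
class SingleTermFieldModel {
    // Indexed term -> original values that produced that term.
    private final Map<String, List<String>> index = new TreeMap<>();
    private final boolean lowercaseAtIndexTime;

    SingleTermFieldModel(boolean lowercaseAtIndexTime) {
        this.lowercaseAtIndexTime = lowercaseAtIndexTime;
    }

    // Index the whole value as a single term, optionally lowercased first.
    void add(String value) {
        String term = lowercaseAtIndexTime ? value.toLowerCase(Locale.ROOT) : value;
        index.computeIfAbsent(term, k -> new ArrayList<>()).add(value);
    }

    // Case-sensitive prefix match against the indexed terms (like 'us*').
    // lowercaseQuery models lowercasing the wildcard term before matching.
    List<String> prefixSearch(String prefix, boolean lowercaseQuery) {
        String p = lowercaseQuery ? prefix.toLowerCase(Locale.ROOT) : prefix;
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : index.entrySet()) {
            if (e.getKey().startsWith(p)) {
                hits.addAll(e.getValue());
            }
        }
        return hits;
    }
}
```

With "USA" and "usa" added to a model built with lowercaseAtIndexTime=false,
prefixSearch("us", true) returns only "usa", which matches the behaviour
reported in this thread. With lowercaseAtIndexTime=true, prefixSearch("US",
true) returns both values. In Lucene, the index-time half would presumably be
the custom Analyzer described above and the query-time half would be
parser.setLowercaseExpandedTerms(true); that mapping is an assumption of this
sketch, not something verified against Lucene here.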


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Andre Rubin <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Wednesday, August 13, 2008 12:15:25 AM
> Subject: Re: Searching Tokenized x Un_tokenized
> 
> Thanks Otis, that was exactly what was happening.
> 
> 1) According to here:
> http://wiki.apache.org/lucene-java/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a
> wildcard queries are not passed through the Analyzer, but they are
> lowercased by default.
> 
> 2) And according to here:
> http://wiki.apache.org/lucene-java/LuceneFAQ#head-0f374b0fe1483c90fe7d6f2c44472d10961ba63c
> un_tokenized fields are not passed through the Analyzer either.
> 
> So by creating an untokenized field and setting
> parser.setLowercaseExpandedTerms(false), I managed to make my use case
> work in a case-sensitive manner. That is, 'u*' returns 'usa' and 'U*'
> returns 'USA'.
> 
> The thing is, how do I make this case-insensitive? I can make #1 work by
> setting it to lowercase: parser.setLowercaseExpandedTerms(true). But
> how do I make #2 work, that is, apply a LowerCaseFilter to an untokenized
> field?
> 
> Thanks,
> 
> 
> Andre
> 
> On Tue, Aug 12, 2008 at 7:57 PM, Otis Gospodnetic
> wrote:
> > Andre,
> >
> > Check the Lucene FAQ, there is an entry about wildcards and analysis (which
> > doesn't take place for wildcard queries).  Could that be it?
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > ----- Original Message ----
> >> From: Andre Rubin 
> >> To: java-user@lucene.apache.org
> >> Sent: Tuesday, August 12, 2008 5:30:47 PM
> >> Subject: Re: Searching Tokenized x Un_tokenized
> >>
> >> Searches on my tokenized String field were working properly. I
> >> switched the field to un_tokenized, rebuilt the index, and now my
> >> searches only return strings that match the query string in lower
> >> case.
> >>
> >> For example, searching for 'us*':
> >>
> >> The tokenized field version would find 'USA' and 'usa'
> >>
> >> The untokenized field version only finds 'usa'
> >>
> >> I'm using the StandardAnalyzer in both cases.
> >>
> >> Thanks
> >>
> >>
> >> Andre
> >>
> >> On Thu, Aug 7, 2008 at 8:16 PM, Otis Gospodnetic
> >> wrote:
> >> > Hi,
> >> >
> >> > Perhaps you can give some examples.  Yes, untokenized means "full
> >> > string" - it requires an "exact match".
> >> >
> >> > Otis
> >> > --
> >> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >> >
> >> >
> >> >
> >> > ----- Original Message ----
> >> >> From: Andre Rubin
> >> >> To: java-user@lucene.apache.org
> >> >> Sent: Thursday, August 7, 2008 8:04:04 PM
> >> >> Subject: Searching Tokenized x Un_tokenized
> >> >>
> >> >> Hi all,
> >> >>
> >> >> When I switched a String field from tokenized to untokenized, some
> >> >> searches stopped returning obvious matches. Am I missing something
> >> >> about querying untokenized fields? Also, do I need an Analyzer if my
> >> >> search is on an untokenized field? Wouldn't the search be based on
> >> >> the full String rather than its tokens?
> >> >>
> >> >> Thanks,
> >> >>
> >> >>
> >> >> Andre
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> >> For additional commands, e-mail: [EMAIL PROTECTED]
