Dino, you lost me half-way through your email :(

NO_NORMS does not mean the field is not tokenized.
UN_TOKENIZED does mean the field is not tokenized.
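
A minimal sketch using the Field.Index constants you mention (the field names and
values below are just illustrative), showing that a TOKENIZED field with
setOmitNorms(true) still goes through the analyzer, while an UN_TOKENIZED field
does not:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();

    // TOKENIZED + omitNorms: the value still goes through the IndexWriter's
    // analyzer (StandardAnalyzer lower-cases it); only the norms are dropped.
    Field body = new Field("body", "Some Mixed Case Text",
                           Field.Store.YES, Field.Index.TOKENIZED);
    body.setOmitNorms(true);
    doc.add(body);

    // UN_TOKENIZED: the value is indexed verbatim as a single term, bypassing
    // the analyzer, so "User@Example.COM" keeps its original case.
    doc.add(new Field("email", "User@Example.COM",
                      Field.Store.YES, Field.Index.UN_TOKENIZED));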


Otis--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Dino Korah <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Tuesday, August 26, 2008 9:17:49 AM
> Subject: RE: Case Sensitivity
> 
> I think I should rephrase my question.
> 
> [ Context: Using the out-of-the-box StandardAnalyzer for indexing and searching. ]
> 
> Is it right to say that a field that is either UN_TOKENIZED or NO_NORMS-ized
> (field.setOmitNorms(true)) doesn't get analyzed while indexing?
> Which means that when we search, the query still goes through the analyzer, and
> we need to handle those fields differently in the analyzer we use for searching?
> Doesn't it mean that a setOmitNorms(true) field also doesn't get tokenized?
> 
> What is the best solution if one were to add some fields UN_TOKENIZED and
> others TOKENIZED, with a few of the latter set using setOmitNorms(true) (the
> index writer is plain StandardAnalyzer based)? A per-field analyzer at query
> time?!
> 
> Many thanks,
> Dino
> 
> 
> -----Original Message-----
> From: Dino Korah [mailto:[EMAIL PROTECTED] 
> Sent: 26 August 2008 12:12
> To: 'java-user@lucene.apache.org'
> Subject: RE: Case Sensitivity
> 
> A few more case-sensitivity questions.
> 
> Based on the discussion at http://markmail.org/message/q7dqr4r7o6t6dgo5 and
> on this thread, is it right to say that a field that is either UN_TOKENIZED or
> NO_NORMS-ized doesn't get analyzed while indexing? Which means we need to
> case-normalize (down-case) those fields beforehand?
> 
> Does it mean that, if I can afford it, I should use norms?
> 
> Many thanks,
> Dino
> 
> 
> 
> -----Original Message-----
> From: Steven A Rowe [mailto:[EMAIL PROTECTED]
> Sent: 19 August 2008 17:43
> To: java-user@lucene.apache.org
> Subject: RE: Case Sensitivity
> 
> Hi Dino,
> 
> I think you'd benefit from reading some FAQ answers, like:
> 
> "Why is it important to use the same analyzer type during indexing and
> search?"
> 
> 4472d10961ba63c>
> 
> Also, have a look at the AnalysisParalysis wiki page for some hints:
> 
> 
> On 08/19/2008 at 8:57 AM, Dino Korah wrote:
> > From the discussion here what I could understand was, if I am using 
> > StandardAnalyzer on TOKENIZED fields, for both Indexing and Querying, 
> > I shouldn't have any problems with cases.
> 
> If by "shouldn't have problems with cases" you mean "can match
> case-insensitively", then this is true.
> 
> > But if I have any UN_TOKENIZED fields there will be problems if I do 
> > not case-normalize them myself before adding them as a field to the 
> > document.
> 
> Again, assuming that by "case-normalize" you mean "downcase", and that you
> want case-insensitive matching, and that you use the StandardAnalyzer (or
> some other downcasing analyzer) at query-time, then this is true.
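> 
> For example, just a sketch of the pre-indexing downcase (the field name and
> value are made up):
> 
>     // Downcase the raw value yourself before adding it UN_TOKENIZED,
>     // so that a lower-cased query term can match it exactly.
>     String email = "User@Example.COM".toLowerCase();
>     doc.add(new Field("email", email, Field.Store.YES,
>                       Field.Index.UN_TOKENIZED));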
> 
> > In my case I have a mixed scenario. I am indexing emails and the email 
> > addresses are indexed UN_TOKENIZED. I do have a second set of custom 
> > tokenized field, which keep the tokens in individual fields with same 
> > name.
> [...]
> > Does it mean that where ever I use UN_TOKENIZED, they do not get 
> > through the StandardAnalyzer before getting Indexed, but they do when 
> > they are searched on?
> 
> This is true.
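> 
> To make that concrete, a rough sketch of the mismatch (the parsed query
> described in the comment is approximate):
> 
>     // Index-time: UN_TOKENIZED stores "User@Example.COM" as a single,
>     // case-preserved term.  Query-time: QueryParser runs the input through
>     // StandardAnalyzer, which lower-cases it, so the resulting term
>     // "user@example.com" no longer matches the indexed term.
>     QueryParser parser = new QueryParser("email", new StandardAnalyzer());
>     Query q = parser.parse("User@Example.COM");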
> 
> > If that is the case, Do I need to normalise them before adding to 
> > document?
> 
> If you want case-insensitive matching, then yes, you do need to normalize
> them before adding them to the document.
> 
> > I also would like to know if it is better to employ an EmailAnalyzer 
> > that makes a TokenStream out of the given email address, rather than 
> > using a simplistic function that gives me a list of string pieces and 
> > adding them one by one. With searches, would both the approaches give 
> > same result?
> 
> Yes, both approaches give the same result.  When you add string pieces
> one-by-one, you are adding multiple same-named fields. By contrast, the
> EmailAnalyzer approach would add a single field, and would allow you to
> control positions (via Token.setPositionIncrement():
> 
> ml#setPositionIncrement(int)>), e.g. to improve phrase handling.  Also, if
> you make up an EmailAnalyzer, you can use it to search against your
> tokenized email field, along with other analyzer(s) on other field(s), using
> the PerFieldAnalyzerWrapper
> 
> AnalyzerWrapper.html>.
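> 
> Roughly (a sketch only; EmailAnalyzer is your own class, KeywordAnalyzer just
> stands in for the un-tokenized email field, and the field names are made up):
> 
>     PerFieldAnalyzerWrapper wrapper =
>         new PerFieldAnalyzerWrapper(new StandardAnalyzer());
>     wrapper.addAnalyzer("email", new KeywordAnalyzer());      // exact, un-tokenized field
>     wrapper.addAnalyzer("emailTokens", new EmailAnalyzer());  // your tokenized email field
>     // then hand "wrapper" to your QueryParser (and/or IndexWriter)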
> 
> Steve
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
