Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

2007-05-29 Thread Steven Rowe
Hi Mohammad, Mohammad Norouzi wrote: > [Hoss wrote:] >> ...are there Persian characters with a category type of SPACE_SEPARATOR, >> LINE_SEPARATOR, or PARAGRAPH_SEPARATOR ? > > How can I know that? The Unicode standard's codes[1] for these are: SPACE SEPARATOR: Zs LINE SEPARATOR: Zl PA
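
One way to answer the "How can I know that?" question empirically is to ask the JDK for each character's general category. The sketch below is not from the thread: the class and method names are hypothetical, and it assumes that scanning the Arabic block (U+0600 to U+06FF, which covers the Persian letters) is a reasonable first check. Java's `Character.SPACE_SEPARATOR`, `LINE_SEPARATOR`, and `PARAGRAPH_SEPARATOR` constants correspond to the Unicode codes Zs, Zl, and Zp.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class SeparatorScan {
    // True if c falls in Unicode general category Zs, Zl, or Zp.
    static boolean isSeparator(char c) {
        int type = Character.getType(c);
        return type == Character.SPACE_SEPARATOR
            || type == Character.LINE_SEPARATOR
            || type == Character.PARAGRAPH_SEPARATOR;
    }

    // Collect every separator character in the inclusive range [first, last].
    static List<Character> separatorsIn(char first, char last) {
        List<Character> found = new ArrayList<>();
        // Loop on an int so the counter cannot wrap around at the top of the char range.
        for (int c = first; c <= last; c++) {
            if (isSeparator((char) c)) {
                found.add((char) c);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumption: the Arabic block is the range of interest for Persian text.
        List<Character> found = separatorsIn('\u0600', '\u06FF');
        System.out.println("separators found: " + found.size());
        for (char c : found) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The same check can be pointed at any other range or at individual characters that appear in the indexed text, for example the zero-width non-joiner (U+200C) that Persian text often contains.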

Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

2007-05-28 Thread Mohammad Norouzi
Hi Chris, * It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F'). * It is '\u0009', HORIZONTAL TABULATION. * It is '\u000A', LINE FEED. * It is '\u000B', VERTICAL

Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

2007-05-24 Thread Chris Hostetter
: return !Character.isWhitespace(c); : And my class override that method as this: : return !((int)c==32); In my opinion that's a pretty naive change: it won't split on tab characters or newlines. Even for trivial ASCII text that's probably not what you want. : I think the Charact
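
The difference between the two `isTokenChar` rules quoted above can be seen without Lucene. This is a minimal sketch in plain Java, not Lucene's actual tokenizer: the `tokenize` helper is hypothetical and only mimics the idea of collecting runs of characters that a predicate accepts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-in for a character-based tokenizer, for illustration only.
class TokenCharDemo {
    // Collect maximal runs of characters for which isTokenChar returns true.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar.test(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "one two\tthree\nfour";
        // Rule from the original class: anything that is not whitespace is a token character.
        System.out.println(tokenize(text, c -> !Character.isWhitespace(c)).size());
        // Rule from the override: only the ASCII space (code 32) separates tokens.
        System.out.println(tokenize(text, c -> c != 32).size());
    }
}
```

With the first rule the sample text splits into four tokens; with the second it splits into two, because the tab and the newline stay inside the second token.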

Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

2007-05-23 Thread Mohammad Norouzi
Sorry Steven, that change is in WhitespaceTokenizer, not WhitespaceAnalyzer, but in the Analyzer I had to call the tokenizer. On 5/24/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote: Hi Steven, Thank you so much for your thorough comments about Analyzer. I wrote that class a couple of months ago; now I

Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

2007-05-23 Thread Mohammad Norouzi
Hi Steven, Thank you so much for your thorough comments about Analyzer. I wrote that class a couple of months ago; now I take a look at my customized Analyzer, and the only change I've made is as follows. The original class has this method: protected boolean isTokenChar(char c) { return !Character.isW

WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

2007-05-23 Thread Steven Rowe
Hi Mohammad, WhitespaceAnalyzer uses Java's Character.isWhitespace(char) method to determine whether or not a character should be part of a token. As far as I know, this method is problematic only for characters outside of the Basic Multilingual Plane (BMP). I think Lucene should switch to using
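
As background for the remark about the Basic Multilingual Plane: a Java `char` is a single UTF-16 code unit, so a character outside the BMP reaches a `char`-based method as two separate surrogates. The sketch below uses only standard JDK methods (`String.codePointAt`, `Character.charCount`, and the `int` overload of `Character.isWhitespace`) to walk a string by code point instead. The class and method names are hypothetical, and this is not a claim about what the truncated message goes on to propose.

```java
// Hypothetical helper class, for illustration only.
class CodePointDemo {
    // Count whitespace characters by code point rather than by UTF-16 code unit.
    static int countWhitespace(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                count++;
            }
            // Advance by 1 for BMP characters and by 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP and is written as a surrogate pair.
        String s = "a \uD834\uDD1E b";
        System.out.println("UTF-16 code units: " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("whitespace: " + countWhitespace(s));
    }
}
```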

Re: regaridng Reader.terms()

2007-05-23 Thread Mohammad Norouzi
Wow, very nice comments. Thank you so much, Erick. You really showed me the way. -- Regards, Mohammad -- see my blog: http://brainable.blogspot.com/

Re: regaridng Reader.terms()

2007-05-23 Thread Erick Erickson
You may have to index things twice, once for searching and once UN_TOKENIZED for display. Say you have a bunch of service names you want to display: service one, service two, service three. If you use WhitespaceAnalyzer, TOKENIZED, you index the tokens service (note, there are three of these) one two
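
The effect described above can be simulated in plain Java. This is a rough sketch that does not use the Lucene API: the two hypothetical methods only model which distinct terms would end up in the term dictionary when the same values are split on whitespace versus kept whole, using the three service names from the message as sample data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical model of a term dictionary, for illustration only.
class TermDictionaryDemo {
    // Models a whitespace-tokenized field: each word becomes its own term.
    static SortedSet<String> tokenizedTerms(List<String> values) {
        SortedSet<String> terms = new TreeSet<>();
        for (String value : values) {
            for (String token : value.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    terms.add(token);
                }
            }
        }
        return terms;
    }

    // Models an untokenized field: each whole value is a single term.
    static SortedSet<String> untokenizedTerms(List<String> values) {
        return new TreeSet<>(values);
    }

    public static void main(String[] args) {
        List<String> services = Arrays.asList("service one", "service two", "service three");
        System.out.println(tokenizedTerms(services));
        System.out.println(untokenizedTerms(services));
    }
}
```

Enumerating the first set yields single words, which suits searching; enumerating the second yields the full service names, which is what a display list such as a combo box needs.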

Re: regaridng Reader.terms()

2007-05-22 Thread Mohammad Norouzi
Hi Walter, let me explain my problem in detail. I have a web page that lets the user create his own query simply. For example, a user wants to locate a service with a specific value, but he/she doesn't know the exact name of the service, so I have to provide a list of available services (say in a combo box) and

Re: regaridng Reader.terms()

2007-05-22 Thread Mohammad Norouzi
Hi Steve, No, I didn't make any changes to WhitespaceAnalyzer itself. I just extend the original classes and override methods with my changes, so I don't think I need to contribute my classes. My language is Persian, and the only change I've made is not to ignore Unicode characters in Persi

Re: regaridng Reader.terms()

2007-05-22 Thread Steven Rowe
Hi Mohammad, May I ask what your language is? And what kind of changes to WhitespaceAnalyzer were required to make it work with your language? If you have made modifications to WhitespaceAnalyzer that are generally useful, please consider contributing your changes back to the Lucene project. Th

Re: regaridng Reader.terms()

2007-05-22 Thread Grant Ingersoll
You have to turn on term vectors when indexing. Take a look at the Field constructor that passes in TermVector. -Grant On May 22, 2007, at 8:09 AM, Mohammad Norouzi wrote: I would use a term vector to get this. See IndexReader.getTermFreqVector. You can get the term vector for just field

Re: regaridng Reader.terms()

2007-05-22 Thread Mohammad Norouzi
I would use a term vector to get this. See IndexReader.getTermFreqVector. You can get the term vector for just field 3. Grant, thanks. In my case, getTermFreqVector returns null. I don't know why it accepts a docnumber as a parameter. What is it? Is that the same as the doc id? If yes, it restricts the r

Re: regaridng Reader.terms()

2007-05-22 Thread Grant Ingersoll
I would use a term vector to get this. See IndexReader.getTermFreqVector. You can get the term vector for just field 3. -Grant On May 22, 2007, at 5:29 AM, Mohammad Norouzi wrote: Hi all consider following index field1 field2 field3 text1

Re: regaridng Reader.terms()

2007-05-22 Thread Walter Ferrara
Let's suppose you modify your WhitespaceAnalyzer not to use a WhitespaceTokenizer, but a modified version of the Tokenizer that tokenizes not by space but by something else, like '/' (this is just an example, of course). So suppose your real text document contains: /text2 text3/text4 text5/text6 Wh
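
The '/' idea above can be tried out in plain Java before touching any Lucene classes. This is only a sketch under that assumption: the `tokenize` helper is hypothetical, takes the delimiter as a parameter, and is applied to the sample text from the message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical delimiter-based tokenizer, for illustration only.
class SlashTokenizerDemo {
    // Split text into tokens at every occurrence of delimiter, skipping empty tokens.
    static List<String> tokenize(String text, char delimiter) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != delimiter) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "/text2 text3/text4 text5/text6";
        // Each token keeps its internal spaces because only '/' ends a token.
        for (String token : tokenize(text, '/')) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Passing `' '` as the delimiter instead shows the contrast: the same text then breaks at the spaces and the slashes stay inside the tokens.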

Re: regaridng Reader.terms()

2007-05-22 Thread Mohammad Norouzi
Walter, Yes, I am using a customized WhitespaceAnalyzer while indexing. I say customized because I realized that the standard WhitespaceAnalyzer doesn't accept Unicode terms in my language, so I made some changes to support that. But for reading, no Analyzer is used. If I want to get that result, which ana

Re: regaridng Reader.terms()

2007-05-22 Thread Walter Ferrara
If Reader.terms() gives you: text3 text4 while you expect text3 text4 you should change, I presume, the Analyzer, maybe writing your own one. Mohammad Norouzi wrote: > Hi all > > consider following index > > field1 field2 field3 > text1 text1 text2

regaridng Reader.terms()

2007-05-22 Thread Mohammad Norouzi
Hi all, consider the following index: field1 field2 field3 text1 text1 text2 text3 text4 text4 text2 text2 text3 text5 I want to get all terms in field3. If I use Reader.terms(), it will return