Hi Mohammad,
Mohammad Norouzi wrote:
> [Hoss wrote:]
>> ...are there Persian characters with a category type of SPACE_SEPARATOR,
>> LINE_SEPARATOR, or PARAGRAPH_SEPARATOR ?
>
> How can I know that?
The Unicode standard's codes[1] for these are:
SPACE SEPARATOR: Zs
LINE SEPARATOR: Zl
PA
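One way to answer the "How can I know that?" question from Java is to ask the JDK for each character's general category. This is a minimal sketch, assuming the Arabic block (U+0600 to U+06FF) is the range of interest for Persian text; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: lists code points whose Unicode
// general category is Zs, Zl, or Zp.
final class SeparatorScan {
    // True when the code point is a space, line, or paragraph separator.
    static boolean isSeparator(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.SPACE_SEPARATOR
            || type == Character.LINE_SEPARATOR
            || type == Character.PARAGRAPH_SEPARATOR;
    }

    // Collects every separator code point in the inclusive range [from, to].
    static List<Integer> separatorsIn(int from, int to) {
        List<Integer> found = new ArrayList<>();
        for (int cp = from; cp <= to; cp++) {
            if (isSeparator(cp)) {
                found.add(cp);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumption: the Arabic block is the range to inspect.
        for (int cp : separatorsIn(0x0600, 0x06FF)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Widening the range passed to separatorsIn covers other blocks, such as General Punctuation, where the line and paragraph separators live.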
Hi Chris,
* It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
'\u2007', '\u202F').
* It is '\u0009', HORIZONTAL TABULATION.
* It is '\u000A', LINE FEED.
* It is '\u000B', VERTICAL
: return !Character.isWhitespace(c);
: And my class overrides that method like this:
: return !((int)c==32);
in my opinion that's a pretty naive change ... it won't split on tab
characters or newlines ... even for trivial ASCII text that's probably not
what you want.
: I think the Charact
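The difference between the two tests quoted above can be checked directly. This is a small sketch that only mirrors the two return expressions; the class and method names are hypothetical and it does not touch Lucene itself.

```java
// Hypothetical demo class: compares the stock whitespace test with the
// "c == 32" override discussed in this thread.
final class TokenCharDemo {
    // Mirrors the stock test: any Java whitespace character ends a token.
    static boolean stockIsTokenChar(char c) {
        return !Character.isWhitespace(c);
    }

    // Mirrors the override: only U+0020 (code 32) ends a token.
    static boolean spaceOnlyIsTokenChar(char c) {
        return !((int) c == 32);
    }

    public static void main(String[] args) {
        // Space, tab, line feed, and one Persian/Arabic letter (U+0633).
        char[] samples = {' ', '\t', '\n', '\u0633'};
        for (char c : samples) {
            System.out.printf("U+%04X stock=%b spaceOnly=%b%n",
                (int) c, stockIsTokenChar(c), spaceOnlyIsTokenChar(c));
        }
    }
}
```

Both tests treat the letter as a token character and the plain space as a separator; they disagree on tab and line feed, which the override keeps inside tokens.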
Sorry Steven,
that change is in WhitespaceTokenizer, not WhitespaceAnalyzer, but in the Analyzer
I had to call the tokenizer
On 5/24/07, Mohammad Norouzi <[EMAIL PROTECTED]> wrote:
Hi Steven
Thank you so much for your thorough comments about Analyzer
I wrote that class a couple of months ago, now I
Hi Steven
Thank you so much for your thorough comments about Analyzer
I wrote that class a couple of months ago, now I have taken a look at my
customized Analyzer.
The only change I've made is as follows:
the original class has this method:
protected boolean isTokenChar(char c) {
return !Character.isW
Hi Mohammad,
WhitespaceAnalyzer uses Java's Character.isWhitespace(char) method to
determine whether or not a character should be part of a token. As far
as I know, this method is problematic only for characters outside of the
Basic Multilingual Plane (BMP). I think Lucene should switch to using
Wow, very nice comments
Thank you so much Erick. You really showed me the way
--
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
You may have to index things twice, once for searching and once
UN_TOKENIZED for display. Say you have a bunch of service names
you want to display
service one
service two
service three
If you use WhitespaceAnalyzer, TOKENIZED you index the tokens
service (note, there are three of these)
one
two
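The effect described above can be imitated without Lucene. This is a rough sketch in plain Java, assuming whitespace splitting stands in for WhitespaceAnalyzer and keeping the whole string stands in for UN_TOKENIZED; the class and method names are invented for illustration.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only: models which distinct terms end up in an
// index for tokenized versus untokenized field values.
final class ServiceTermsSketch {
    // Tokenized: each value is split on whitespace, so words become terms.
    static SortedSet<String> tokenizedTerms(List<String> values) {
        SortedSet<String> terms = new TreeSet<>();
        for (String value : values) {
            for (String token : value.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    terms.add(token);
                }
            }
        }
        return terms;
    }

    // Untokenized: each whole value is a single term.
    static SortedSet<String> untokenizedTerms(List<String> values) {
        return new TreeSet<>(values);
    }

    public static void main(String[] args) {
        List<String> services =
            java.util.Arrays.asList("service one", "service two", "service three");
        System.out.println(tokenizedTerms(services));
        System.out.println(untokenizedTerms(services));
    }
}
```

Only the untokenized set preserves the full service names, which is why a second, display-only field may be needed.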
Hi Walter,
let me explain my problem in detail
I have a web page that lets a user build a simple query of his own.
For example, a user wants to locate a service with a specific value. He/she
doesn't know the exact name of the service, so I have to provide a list of
the available services (say in a combo box) and
Hi Steve,
No, I didn't make any changes to WhitespaceAnalyzer. I just extended the
original classes and overrode methods with my changes, so I don't think
I should contribute my classes.
My language is Persian, and the only change I've made is to not ignore
Unicode characters in Persi
Hi Mohammad,
May I ask what your language is? And what kind of changes to
WhitespaceAnalyzer were required to make it work with your language?
If you have made modifications to WhitespaceAnalyzer that are generally
useful, please consider contributing your changes back to the Lucene
project. Th
You have to turn on term vectors when indexing. Take a look at the
Field constructor that passes in TermVector.
-Grant
On May 22, 2007, at 8:09 AM, Mohammad Norouzi wrote:
I would use a term vector to get this. See
IndexReader.getTermFreqVector. You can get the term vector for just
field
I would use a term vector to get this. See
IndexReader.getTermFreqVector. You can get the term vector for just
field 3.
Grant, thanks. In my case, getTermFreqVector returns null, and I don't know why.
It accepts a docNumber as a parameter; what is it? Is that the same as the doc id?
If yes, it restricts the r
I would use a term vector to get this. See
IndexReader.getTermFreqVector. You can get the term vector for just
field 3.
-Grant
On May 22, 2007, at 5:29 AM, Mohammad Norouzi wrote:
Hi all
consider following index
field1 field2 field3
text1
Let's suppose you modify your WhitespaceAnalyzer not to use a
WhitespaceTokenizer, but a modified version of the Tokenizer which
tokenizes not on spaces but on something else, like '/' (this is just an
example, of course).
So suppose your real text document contains:
/text2 text3/text4 text5/text6
Wh
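The idea of splitting on '/' instead of whitespace can be sketched in plain Java. This is only an illustration of a character-predicate tokenizer, assuming it behaves like an isTokenChar-style test; the class and method names are hypothetical and it is not Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-in for a tokenizer driven by an isTokenChar-style test.
final class CharTokenizerSketch {
    // Emits maximal runs of characters for which isTokenChar returns true.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar.test(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A separator closes the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = "/text2 text3/text4 text5/text6";
        // Split on '/' only: spaces stay inside tokens.
        System.out.println(tokenize(doc, c -> c != '/'));
        // Split on whitespace: '/' stays inside tokens.
        System.out.println(tokenize(doc, c -> !Character.isWhitespace(c)));
    }
}
```

With the '/' predicate the sample line yields "text2 text3", "text4 text5", and "text6"; with the whitespace predicate it yields "/text2", "text3/text4", and "text5/text6".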
Walter,
Yes, I am using a customized WhitespaceAnalyzer while indexing.
I say customized because I found that the standard WhitespaceAnalyzer doesn't
accept Unicode terms in my language, so I made some changes to support that.
But for reading no Analyzer is used.
if I want to get that result, which ana
If Reader.terms() gives you:
text3
text4
while you expect
text3 text4
you should change, I presume, the Analyzer, maybe writing your own one.
Mohammad Norouzi wrote:
> Hi all
>
> consider following index
>
> field1 field2 field3
> text1 text1 text2
Hi all
consider following index
field1 field2 field3
text1 text1 text2 text3 text4
text4 text2 text2 text3 text5
I want to get all terms in field3.
If I use Reader.terms() it will return