Robert,

I'm not sure you are correct on this one.

If I have a Hebrew phrase:
[טוֹב עֶרֶב]
Then the first token that the filter receives is:
[עֶרֶב] (0,5)
and the second is:
[טוֹב] (6,10)
Which suggests that both the words and the offsets are counted from right to left.

Am I missing something?
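
One way to check this is to look at the stored characters rather than the rendered line. The sketch below is plain Java, not Lucene code. It assumes the phrase is stored as the code units spelled out in the escapes (ayin, segol, resh, segol, bet, space, tet, vav, holam, bet), which is only a guess at the original input, and the names `LogicalOffsets` and `wordOffsets` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class LogicalOffsets {
    /** Splits on whitespace and returns {start, end} offsets into the stored string. */
    static List<int[]> wordOffsets(String text) {
        List<int[]> out = new ArrayList<>();
        int start = -1; // -1 means "not inside a word"
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) {
                start = i;                     // a word begins here
            } else if (boundary && start >= 0) {
                out.add(new int[] {start, i}); // the word ends just before i
                start = -1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed stored form of the phrase, written with escapes so that no
        // editor or terminal can reorder it for display.
        String phrase = "\u05E2\u05B6\u05E8\u05B6\u05D1 \u05D8\u05D5\u05B9\u05D1";
        for (int[] o : wordOffsets(phrase)) {
            System.out.println("(" + o[0] + "," + o[1] + ")");
        }
    }
}
```

Under that assumption the sketch prints (0,5) and then (6,10), the same numbers as above. The word with offset 0 is simply the first word in the stored string; it is the display that puts it at the right edge of the line.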

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Monday, July 20, 2009 1:43 PM
To: java-user@lucene.apache.org
Subject: Re: question on custom filter

Obender, I don't think it's as difficult as you think. Your filter does
not need to be aware of this issue at all.
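
To make that concrete with the "you?" example from the original question (quoted below), here is a minimal sketch in plain Java, not Lucene code. It assumes that a sub-token's offsets are the original token's start offset plus the piece's position inside the term, so that offsets keep pointing into the original text. `SubTokenOffsets`, `Piece` and `split` are hypothetical names, and the splitting rule (runs of letters, digits and combining marks versus single other characters) is only an illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SubTokenOffsets {
    /** A piece of a token with offsets into the original text (hypothetical type). */
    static final class Piece {
        final String term;
        final int start;
        final int end;

        Piece(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + term + "] (" + start + "," + end + ")";
        }
    }

    /** Letters, digits and non-spacing marks (e.g. Hebrew points) stay together. */
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c)
                || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    /** Splits a term into word runs and single other characters, shifting offsets by tokenStart. */
    static List<Piece> split(String term, int tokenStart) {
        List<Piece> out = new ArrayList<>();
        int i = 0;
        while (i < term.length()) {
            int j = i + 1;
            if (isWordChar(term.charAt(i))) {
                while (j < term.length() && isWordChar(term.charAt(j))) {
                    j++;
                }
            }
            // Offsets are relative to the original text, not to the token.
            out.add(new Piece(term.substring(i, j), tokenStart + i, tokenStart + j));
            i = j;
        }
        return out;
    }

    public static void main(String[] args) {
        // The token [you?] (8,12) from the phrase "How are you?".
        for (Piece p : split("you?", 8)) {
            System.out.println(p);
        }
    }
}
```

For [you?] (8,12) this prints [you] (8,11) and [?] (11,12). The arithmetic only indexes into the stored string, so it contains no special case for text direction.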

In Unicode, text in right-to-left languages is stored in logical order.
The rendering system is what converts it to a right-to-left display
for RTL languages.

For example, in Arabic, "Robert 1234" displays as روبرت 1234
On your computer monitor, read from left to right, this looks like
1, 2, 3, 4, space, teh, reh, beh, waw, reh.

But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
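
The stored order can be inspected directly. This is a small sketch in plain Java; `LogicalOrderDemo` and `storedOrder` are made-up names, and the string is written with escapes so that no editor or terminal reorders it for display.

```java
import java.util.ArrayList;
import java.util.List;

final class LogicalOrderDemo {
    /** Returns the Unicode name of every code point, in the order it is stored. */
    static List<String> storedOrder(String text) {
        List<String> names = new ArrayList<>();
        text.codePoints().forEach(cp -> names.add(Character.getName(cp)));
        return names;
    }

    public static void main(String[] args) {
        // "Robert 1234" in Arabic: reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
        String text = "\u0631\u0648\u0628\u0631\u062A 1234";
        List<String> names = storedOrder(text);
        for (int i = 0; i < names.size(); i++) {
            System.out.println(i + ": " + names.get(i));
        }
    }
}
```

Index 0 is ARABIC LETTER REH and index 6 is DIGIT ONE, whatever the screen shows.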

2009/7/20 OBender <osya_ben...@hotmail.com>:
> Hi All!
>
>
>
> Let's say I have a filter that produces new tokens based on the original ones.
>
> How bad will it be if my filter sets the start of each token to 0 and end to
> the length of a token?
>
> An example (based on the phrase "How are you?"):
>
>
>
> Original token:
>
> [you?] (8,12)
>
>
>
> New tokens:
>
> [you] (0,3)
>
> [?] (0,1)
>
>
>
> It wouldn't be so hard to calculate the right numbers for left-to-right
> languages, and it is a bit more challenging for right-to-left ones,
> but for mixed text it is quite hard.
>
>
>
> Thanks.
>
>



-- 
Robert Muir
rcm...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


