RE: question on custom filter

OBender Mon, 20 Jul 2009 11:54:20 -0700

Here is the simple code. If you run it with English and with Hebrew, you will 
see that with English the tokens are returned from the left of the phrase to 
the right, and with Hebrew from the right to the left.

Again, I'm talking about tokens here, not the individual letters.

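import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Pass-through filter that prints every token it receives.
public class XFilter extends TokenFilter
{
        protected XFilter( TokenStream tokenStream ) {
                super( tokenStream );
        }

        @Override
        public Token next( final Token reusableToken ) throws IOException
        {
                Token nextToken = input.next( reusableToken );
                System.out.println( nextToken != null ? nextToken : "" );
                return nextToken;
        }
}

// Whitespace tokenizer wrapped in XFilter (same package, so the
// protected constructor is visible).
public class SimpleWhitespaceAnalyzer extends Analyzer
{
        @Override
        public TokenStream tokenStream( final String fieldName, final Reader reader )
        {
                TokenStream ts = new WhitespaceTokenizer( reader );
                ts = new XFilter( ts );

                return ts;
        }
}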
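
To take the console or debugger rendering out of the picture, it may help to print each token as hex code units next to its offsets. The following is only a sketch and does not use Lucene at all: it splits on whitespace by scanning the string from index 0 upward, which is what a whitespace tokenizer is assumed to do here. The names LogicalOrderDemo, Tok, tokenize and hex are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class LogicalOrderDemo {

    /** A token plus its start/end offsets into the original string (hypothetical helper type). */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace, scanning from index 0 upward (that is, in logical order). */
    static List<Tok> tokenize(String s) {
        List<Tok> out = new ArrayList<Tok>();
        int i = 0;
        while (i < s.length()) {
            while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < s.length() && !Character.isWhitespace(s.charAt(i))) {
                i++;
            }
            if (i > start) {
                out.add(new Tok(s.substring(start, i), start, i));
            }
        }
        return out;
    }

    /** Renders a string as hex code units so the output does not depend on RTL rendering. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ayin, segol, resh, segol, bet, space, tet, vav, holam, bet (stored in this order)
        String hebrew = "\u05E2\u05B6\u05E8\u05B6\u05D1 \u05D8\u05D5\u05B9\u05D1";
        for (Tok t : tokenize(hebrew)) {
            System.out.println(hex(t.text) + " (" + t.start + "," + t.end + ")");
        }
    }
}
```

With this input the first line printed is the ayin-resh-bet word at (0,5) and the second is the tet-vav-bet word at (6,10), whichever side of the screen each word is drawn on.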

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Monday, July 20, 2009 2:26 PM
To: java-user@lucene.apache.org
Subject: Re: question on custom filter

Obender, I think something in your environment / display environment
might be causing some confusion.

Are you using Microsoft Windows? If so, please verify that support for
right-to-left languages is enabled [Control Panel / Regional and
Language Options]. It is possible you are "seeing something different"
because your rendering system is not actually rendering right-to-left
text in right-to-left direction!

Second, instead of using a debugger, I would recommend using Luke to
look at the resulting tokens from your analyzer.

On Mon, Jul 20, 2009 at 2:21 PM, OBender<osya_ben...@hotmail.com> wrote:
> This is how it should be written:
> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91
>
> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Monday, July 20, 2009 2:07 PM
> To: java-user@lucene.apache.org
> Subject: Re: question on custom filter
>
> Obender, this is not true.
> The text you pasted is the following in Unicode:
>
> \N{HEBREW LETTER TET}
> \N{HEBREW LETTER VAV}
> \N{HEBREW POINT HOLAM}
> \N{HEBREW LETTER BET}
> \N{SPACE}
> \N{HEBREW LETTER AYIN}
> \N{HEBREW POINT SEGOL}
> \N{HEBREW LETTER RESH}
> \N{HEBREW POINT SEGOL}
> \N{HEBREW LETTER BET}
>
> You can use this utility to see how your text is encoded:
> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91
>
> For more information on directionality in Unicode, see
> http://unicode.org/reports/tr9/
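
The same kind of listing can also be produced locally rather than through the web utility. A minimal sketch, assuming Java 7 or later (for Character.getName); the class name CodePointNames and the method names are made up for this example, and the text is written with escapes so the source file encoding does not matter.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointNames {

    /** Returns the Unicode character names of s, in the order the code points are stored. */
    static List<String> names(String s) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(Character.getName(cp));
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // tet, vav, holam, bet, space, ayin, segol, resh, segol, bet
        String pasted = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";
        for (String name : names(pasted)) {
            System.out.println(name);
        }
    }
}
```

The list comes out in storage (logical) order, so it reads the same no matter how a terminal or mail client draws the string.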
>
> On Mon, Jul 20, 2009 at 1:59 PM, OBender<osya_ben...@hotmail.com> wrote:
>> Robert,
>>
>> I'm not sure you are correct on this one.
>>
>> If I have a Hebrew phrase:
>> [טוֹב עֶרֶב]
>> Then the first token that the filter receives is:
>> [עֶרֶב] (0,5)
>> and the second is:
>> [טוֹב] (6,10)
>> Which means that it counts from right to left (words and indexes).
>>
>> Am I missing something?
>>
>> -----Original Message-----
>> From: Robert Muir [mailto:rcm...@gmail.com]
>> Sent: Monday, July 20, 2009 1:43 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: question on custom filter
>>
>> Obender, I don't think it's as difficult as you think. Your filter does
>> not need to be aware of this issue at all.
>>
>> In Unicode, right-to-left languages are encoded in the data in logical order.
>> The rendering system is what converts it to display in right-to-left
>> for RTL languages.
>>
>> For example in Arabic, "Robert 1234" displays as روبرت 1234
>> To your computer monitor, this looks like 1, 2, 3, 4, space, teh, reh,
>> beh, waw, reh.
>>
>> But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
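
The difference between stored order and display order can be reproduced with the JDK's java.text.Bidi class. This is only a rough sketch: the class name VisualOrderDemo and its methods are made up, it works on UTF-16 code units (so it ignores surrogate pairs), and it does no shaping or mirroring the way a real renderer would.

```java
import java.text.Bidi;

final class VisualOrderDemo {

    /** Returns the chars of s in the left-to-right order a bidi-aware renderer would draw them. */
    static String visualOrder(String s) {
        // Base direction is taken from the first strong character in the text.
        Bidi bidi = new Bidi(s, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        byte[] levels = new byte[s.length()];
        Character[] chars = new Character[s.length()];
        for (int i = 0; i < s.length(); i++) {
            levels[i] = (byte) bidi.getLevelAt(i);
            chars[i] = s.charAt(i);
        }
        // Reorders the chars array in place according to the embedding levels.
        Bidi.reorderVisually(levels, 0, chars, 0, chars.length);
        StringBuilder sb = new StringBuilder();
        for (Character c : chars) {
            sb.append(c.charValue());
        }
        return sb.toString();
    }

    /** Hex dump, so the printed result is not itself reordered by the console. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stored (logical) order: reh, waw, beh, reh, teh, space, 1, 2, 3, 4
        String stored = "\u0631\u0648\u0628\u0631\u062A 1234";
        System.out.println("stored:  " + hex(stored));
        System.out.println("visual:  " + hex(visualOrder(stored)));
    }
}
```

A tokenizer only ever sees the "stored" sequence; the "visual" sequence exists only at display time.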
>>
>> 2009/7/20 OBender <osya_ben...@hotmail.com>:
>>> Hi All!
>>>
>>>
>>>
>>> Let's say I have a filter that produces new tokens based on the original ones.
>>>
>>> How bad will it be if my filter sets the start of each token to 0 and the end to
>>> the length of the token?
>>>
>>> An example (based on the phrase "How are you?"):
>>>
>>>
>>>
>>> Original token:
>>>
>>> [you?] (8,12)
>>>
>>>
>>>
>>> New tokens:
>>>
>>> [you] (0,3)
>>>
>>> [?] (0,1)
>>>
>>>
>>>
>>> It wouldn't be so hard to calculate the right numbers for left-to-right
>>> languages, and it is a bit more challenging to do it for right-to-left ones,
>>> but for mixed text it is quite hard.
>>>
>>>
>>>
>>> Thanks.
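
For the "you?" example above, offsets that stay relative to the original text can be had by adding the original token's start to each piece's position inside the token. The sketch below is not Lucene code: SubTokenOffsets, Span and split are hypothetical names, and the rule "split off trailing characters that are not letters, digits or combining marks" is just an assumption chosen for the illustration. Since the text is stored in logical order, the same arithmetic applies to Hebrew or mixed text.

```java
import java.util.ArrayList;
import java.util.List;

final class SubTokenOffsets {

    /** A piece of text with offsets into the original input (hypothetical helper type). */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + text + "] (" + start + "," + end + ")";
        }
    }

    /** Letters, digits and non-spacing marks (such as Hebrew points) count as word characters. */
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c)
                || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    /** Splits trailing punctuation off a token; tokenStart is the token's offset in the input. */
    static List<Span> split(String tokenText, int tokenStart) {
        int cut = tokenText.length();
        while (cut > 0 && !isWordChar(tokenText.charAt(cut - 1))) {
            cut--;
        }
        List<Span> out = new ArrayList<Span>();
        if (cut > 0) {
            out.add(new Span(tokenText.substring(0, cut), tokenStart, tokenStart + cut));
        }
        if (cut < tokenText.length()) {
            out.add(new Span(tokenText.substring(cut), tokenStart + cut,
                    tokenStart + tokenText.length()));
        }
        return out;
    }

    public static void main(String[] args) {
        // "you?" starts at offset 8 in "How are you?"
        for (Span s : split("you?", 8)) {
            System.out.println(s);
        }
    }
}
```

For the token [you?] (8,12) this yields [you] (8,11) and [?] (11,12) instead of (0,3) and (0,1).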
>>>
>>>
>>
>>
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
>
> --
> Robert Muir
> rcm...@gmail.com



-- 
Robert Muir
rcm...@gmail.com
