RE: question on custom filter

OBender Mon, 20 Jul 2009 13:42:43 -0700

No, it is reversed in the e-mail. Funny though, when I paste it into Excel 
the words switch to the correct order.
Thanks for all the help.


Maybe you have an idea what the problem could be.
Here is how my data gets read and indexed:

I have a UTF-8 CSV file that is produced from Excel.
I read it in with Java (preserving the UTF-8 encoding). At this point the 
strings look correct in the debugger.
I insert it into the DB (MySQL), which is also UTF-8.
Then I read it back and put it into the index.

It looks like the words in the UTF-8 CSV file are in "reverse" order from a 
grammatical standpoint (left to right, e.g., EREV leftmost, then TOV). Should a 
UTF-8 CSV file preserve the natural (language-specific) order of words?
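
One way to take the display layer out of the question is to decode the bytes and list the code points in the order they are stored, since an editor, debugger, or mail client may draw right-to-left text differently. The following is only a minimal sketch, assuming Java 8 or later; the class name `Utf8OrderCheck` is hypothetical and the sample bytes are the UTF-8 encoding of the first word from the URL quoted further down the thread.

```java
import java.nio.charset.StandardCharsets;

public class Utf8OrderCheck {
    /** Decodes UTF-8 bytes and lists the code points in stored order, e.g. "U+05D8 U+05D5". */
    static String codePoints(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample bytes (assumption): tet, vav, holam, bet encoded as UTF-8.
        byte[] bytes = {
            (byte) 0xD7, (byte) 0x98, (byte) 0xD7, (byte) 0x95,
            (byte) 0xD6, (byte) 0xB9, (byte) 0xD7, (byte) 0x91
        };
        System.out.println(codePoints(bytes));
    }
}
```

If the first code point printed for a line of the CSV is the first letter of the word that is read first, the file is in logical order, whatever the on-screen rendering suggests.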

 
-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Monday, July 20, 2009 3:49 PM
To: java-user@lucene.apache.org
Subject: Re: question on custom filter

Obender, does the following text appear like the image in the link, or not?

שומר אחי

http://farm1.static.flickr.com/3/10445435_75b4546703.jpg?v=0


On Mon, Jul 20, 2009 at 3:34 PM, OBender<osya_ben...@hotmail.com> wrote:
> I've checked, and it appears to be enabled.
>
> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Monday, July 20, 2009 3:18 PM
> To: java-user@lucene.apache.org
> Subject: Re: question on custom filter
>
> Obender, based on your previous comments (that you see text displayed
> in the wrong order), I again recommend that you enable support for RTL
> languages in your operating system, as I mentioned earlier. If you are
> using a Windows-based OS, this is not enabled by default!
>
> I think you are seeing things in the incorrect order, and this is
> causing confusion for you!
>
> On Mon, Jul 20, 2009 at 3:02 PM, Robert Muir<rcm...@gmail.com> wrote:
>> Obender, I ran your code and it did what I expected (but not what you 
>> pasted):
>>
>> First token is: (טוֹב,0,4)
>> Second token is: (עֶרֶב,5,10)
>>
>> I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same 
>> results.
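
The offsets above can be reproduced without Lucene. This is only a rough sketch of whitespace splitting with character offsets, not the WhitespaceTokenizer implementation; the class name `WhitespaceOffsets` is hypothetical, and offsets are counted in Java `char` units in stored order.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceOffsets {
    /** Splits on whitespace and returns entries formatted as (term,start,end). */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip any whitespace before the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token up to the next whitespace or end of text.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                out.add("(" + text.substring(start, i) + "," + start + "," + i + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The Hebrew phrase from the thread, written as escapes in stored order:
        // tet, vav, holam, bet, space, ayin, segol, resh, segol, bet.
        String hebrew = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";
        tokenize(hebrew).forEach(System.out::println);
        tokenize("How are you?").forEach(System.out::println);
    }
}
```

The first token starts at offset 0 because it is first in the stored string, regardless of which side of the screen it is drawn on.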
>>
>> On Mon, Jul 20, 2009 at 2:53 PM, OBender<osya_ben...@hotmail.com> wrote:
>>> Here is the simple code. If you run it with English and with Hebrew, you 
>>> will see that with English the tokens are returned from the left of the 
>>> phrase to the right, and with Hebrew from the right to the left.
>>>
>>> Again, I'm talking about tokens here, not the individual letters.
>>>
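>>> import java.io.IOException;
>>> import java.io.Reader;
>>>
>>> import org.apache.lucene.analysis.Analyzer;
>>> import org.apache.lucene.analysis.Token;
>>> import org.apache.lucene.analysis.TokenFilter;
>>> import org.apache.lucene.analysis.TokenStream;
>>> import org.apache.lucene.analysis.WhitespaceTokenizer;
>>>
>>> // Pass-through filter: prints each token it receives and returns it unchanged.
>>> public class XFilter extends TokenFilter
>>> {
>>>        protected XFilter( TokenStream tokenStream ) {
>>>                super( tokenStream );
>>>        }
>>>
>>>        @Override
>>>        public Token next( final Token reusableToken ) throws IOException
>>>        {
>>>                // Pull the next token from the wrapped stream (null at end of input).
>>>                Token nextToken = input.next( reusableToken );
>>>                System.out.println( nextToken != null ? nextToken : "" );
>>>                return nextToken;
>>>        }
>>> }
>>>
>>> // Splits on whitespace, then runs every token through XFilter.
>>> public class SimpleWhitespaceAnalyzer extends Analyzer
>>> {
>>>        @Override
>>>        public TokenStream tokenStream( final String fieldName, final Reader reader )
>>>        {
>>>                TokenStream ts = new WhitespaceTokenizer( reader );
>>>                ts = new XFilter( ts );
>>>
>>>                return ts;
>>>        }
>>> }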
>>>
>>> -----Original Message-----
>>> From: Robert Muir [mailto:rcm...@gmail.com]
>>> Sent: Monday, July 20, 2009 2:26 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: question on custom filter
>>>
>>> Obender, I think something in your environment / display environment
>>> might be causing some confusion.
>>>
>>> Are you using Microsoft Windows? If so, please verify that support for
>>> right-to-left languages is enabled [Control Panel / Regional and
>>> Language Options]. It is possible you are "seeing something different"
>>> because your rendering system is not actually rendering right-to-left
>>> text in the right-to-left direction!
>>>
>>> Second, instead of using a debugger, I would recommend using Luke to
>>> look at the resulting tokens from your analyzer.
>>>
>>> On Mon, Jul 20, 2009 at 2:21 PM, OBender<osya_ben...@hotmail.com> wrote:
>>>> This is how it should be written:
>>>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91
>>>>
>>>> -----Original Message-----
>>>> From: Robert Muir [mailto:rcm...@gmail.com]
>>>> Sent: Monday, July 20, 2009 2:07 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: question on custom filter
>>>>
>>>> Obender, this is not true.
>>>> The text you pasted is the following in Unicode:
>>>>
>>>> \N{HEBREW LETTER TET}
>>>> \N{HEBREW LETTER VAV}
>>>> \N{HEBREW POINT HOLAM}
>>>> \N{HEBREW LETTER BET}
>>>> \N{SPACE}
>>>> \N{HEBREW LETTER AYIN}
>>>> \N{HEBREW POINT SEGOL}
>>>> \N{HEBREW LETTER RESH}
>>>> \N{HEBREW POINT SEGOL}
>>>> \N{HEBREW LETTER BET}
>>>>
>>>> You can use this utility to see how your text is encoded:
>>>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91
>>>>
>>>> For more information on directionality in Unicode, see
>>>> http://unicode.org/reports/tr9/
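
The same listing of character names can be produced locally. A minimal sketch, assuming Java 8 or later (`Character.getName` has been available since Java 7); the class name `CodePointNames` is hypothetical and the string is the pasted phrase written as escapes in stored order.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CodePointNames {
    /** Returns the Unicode character name of every code point, in stored (logical) order. */
    static List<String> names(String s) {
        return s.codePoints()
                .mapToObj(Character::getName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // tet, vav, holam, bet, space, ayin, segol, resh, segol, bet
        String text = "\u05D8\u05D5\u05B9\u05D1 \u05E2\u05B6\u05E8\u05B6\u05D1";
        names(text).forEach(System.out::println);
    }
}
```

Because the names are printed one per line, the output cannot be reordered by a bidirectional renderer, which makes it a more reliable check than looking at the string itself.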
>>>>
>>>> On Mon, Jul 20, 2009 at 1:59 PM, OBender<osya_ben...@hotmail.com> wrote:
>>>>> Robert,
>>>>>
>>>>> I'm not sure you are correct on this one.
>>>>>
>>>>> If I have a Hebrew phrase:
>>>>> [טוֹב עֶרֶב]
>>>>> Then the first token that the filter receives is:
>>>>> [עֶרֶב] (0,5)
>>>>> and the second is:
>>>>> [טוֹב] (6,10)
>>>>> Which means that it counts from right to left (words and indexes).
>>>>>
>>>>> Am I missing something?
>>>>>
>>>>> -----Original Message-----
>>>>> From: Robert Muir [mailto:rcm...@gmail.com]
>>>>> Sent: Monday, July 20, 2009 1:43 PM
>>>>> To: java-user@lucene.apache.org
>>>>> Subject: Re: question on custom filter
>>>>>
>>>>> Obender, I don't think it's as difficult as you think. Your filter does
>>>>> not need to be aware of this issue at all.
>>>>>
>>>>> In Unicode, right-to-left languages are encoded in the data in logical 
>>>>> order.
>>>>> The rendering system is what lays the text out right-to-left
>>>>> for RTL languages.
>>>>>
>>>>> For example, in Arabic, "Robert 1234" displays as روبرت 1234
>>>>> On your computer monitor, read left to right, this looks like
>>>>> 1, 2, 3, 4, space, teh, reh, beh, waw, reh.
>>>>>
>>>>> But the Unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4.
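
The difference between stored order and display order can be shown with the JDK's `java.text.Bidi` class. This is only an illustrative sketch: the class name `LogicalOrderDemo` is hypothetical, it handles only characters in the Basic Multilingual Plane, and a real rendering system does considerably more (shaping, mirroring, line breaking).

```java
import java.text.Bidi;

public class LogicalOrderDemo {
    // "Robert 1234" in Arabic, in stored (logical) order:
    // reh, waw, beh, reh, teh, space, 1, 2, 3, 4
    static final String TEXT = "\u0631\u0648\u0628\u0631\u062A 1234";

    /** Returns the characters of s in left-to-right display order. */
    static String visualOrder(String s) {
        Bidi bidi = new Bidi(s, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        byte[] levels = new byte[s.length()];
        Character[] chars = new Character[s.length()];
        for (int i = 0; i < s.length(); i++) {
            levels[i] = (byte) bidi.getLevelAt(i); // resolved embedding level per char
            chars[i] = s.charAt(i);
        }
        // Reorders the chars array in place from logical to visual order.
        Bidi.reorderVisually(levels, 0, chars, 0, chars.length);
        StringBuilder sb = new StringBuilder();
        for (Character c : chars) {
            sb.append(c);
        }
        return sb.toString();
    }

    /** Formats each char as a hex code so the output itself is not reordered on screen. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("%04X ", (int) s.charAt(i)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("logical: " + hex(TEXT));
        System.out.println("visual:  " + hex(visualOrder(TEXT)));
    }
}
```

Token offsets refer to positions in the logical string, which is why a filter can ignore display direction entirely.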
>>>>>
>>>>> 2009/7/20 OBender <osya_ben...@hotmail.com>:
>>>>>> Hi All!
>>>>>>
>>>>>> Let's say I have a filter that produces new tokens based on the original 
>>>>>> ones.
>>>>>>
>>>>>> How bad will it be if my filter sets the start of each token to 0 and 
>>>>>> the end to the length of the token?
>>>>>>
>>>>>> An example (based on the phrase "How are you?"):
>>>>>>
>>>>>> Original token:
>>>>>> [you?] (8,12)
>>>>>>
>>>>>> New tokens:
>>>>>> [you] (0,3)
>>>>>> [?] (0,1)
>>>>>>
>>>>>> It wouldn't be so hard to calculate the right numbers for left-to-right
>>>>>> languages, and it is a bit more challenging to do it for right-to-left 
>>>>>> ones, but for mixed text it is quite hard.
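
For comparison, here is a sketch of keeping offsets relative to the original text when a token is split: each new token's start is the original start plus its position inside the original term. The class name `SplitOffsets` and the splitting rule (trailing non-alphanumeric characters become a second token) are assumptions for illustration, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitOffsets {
    /** Formats a token in the notation used in this thread: [term] (start,end). */
    static String fmt(String term, int start, int end) {
        return "[" + term + "] (" + start + "," + end + ")";
    }

    /**
     * Splits trailing punctuation off a token. The start argument is the
     * token's start offset in the original text; the returned offsets are
     * absolute, not relative to the token.
     */
    static List<String> split(String term, int start) {
        int cut = term.length();
        // Walk back over trailing characters that are not letters or digits.
        while (cut > 0 && !Character.isLetterOrDigit(term.charAt(cut - 1))) {
            cut--;
        }
        List<String> out = new ArrayList<>();
        if (cut > 0) {
            out.add(fmt(term.substring(0, cut), start, start + cut));
        }
        if (cut < term.length()) {
            out.add(fmt(term.substring(cut), start + cut, start + term.length()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Original token from the example: [you?] (8,12)
        split("you?", 8).forEach(System.out::println);
    }
}
```

With the example token this yields [you] (8,11) and [?] (11,12). The arithmetic uses only positions within the stored string, so it does not depend on the direction in which the text is displayed.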
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>
>>>>> --
>>>>> Robert Muir
>>>>> rcm...@gmail.com
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org



-- 
Robert Muir
rcm...@gmail.com
