Ok, it makes a lot of sense (the input being incorrect). Let's just verify that :)
At the end of the line: "but the text you sent as an example was" what I see is word TOV [טוֹב] on the left and EREV [עֶרֶב] on the right. So it reads (for me) EREV TOV which is correct. At the end of the line: " Shouldn't the adjective follow the noun like this " what I see is the word EREV [עֶרֶב] on the left and TOV [טוֹב] on the right. So it reads (for me) TOV EREV which is not correct. Is the above the way you see the Hebrew text or it is other way around for you :) ? -----Original Message----- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Monday, July 20, 2009 3:34 PM To: java-user@lucene.apache.org Subject: Re: question on custom filter Obender, I think your input is incorrect. The hebrew text you pasted in your example appears incorrect. Its gonna be hard for me to communicate this since I think your computer is not displaying hebrew correctly :) but the text you sent as an example was [טוֹב עֶרֶב] Shouldn't the adjective follow the noun like this: עֶרֶב טוֹב This makes me think your input is incorrect because its being rendered incorrectly, as I mentioned this isn't enabled by default in windows. But your input appears correct to you :) On Mon, Jul 20, 2009 at 3:29 PM, OBender<osya_ben...@hotmail.com> wrote: > Interesting, the question now is why am I seeing (even in println) what I'm > seeing :) > I'm reading a string from the file which is in UTF-8 encoding. Could this > somehow be related...? > > -----Original Message----- > From: Robert Muir [mailto:rcm...@gmail.com] > Sent: Monday, July 20, 2009 3:03 PM > To: java-user@lucene.apache.org > Subject: Re: question on custom filter > > Obender, i ran your code and it did what I expected (but not what you pasted): > > First token is: (טוֹב,0,4) > Second token is: (עֶרֶב,5,10) > > I also loaded up your SimpleWhitespaceAnalyzer in Luke, with the same results. > > On Mon, Jul 20, 2009 at 2:53 PM, OBender<osya_ben...@hotmail.com> wrote: >> Here is the simple code. If you run it with English and with Hebrew you will >> see that in case of English tokens returned from the left of the phrase to >> the right and with Hebrew from the right to the left. >> >> Again I'm talking about tokens not the individual letters here. >> >> public class XFilter extends TokenFilter >> { >> protected XFilter( TokenStream tokenStream ) { >> super( tokenStream ); >> } >> >> @Override >> public Token next( final Token reusableToken ) throws IOException >> { >> Token nextToken = input.next( reusableToken ); >> System.out.println( nextToken != null? nextToken: "" ); >> return nextToken; >> } >> } >> >> public class SimpleWhitespaceAnalyzer extends Analyzer >> { >> @Override >> public TokenStream tokenStream( final String fieldName, final Reader >> reader ) >> { >> TokenStream ts = new WhitespaceTokenizer( reader ); >> ts = new XFilter( ts ); >> >> return ts; >> } >> } >> >> -----Original Message----- >> From: Robert Muir [mailto:rcm...@gmail.com] >> Sent: Monday, July 20, 2009 2:26 PM >> To: java-user@lucene.apache.org >> Subject: Re: question on custom filter >> >> Obender, I think something in your environment / display environment >> might be causing some confusion. >> >> Are you using microsoft windows? If so, please verify that support for >> right-to-left languages is enabled [control panel/regional and >> language options]. It is possible you are "seeing something different" >> because your rendering system is not actually rendering right-to-left >> text in right-to-left direction!!!! >> >> Second, Instead of using a debugger, I would recommend using Luke to >> look at resulting tokens from your analyzer. >> >> On Mon, Jul 20, 2009 at 2:21 PM, OBender<osya_ben...@hotmail.com> wrote: >>> This is how it should be written: >>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%A2%D6%B6%D7%A8%D6%B6%D7%91+%D7%98%D7%95%D6%B9%D7%91 >>> >>> -----Original Message----- >>> From: Robert Muir [mailto:rcm...@gmail.com] >>> Sent: Monday, July 20, 2009 2:07 PM >>> To: java-user@lucene.apache.org >>> Subject: Re: question on custom filter >>> >>> Obender, This is not true. >>> the text you pasted is the following in unicode: >>> >>> \N{HEBREW LETTER TET} >>> \N{HEBREW LETTER VAV} >>> \N{HEBREW POINT HOLAM} >>> \N{HEBREW LETTER BET} >>> \N{SPACE} >>> \N{HEBREW LETTER AYIN} >>> \N{HEBREW POINT SEGOL} >>> \N{HEBREW LETTER RESH} >>> \N{HEBREW POINT SEGOL} >>> \N{HEBREW LETTER BET} >>> >>> you can use this utility to see how your text is encoded: >>> http://unicode.org/cldr/utility/transform.jsp?a=name&b=%D7%98%D7%95%D6%B9%D7%91+%D7%A2%D6%B6%D7%A8%D6%B6%D7%91 >>> >>> For more information on directionality in unicode, see >>> http://unicode.org/reports/tr9/ >>> >>> On Mon, Jul 20, 2009 at 1:59 PM, OBender<osya_ben...@hotmail.com> wrote: >>>> Robert, >>>> >>>> I'm not sure you are correct on this one. >>>> >>>> If I have a Hebrew phrase: >>>> [טוֹב עֶרֶב] >>>> Then first token that filter receives is: >>>> [עֶרֶב] (0,5) >>>> and the second is: >>>> [טוֹב] (6,10) >>>> Which means that it counts from right to left (words and indexes). >>>> >>>> Am I missing something? >>>> >>>> -----Original Message----- >>>> From: Robert Muir [mailto:rcm...@gmail.com] >>>> Sent: Monday, July 20, 2009 1:43 PM >>>> To: java-user@lucene.apache.org >>>> Subject: Re: question on custom filter >>>> >>>> Obender, I don't think its as difficult as you think. Your filter does >>>> not need to be aware of this issue at all. >>>> >>>> In unicode, right-to-left languages are encoded in the data in logical >>>> order. >>>> The rendering system is what converts it to display in right-to-left >>>> for RTL languages. >>>> >>>> For example in Arabic, "Robert 1234" displays as روبرت 1234 >>>> To your computer monitor, this looks like 1, 2, 3, 4, space, teh, reh, >>>> beh, waw, reh >>>> >>>> But the unicode text is reh, waw, beh, reh, teh, space, 1, 2, 3, 4. >>>> >>>> 2009/7/20 OBender <osya_ben...@hotmail.com>: >>>>> Hi All! >>>>> >>>>> >>>>> >>>>> Let say I have a filter that produces new tokens based on the original >>>>> ones. >>>>> >>>>> How bad will it be if my filter sets the start of each token to 0 and end >>>>> to >>>>> the length of a token? >>>>> >>>>> An example (based on the phrase "How are you?": >>>>> >>>>> >>>>> >>>>> Original token: >>>>> >>>>> [you?] (8,12) >>>>> >>>>> >>>>> >>>>> New tokens: >>>>> >>>>> [you] (0,3) >>>>> >>>>> [?] (0,1) >>>>> >>>>> >>>>> >>>>> It wouldn't be so hard to calculate the right numbers for left to right >>>>> languages and it is a bit more challenging to do it for right to left ones >>>>> but for mixed text it is quite hard. >>>>> >>>>> >>>>> >>>>> Thanks. >>>>> >>>>> >>>> >>>> >>>> >>>> -- >>>> Robert Muir >>>> rcm...@gmail.com >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>>> >>> >>> >>> >>> -- >>> Robert Muir >>> rcm...@gmail.com >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >> >> >> >> -- >> Robert Muir >> rcm...@gmail.com >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > > > -- > Robert Muir > rcm...@gmail.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org