> I think we would like to implement the complete unicode rules, so if you > could provide us with some code that would be great.
ok, I will followup... what version of lucene are you using, 2.9? ... > but having read the > details it would seem to convert a half width character you would have to > know you were looking at chinese (or korean/japanses ecetera) , but as the > Musicbrainz system supports any language and the user doesn't specify the > language being used when searching no, theres no language involved... why would you not simply apply the filter all the time. if i am looking at T (fullwidth character T), it should indexed as T everytime (or later probably t if you are going to apply lowercasefilter) > I assume once again you have to know the script being used in order to do > this this is ok, because normalization, if you want to do it that way, is definitely not language dependent! its not like collation, where you have a locale 'parameter', its a language-independent process. http://unicode.org/reports/tr15/ > I think there are two issues, firstly the data needs to be indexed to always > use gerhayim is this what you are suggesting I couldn't follow how to change > jflex. you are right, for you there are a couple issues. first, i do not know what standardtokenizer does with geresh/gershayim, forget about single quote/double quote. but to fix the tokenization (gershayim example), you want to ensure you do not split on these. since this is used in hebrew acronym, i would modify the acronym rule to allow [hebrew letter]+ (" | ״) [hebrew letter]+ next, if you want these to be indexed the same so that ארה"ב and ארה״ב will match, you will need to create a tokenfilter to standardize " to ״ for acronyms. > Then its an issue for the query parser that the user uses a " for searching > but doesn't escape it, but I cannot automatically escape it because it may > not be Hebrew. yes, you have a queryparser parsing ambiguity because " is also the phrase operator. I don't know what to recommend here off the top of my head... do you allow phrase queries? also as an fyi, when i say according to unicode they should be using gershayim instead of double-quote, this is a little theoretical. its not very user-friendly to expect users to use gershayim for input, when its not even on hebrew keyboard layout...! http://en.wikipedia.org/wiki/Hebrew_keyboard#Inaccessible_punctuation -- Robert Muir rcm...@gmail.com --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org