Robert Muir wrote:
I think we would like to implement the complete Unicode rules, so if you
could provide us with some code, that would be great.

ok, I will follow up... what version of Lucene are you using, 2.9?

...
Yes
but having read the
details, it would seem that to convert a halfwidth character you would have
to know you were looking at Chinese (or Korean, Japanese, etc.), but as the
MusicBrainz system supports any language and the user doesn't specify the
language being used when searching

no, there's no language involved... why would you not simply apply the
filter all the time?
if I am looking at Ｔ (fullwidth character T), it should be indexed as T
every time (or later probably t if you are going to apply
LowerCaseFilter)

I'm obviously misunderstanding. I thought that halfwidth was an encoding to allow storing the most common Chinese characters in a single byte, so the characters would be read as different characters if you assumed they were using the halfwidth encoding rather than a Latin encoding. But are you saying halfwidth characters are actually valid Unicode characters with their own distinct Unicode values, so I can just use a CharFilter again to map these? Please confirm.

I assume once again you have to know the script being used in order to do
this

this is ok, because normalization, if you want to do it that way, is
definitely not language dependent!
it's not like collation, where you have a locale 'parameter'; it's a
language-independent process.
http://unicode.org/reports/tr15/
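
As a rough illustration of the point that width folding needs no language
information, here is a small sketch using the JDK's java.text.Normalizer with
the NFKC form from TR15. The class name WidthFolding is made up for this
example, and wiring it into a Lucene analysis chain is not shown. Note that
NFKC also folds other compatibility characters (ligatures, superscripts and
so on), which may or may not be wanted.

```java
import java.text.Normalizer;
import java.util.Locale;

final class WidthFolding {
    // NFKC applies Unicode compatibility mappings, which include mapping
    // fullwidth and halfwidth forms to their ordinary counterparts.
    // No locale or language argument is involved.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String fullwidthT = "\uFF34";   // FULLWIDTH LATIN CAPITAL LETTER T
        String halfwidthKa = "\uFF76";  // HALFWIDTH KATAKANA LETTER KA

        // Fullwidth T folds to the ordinary Latin capital T.
        System.out.println(fold(fullwidthT).equals("T"));
        // Halfwidth KA folds to the ordinary katakana KA (U+30AB).
        System.out.println(fold(halfwidthKa).equals("\u30AB"));
        // Lowercasing afterwards gives the form a lowercase filter would index.
        System.out.println(fold(fullwidthT).toLowerCase(Locale.ROOT));
    }
}
```

The fullwidth and halfwidth forms are ordinary Unicode code points in their
own right, so the same call works whatever language the surrounding text is in.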

I think there are two issues. Firstly, the data needs to be indexed to always
use gershayim; is this what you are suggesting? I couldn't follow how to
change JFlex.

you are right, for you there are a couple of issues.
first, I do not know what StandardTokenizer does with
geresh/gershayim, never mind single quote/double quote.

but to fix the tokenization (gershayim example), you want to ensure
you do not split on these.
since this is used in Hebrew acronyms, I would modify the acronym rule to
allow:

[hebrew letter]+ (" | ״) [hebrew letter]+
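
The rule above can be approximated outside JFlex as a plain Java regular
expression. This is only a sketch: it assumes "hebrew letter" means the range
U+05D0 to U+05EA (alef to tav), the class name HebrewAcronym is hypothetical,
and it is not the StandardTokenizer grammar itself.

```java
import java.util.regex.Pattern;

final class HebrewAcronym {
    // One or more Hebrew letters, then either an ASCII double quote or
    // HEBREW PUNCTUATION GERSHAYIM (U+05F4), then one or more Hebrew letters.
    static final Pattern ACRONYM =
        Pattern.compile("[\\u05D0-\\u05EA]+[\"\\u05F4][\\u05D0-\\u05EA]+");

    // True only if the whole token has the acronym shape.
    static boolean isAcronym(String token) {
        return ACRONYM.matcher(token).matches();
    }

    public static void main(String[] args) {
        String withQuote = "\u05D0\u05E8\u05D4\"\u05D1";          // double quote
        String withGershayim = "\u05D0\u05E8\u05D4\u05F4\u05D1";  // gershayim
        System.out.println(isAcronym(withQuote));
        System.out.println(isAcronym(withGershayim));
        System.out.println(isAcronym("USA"));
    }
}
```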

next, if you want these to be indexed the same so that ארה"ב and ארה״ב
will match, you will need to create a TokenFilter
to standardize " to ״ for acronyms.
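
A minimal sketch of the character-level normalization such a filter might
perform. It is written as a plain string method rather than an actual Lucene
TokenFilter, and the class name GershayimNormalizer is hypothetical. It only
rewrites a double quote that sits directly between two Hebrew letters (assumed
here to be U+05D0 to U+05EA); a real filter would presumably also restrict
itself to tokens that the tokenizer marked as acronyms.

```java
final class GershayimNormalizer {
    static final char GERSHAYIM = '\u05F4';  // HEBREW PUNCTUATION GERSHAYIM

    static boolean isHebrewLetter(char c) {
        return c >= '\u05D0' && c <= '\u05EA';
    }

    // Replace an ASCII double quote with gershayim when it has a Hebrew
    // letter immediately before and after it; leave everything else alone.
    static String normalize(String token) {
        StringBuilder out = new StringBuilder(token);
        for (int i = 1; i < out.length() - 1; i++) {
            if (out.charAt(i) == '"'
                    && isHebrewLetter(out.charAt(i - 1))
                    && isHebrewLetter(out.charAt(i + 1))) {
                out.setCharAt(i, GERSHAYIM);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String withQuote = "\u05D0\u05E8\u05D4\"\u05D1";
        String withGershayim = "\u05D0\u05E8\u05D4\u05F4\u05D1";
        // Both spellings end up as the same indexed form.
        System.out.println(normalize(withQuote).equals(withGershayim));
        System.out.println(normalize(withGershayim).equals(withGershayim));
    }
}
```
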

Oh I see, so we convert one to the other, but only when the token matches
ACRONYM_TYPE. Then it's an issue for the query parser that the user uses a "
for searching but doesn't escape it, but I cannot automatically escape it
because it may not be Hebrew.

yes, you have a QueryParser parsing ambiguity because " is also the
phrase operator.
I don't know what to recommend here off the top of my head... do you
allow phrase queries?
Yes, we do; we allow full Lucene syntax if the 'Advanced Query' option is selected at http://musicbrainz.org/

also, as an FYI, when I say that according to Unicode they should be using
gershayim instead of double quote, this is a little theoretical.
it's not very user-friendly to expect users to use gershayim for input,
when it's not even on the Hebrew keyboard layout...!

http://en.wikipedia.org/wiki/Hebrew_keyboard#Inaccessible_punctuation

Understood, so I think users will continue to use the double quote character in their searches.
