Re: Best way to create own version of StandardTokenizer ?

Paul Taylor Mon, 07 Sep 2009 07:47:48 -0700

Robert Muir wrote:

I think we would like to implement the complete unicode rules, so if you
could provide us with some code that would be great.


ok, I will followup... what version of lucene are you using, 2.9?

...

Yes

but having read the
details it would seem to convert a half width character you would have to
know you were looking at chinese (or korean/japanses ecetera) , but as the
Musicbrainz system supports any language and the user doesn't specify the
language being used  when searching


no, theres no language involved... why would you not simply apply the
filter all the time.
if i am looking at Ｔ (fullwidth character T), it should indexed as T
everytime (or later probably t if you are going to apply
lowercasefilter)

I'm obviously misunderstanding I thought that Halfwidth was an encodingto allow storing the most common Chinese characters in a single byte,therefore the charcters would be read as different characters if youassumed they were using the HalfWidth Encoding rather than LatinEncoding. But are you saying Halfwidth characters are actually validUnicode characters with their own distinct unicode value so can justuse a CharFilter again to map these, please confirm.

I assume once again you have to know the script being used in order to do
this


this is ok, because normalization, if you want to do it that way, is
definitely not language dependent!
its not like collation, where you have a locale 'parameter', its a
language-independent process.
http://unicode.org/reports/tr15/

I think there are two issues, firstly the data needs to be indexed to always
use gerhayim is this what you are suggesting I couldn't follow how to change
jflex.


you are right, for you there are a couple issues.
first, i do not know what standardtokenizer does with
geresh/gershayim, forget about single quote/double quote.

but to fix the tokenization (gershayim example), you want to ensure
you do not split on these.
since this is used in hebrew acronym, i would modify the acronym rule to allow

[hebrew letter]+ (" | ״) [hebrew letter]+

next, if you want these to be indexed the same so that ארה"ב and ארה״ב
will match, you will need to create a tokenfilter
to standardize " to ״ for acronyms.

Oh I see , so we convert one to the other, but only when matchesACRONYM_TYPE

Then its an issue for the query parser that the user uses a " for searching
but doesn't escape it, but I cannot automatically escape it because it may
not be Hebrew.


yes, you have a queryparser parsing ambiguity because " is also the
phrase operator.
I don't know what to recommend here off the top of my head... do you
allow phrase queries?

Yes we do , we allow full Lucene syntax if the 'Advanced Query' optionis selected at http://musicbrainz.org/

also as an fyi, when i say according to unicode they should be using
gershayim instead of double-quote, this is a little theoretical.
its not very user-friendly to expect users to use gershayim for input,
when its not even on hebrew keyboard layout...!

http://en.wikipedia.org/wiki/Hebrew_keyboard#Inaccessible_punctuation

Understood, so I think users will continue to use the Double QuotesCharacter in their searches


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Best way to create own version of StandardTokenizer ?

Reply via email to