Re: Best way to create own version of StandardTokenizer ?

Robert Muir Mon, 07 Sep 2009 07:19:21 -0700

> I think we would like to implement the complete unicode rules, so if you
> could provide us with some code that would be great.


ok, I will followup... what version of lucene are you using, 2.9?

...
> but having read the
> details it would seem to convert a half width character you would have to
> know you were looking at chinese (or korean/japanses ecetera) , but as the
> Musicbrainz system supports any language and the user doesn't specify the
> language being used  when searching

no, theres no language involved... why would you not simply apply the
filter all the time.
if i am looking at Ｔ (fullwidth character T), it should indexed as T
everytime (or later probably t if you are going to apply
lowercasefilter)

> I assume once again you have to know the script being used in order to do
> this

this is ok, because normalization, if you want to do it that way, is
definitely not language dependent!
its not like collation, where you have a locale 'parameter', its a
language-independent process.
http://unicode.org/reports/tr15/

> I think there are two issues, firstly the data needs to be indexed to always
> use gerhayim is this what you are suggesting I couldn't follow how to change
> jflex.

you are right, for you there are a couple issues.
first, i do not know what standardtokenizer does with
geresh/gershayim, forget about single quote/double quote.

but to fix the tokenization (gershayim example), you want to ensure
you do not split on these.
since this is used in hebrew acronym, i would modify the acronym rule to allow

[hebrew letter]+ (" | ״) [hebrew letter]+

next, if you want these to be indexed the same so that ארה"ב and ארה״ב
will match, you will need to create a tokenfilter
to standardize " to ״ for acronyms.

> Then its an issue for the query parser that the user uses a " for searching
> but doesn't escape it, but I cannot automatically escape it because it may
> not be Hebrew.

yes, you have a queryparser parsing ambiguity because " is also the
phrase operator.
I don't know what to recommend here off the top of my head... do you
allow phrase queries?

also as an fyi, when i say according to unicode they should be using
gershayim instead of double-quote, this is a little theoretical.
its not very user-friendly to expect users to use gershayim for input,
when its not even on hebrew keyboard layout...!

http://en.wikipedia.org/wiki/Hebrew_keyboard#Inaccessible_punctuation

-- 
Robert Muir
rcm...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Best way to create own version of StandardTokenizer ?

Reply via email to