George Rhoten created LUCENE-5110:
-------------------------------------
Summary: DefaultICUTokenizerConfig should use the default ICU
behavior for the Khmer script
Key: LUCENE-5110
URL: https://issues.apache.org/jira/browse/LUCENE-5110
Project: Lucene - Core
Issue Type: Bug
Components: modules/other
Affects Versions: 4.0
Reporter: George Rhoten
Recent versions of ICU ship their own implementation for tokenizing the Khmer
script, so Lucene should no longer override ICU's behavior.
I haven't tried the patch out, but it should look something like the
following:
$ diff DefaultICUTokenizerConfig.java.orig DefaultICUTokenizerConfig.java
67,68d66
< private static final BreakIterator thaiBreakIterator =
< BreakIterator.getWordInstance(new ULocale("th_TH"));
71,72d68
< private static final BreakIterator khmerBreakIterator =
< readBreakIterator("Khmer.brk");
87d82
< case UScript.THAI: return (BreakIterator)thaiBreakIterator.clone();
89d83
< case UScript.KHMER: return (BreakIterator)khmerBreakIterator.clone();
The Khmer.* files should also be removed. ICU already does script-specific
tokenization these days, so the Thai override should not be needed either as of
ICU 50.
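
To illustrate the intent of the change, here is a minimal sketch of per-script dispatch where removing an override lets a script fall through to the default iterator. It is not the Lucene class: ScriptDispatchSketch and its methods are hypothetical names, and it uses the JDK's java.text.BreakIterator and Character.UnicodeScript as stand-ins for com.ibm.icu.text.BreakIterator and UScript so that it runs without ICU4J on the classpath. It says nothing about how well either library segments Khmer or Thai text.

```java
import java.text.BreakIterator;
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;

/**
 * Hypothetical stand-in for a per-script tokenizer config.
 * Assumption: JDK classes replace the ICU4J ones used by the real code.
 */
final class ScriptDispatchSketch {
    // Default word iterator, used for every script without an override.
    private static final BreakIterator DEFAULT =
        BreakIterator.getWordInstance(Locale.ROOT);

    // Script-specific prototypes; the proposed patch amounts to deleting
    // the Thai and Khmer entries from a table like this one.
    private final Map<Character.UnicodeScript, BreakIterator> overrides =
        new EnumMap<>(Character.UnicodeScript.class);

    void override(Character.UnicodeScript script, BreakIterator prototype) {
        overrides.put(script, prototype);
    }

    void removeOverride(Character.UnicodeScript script) {
        overrides.remove(script);
    }

    boolean hasOverride(Character.UnicodeScript script) {
        return overrides.containsKey(script);
    }

    /** Returns a fresh clone so callers never share iterator state. */
    BreakIterator getBreakIterator(Character.UnicodeScript script) {
        BreakIterator prototype = overrides.getOrDefault(script, DEFAULT);
        return (BreakIterator) prototype.clone();
    }

    public static void main(String[] args) {
        ScriptDispatchSketch config = new ScriptDispatchSketch();
        // Placeholder custom iterator standing in for a rule file such as Khmer.brk.
        config.override(Character.UnicodeScript.KHMER,
            BreakIterator.getCharacterInstance(Locale.ROOT));
        System.out.println(config.hasOverride(Character.UnicodeScript.KHMER));
        // After the patch, Khmer simply takes the default path.
        config.removeOverride(Character.UnicodeScript.KHMER);
        System.out.println(config.hasOverride(Character.UnicodeScript.KHMER));
    }
}
```

With the override gone, getBreakIterator(Character.UnicodeScript.KHMER) hands back a clone of the default iterator, which is the behavior the patch is asking for with ICU's own default.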