[
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867889#action_12867889
]
Steven Rowe edited comment on LUCENE-2167 at 5/15/10 12:44 PM:
---------------------------------------------------------------
New patch addressing the following issues:
* On #lucene-dev, Uwe mentioned that methods in the generated scanner should be
(package) private, since, unlike the current StandardTokenizer, UAX29Tokenizer
is not hidden behind a facade class. I added JFlex's %apiprivate option to fix
this (see the grammar sketch at the end of this comment).
* Thai, Lao, Khmer, Myanmar, and other complex-context scripts' characters are
now kept together, as in the ICU UAX#29 implementation, using the rule
[:Line_Break = Complex_Context:]+.
* Added the Thai test back from Robert's TestICUTokenizer.
* Added full-width numeric characters to the {NumericEx} macro, so that they
are tokenized appropriately, just as full-width alphabetic characters now are.
I couldn't find any suitable Lao test text (mostly because I don't know Lao at
all), so I left out the Lao test from TestICUTokenizer; Robert mentioned on
#lucene that its characters are not in logical order.
*edit* Complex_Content --> Complex_Context
*edit #2* Added bullet about full-width numerics issue
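For anyone following along without the patch open, here is a rough sketch of
how these pieces fit together in a JFlex grammar. Only %apiprivate, the
[:Line_Break = Complex_Context:]+ rule, and the {NumericEx} macro name come
from this comment; the class name, macro definitions, and actions below are
illustrative guesses rather than the attached patch, and the property syntax
shown requires a JFlex build with Unicode property support:

{code}
/* Illustrative excerpt only -- not the attached patch. */
%%
%unicode
// Make the generated scanner's methods and fields private, since
// UAX29Tokenizer is not hidden behind a facade class.
%apiprivate
// Assumes the scanner class is generated directly (no facade).
%class UAX29Tokenizer

// Runs of Thai, Lao, Khmer, Myanmar, etc.
// (hypothetical macro name; property syntax as quoted in this comment)
ComplexContextEx = [:Line_Break = Complex_Context:]
// Numeric characters, now also matching the full-width digits U+FF10..U+FF19
// (hypothetical simplified definition -- the real macro is more complete)
NumericEx = [0-9\uFF10-\uFF19]

%%

// Keep each complex-context run together as a single token,
// like the ICU UAX#29 implementation.
{ComplexContextEx}+  { /* emit one token for the whole run */ }

// Numeric tokens; full-width digits now land here as well.
{NumericEx}+         { /* emit a numeric token */ }
{code}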
> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
> Key: LUCENE-2167
> URL: https://issues.apache.org/jira/browse/LUCENE-2167
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/analyzers
> Affects Versions: 3.1
> Reporter: Shyamal Prasad
> Assignee: Steven Rowe
> Priority: Minor
> Attachments: LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch,
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch
>
> Original Estimate: 0.5h
> Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere to the standard as
> closely as we can with JFlex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe
> handling can stay with that EuropeanTokenizer, and it could be used by the
> European analyzers.