[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997057#comment-14997057
 ] 

David Smiley commented on LUCENE-6874:
--------------------------------------

Sorry, I really disagree with you on this.  I don't think this
WhitespaceTokenizerFactory is hard to maintain at all.  It's true that it's
harder to maintain than before, but only because it used to be a trivial
factory, so what?  Most importantly, I think it's a better user experience:
nobody should care which specific Java Tokenizer implementation class comes
out of the factory.  It's a tokenizer on whitespace, using whatever
definition/rule of whitespace they configured.  That could hypothetically be
implemented with one Java Tokenizer class or with several, but that's an
implementation detail.
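
To make the point concrete, here is a minimal sketch of the idea, not the
patch itself and not Lucene's actual API.  The class name
WhitespaceSplitterFactory and the rule names "java" and "unicode" are
hypothetical, chosen only for illustration.  The caller picks a whitespace
rule and never sees how the splitting is implemented behind it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-in for a tokenizer factory; this is NOT Lucene's API.
class WhitespaceSplitterFactory {

    // Maps a configured rule name (names assumed for this sketch) to a
    // whitespace definition. The caller only ever sees the rule name.
    static IntPredicate whitespaceRule(String rule) {
        switch (rule) {
            case "java":
                // The JDK definition, which excludes non-breaking spaces.
                return cp -> Character.isWhitespace(cp);
            case "unicode":
                // Assumed broader definition: also accept Unicode space
                // characters, which include the non-breaking spaces.
                return cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp);
            default:
                throw new IllegalArgumentException("unknown rule: " + rule);
        }
    }

    // Splits text on whatever the configured rule considers whitespace.
    static List<String> tokenize(String text, String rule) {
        IntPredicate isWhitespace = whitespaceRule(rule);
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (isWhitespace.test(cp)) {
                // A whitespace code point ends the current token, if any.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        });
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Whether the two rules are served by one class or by two stays hidden behind
tokenize(), which is the implementation detail referred to above.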

bq. Why is the ICUWhitespace being added?

I'll remove that in a new patch; I wasn't sure what to do but it's redundant so 
no need for it.



> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses
> [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
> to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on these non-breaking spaces
> as well.  I am aware it's easy to work around, but why leave this trap in
> by default?
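
The behavior described in the issue can be checked directly against the JDK.
The following is a small sketch using only java.lang.Character.  The class
name NbspWhitespaceDemo and the helper isWhitespaceIncludingNbsp are
hypothetical, and the helper is just one possible broader definition, not
necessarily what any patch on this issue does.

```java
// Shows how the JDK classifies the three non-breaking space characters
// named in the Character.isWhitespace Javadoc excerpt.
class NbspWhitespaceDemo {

    // U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE, U+202F NARROW NO-BREAK SPACE
    static final int[] NON_BREAKING_SPACES = {0x00A0, 0x2007, 0x202F};

    // Hypothetical helper: whitespace if either JDK check accepts the code
    // point. isSpaceChar covers the Unicode space separators (including the
    // non-breaking ones); isWhitespace covers controls such as tab and newline.
    static boolean isWhitespaceIncludingNbsp(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        for (int cp : NON_BREAKING_SPACES) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b combined=%b%n",
                    cp,
                    Character.isWhitespace(cp),
                    Character.isSpaceChar(cp),
                    isWhitespaceIncludingNbsp(cp));
        }
    }
}
```

For each of the three code points, Character.isWhitespace returns false while
Character.isSpaceChar returns true, which is why a tokenizer built only on
isWhitespace keeps text joined across a non-breaking space.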



