[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985540#comment-14985540 ]
Jack Krupansky edited comment on LUCENE-6874 at 11/2/15 5:34 PM:
-----------------------------------------------------------------
+1 for using the Unicode definition of white space rather than the (odd) Java
definition. From a Solr user perspective, the fact that Java is used for
implementation under the hood should be irrelevant. That said, the Javadoc for
WhitespaceTokenizer#isTokenChar does explicitly refer to isWhitespace already.
The term "non-breaking white space" explicitly refers to line breaking and has
no mention of tokens in either Unicode or traditional casual usage.
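To make the difference between the two definitions concrete, here is a minimal sketch in plain Java (no Lucene dependency). The class name WhitespaceDefinitions and the helper isUnicodeStyleWhitespace are hypothetical names made up for illustration. The combined check is only an approximation of the Unicode White_Space property, not an exact implementation of it.

```java
public class WhitespaceDefinitions {
    // The three non-breaking spaces that Character.isWhitespace excludes,
    // per the Javadoc excerpt quoted in the issue description.
    static final int[] NON_BREAKING = {0x00A0, 0x2007, 0x202F};

    /**
     * Hypothetical helper: treats a code point as white space if either
     * JDK predicate accepts it. This approximates, but does not exactly
     * match, the Unicode White_Space property.
     */
    static boolean isUnicodeStyleWhitespace(int cp) {
        return Character.isWhitespace(cp) || Character.isSpaceChar(cp);
    }

    public static void main(String[] args) {
        for (int cp : NON_BREAKING) {
            System.out.printf(
                "U+%04X isWhitespace=%b isSpaceChar=%b combined=%b%n",
                cp,
                Character.isWhitespace(cp),
                Character.isSpaceChar(cp),
                isUnicodeStyleWhitespace(cp));
        }
    }
}
```

For each of the three code points, Character.isWhitespace returns false while Character.isSpaceChar returns true, which is the gap under discussion.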
From a Solr user perspective, there is like zero value to having NBSP from
HTML web pages being treated as if it were not traditional white space.
From a Solr user perspective, the primary use of whitespace tokenizer is to
avoid the fact that standard tokenizer breaks on various special characters
such as occur in product numbers.
One of the ongoing problems in the Solr community is the sheer amount of time
spent explaining nuances and gotchas, even if they do happen to be documented
somewhere in the fine print - no sane user reads the fine print anyway. No Solr
user actually uses WhitespaceTokenizer directly - they reference
WhitespaceTokenizerFactory, and having to drop down to the Lucene and Java
docs is way too much to ask of a typical Solr user. Our collective goal should
be to minimize nuances and gotchas (IMHO).
In short, the benefits to Solr users for NBSP being tokenized as white space
seem to outweigh any minor use cases for treating it as non-white space. A
compatibility mode can be provided if those minor use cases are considered
truly worthwhile.
Ugh... there are plenty of other places in the docs for other tokenizers and
filters that refer to "whitespace" and need to address this same issue, either
by treating NBSP as white space or by documenting the nuance/gotcha much more
thoroughly and effectively.
OTOH... an alternative view... having so many un/poorly-documented nuances and
gotchas is money in the pockets of consultants and a great argument in favor of
Solr users maximizing the employment of Solr consultants.
was (Author: jkrupan):
+1 for using the Unicode definition of white space rather than the (odd) Java
definition. From a Solr user perspective, the fact that Java is used for
implementation under the hood should be irrelevant. That said, the Javadoc for
WhitespaceTokenizer#isTokenChar does explicitly refer to isWhitespace already.
The term "non-breaking white space" explicitly refers to line breaking and has
no mention of tokens in either Unicode or traditional casual usage.
From a Solr user perspective, there is like zero value to having NBSP from
HTML web pages being treated as if it were not traditional white space.
From a Solr user perspective, the primary use of whitespace tokenizer is to
avoid the fact that standard tokenizer breaks on various special characters
such as occur in product numbers.
In short, the benefits to Solr users for NBSP being tokenized as white space
seem to outweigh any minor use cases for treating it as non-white space. A
compatibility mode can be provided if those minor use cases are considered
truly worthwhile.
> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: David Smiley
> Priority: Minor
>
> WhitespaceTokenizer uses
> [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
> to decide what is whitespace. Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to
> work around but why leave this trap in by default?
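The issue does not spell out the workaround it has in mind, so the following is only one possible shape of it, sketched in plain Java rather than against the Lucene API. The class NbspAwareTokenizerSketch and its tokenize method are hypothetical. The isTokenChar predicate here plays the same role as the isTokenChar method mentioned in the comment above, but it is not Lucene code. In a real analyzer the same predicate would presumably go into a CharTokenizer subclass.

```java
import java.util.ArrayList;
import java.util.List;

public class NbspAwareTokenizerSketch {
    /**
     * Hypothetical predicate: a code point is part of a token unless it is
     * white space by either JDK definition, so NBSP ends a token too.
     */
    static boolean isTokenChar(int cp) {
        return !(Character.isWhitespace(cp) || Character.isSpaceChar(cp));
    }

    /** Splits text into maximal runs of token characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A separator closes the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The product number keeps its '-' and '/', while the NBSP
        // (\u00A0) separates it from the following word.
        System.out.println(tokenize("AB-123/X\u00A0red widget"));
        // prints [AB-123/X, red, widget]
    }
}
```

With Character.isWhitespace alone as the separator test, "AB-123/X" and "red" would stay glued together as a single token, which is the trap the issue describes.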