[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988646#comment-14988646 ]

Jack Krupansky commented on LUCENE-6874:
----------------------------------------

bq. Because WST and WDF should really only be used as a last resort.

Absolutely agreed. From a Solr user perspective we really need a much simpler 
model for semi-standard tokens out of the box, without users having to 
scratch their heads and resort to WST in the first (last) place. LOL - maybe 
if we could eliminate this need to resort to WST, we wouldn't have to fret as 
much about WST.
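
To make that concrete, here is roughly the kind of chain users end up building 
today. This is just a minimal sketch in Java (package names as of 5.x; the WDF 
flags shown are one common combination, not a recommendation):

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;

// Sketch of the WhitespaceTokenizer + WordDelimiterFilter combination
// people fall back on when StandardTokenizer strips too much.
public class WstWdfAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    // keep the original token and also emit word/number parts; WDF is
    // what ends up cleaning the punctuation WST left attached
    int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
              | WordDelimiterFilter.GENERATE_NUMBER_PARTS
              | WordDelimiterFilter.PRESERVE_ORIGINAL;
    return new TokenStreamComponents(source,
        new WordDelimiterFilter(source, flags, null));
  }
}
{code}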

bq.  I generally suggest to my users to use ClassicTokenizer

Personally, I've always refrained from recommending CT since I thought ST was 
supposed to replace it and that the email and URL support was considered an 
excess not worth keeping. I've treated CT as if it were deprecated (which it 
is not), and I never see anybody else recommending it on the user list. Also, 
the fact that it can't handle slashes in product numbers is a deal killer. I'm 
not sure that I would argue in favor of resurrecting CT as a first-class 
recommendation, especially since it can't handle non-European languages, but...
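
(If anyone wants to check the slash behavior for themselves, a little probe 
like this prints what each tokenizer emits for a slashed product number; the 
sample string is made up and I'm not asserting the output here, just the 
method:)

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerProbe {
  static void dump(Tokenizer tok, String text) throws Exception {
    tok.setReader(new StringReader(text));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term);  // one token per line
    }
    tok.end();
    tok.close();
  }

  public static void main(String[] args) throws Exception {
    dump(new StandardTokenizer(), "part ABC-123/XYZ");  // ST
    dump(new ClassicTokenizer(), "part ABC-123/XYZ");   // CT
  }
}
{code}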

That said, I do think it is worth separately (from this Jira) considering a 
fresh, new tokenizer that starts with the goodness of ST and adds in an 
approximation of the reasons that people resort to WST. Whether that can be an 
option on ST or has to be a separate tokenizer would need to be debated. I'd 
prefer an option on ST, either to simply allow embedded special characters or 
to specify a list or regex of special characters to be allowed or excluded.
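
As a very rough sketch of what the "list of allowed special characters" option 
could look like (a hypothetical class, not anything that exists today; it 
builds on CharTokenizer, so it only approximates ST rather than extending it):

{code:java}
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.util.CharTokenizer;

// Hypothetical: keep letters/digits plus a configurable set of embedded
// special characters (e.g. "-/." for product numbers).
public class KeepSpecialCharsTokenizer extends CharTokenizer {
  private final Set<Integer> keep = new HashSet<>();

  public KeepSpecialCharsTokenizer(String specialChars) {
    specialChars.codePoints().forEach(keep::add);
  }

  @Override
  protected boolean isTokenChar(int c) {
    return Character.isLetterOrDigit(c) || keep.contains(c);
  }
}
{code}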

People would still need to combine NewT with WDF, but at least the tokenization 
would be more explicit.

Personally I would prefer to see an option for whether to retain or strip 
external punctuation vs. embedded special characters. Trailing periods and 
commas and colons and enclosing parentheses are just the kinds of things we 
had to resort to WDF for when using WST to retain embedded special characters.
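
Something along these lines is what I have in mind for the option to strip 
external punctuation (again a hypothetical sketch, with "punctuation" 
approximated as anything that isn't a letter or digit):

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical: trim leading/trailing punctuation (periods, commas, colons,
// enclosing parentheses, ...) but leave embedded special characters alone.
public final class TrimOuterPunctuationFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public TrimOuterPunctuationFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    char[] buf = termAtt.buffer();
    int start = 0, end = termAtt.length();
    while (start < end && !Character.isLetterOrDigit(buf[start])) start++;
    while (end > start && !Character.isLetterOrDigit(buf[end - 1])) end--;
    if (start > 0) {
      System.arraycopy(buf, start, buf, 0, end - start);
    }
    termAtt.setLength(end - start);  // a pure-punctuation token becomes empty
    return true;
  }
}
{code}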

And if people really want to be ambitious, a totally new tokenizer that 
subsumed the good parts of WDF would make the lives of a lot of Solr users 
much easier.

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?
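
(For anyone who wants to confirm the quoted Javadoc, this is easy to check 
with a one-off main; the literal below is U+00A0, the no-break space:)

{code:java}
public class NbspCheck {
  public static void main(String[] args) {
    // false: isWhitespace() explicitly excludes non-breaking spaces,
    // so the current WhitespaceTokenizer keeps NBSP inside tokens
    System.out.println(Character.isWhitespace('\u00A0'));
    // true: isSpaceChar() reports NBSP as a Unicode SPACE_SEPARATOR
    System.out.println(Character.isSpaceChar('\u00A0'));
  }
}
{code}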


