[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988646#comment-14988646
]
Jack Krupansky commented on LUCENE-6874:
----------------------------------------
bq. Because WST and WDF should really only be used as a last resort.
Absolutely agreed. From a Solr user's perspective, we really need a much simpler
model for semi-standard tokens out of the box, without the user having to
scratch their heads and resort to WST in the first (last) place. LOL - maybe if
we could eliminate this need to resort to WST, we wouldn't have to fret as much
about WST.
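To make the trade-off concrete, here's a rough sketch (my own example, not from any patch on this issue) of why people end up at WST: ST shreds a slashed product number into pieces, while WST keeps it intact but drags trailing punctuation along with it.
{code:java}
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerTradeoffSketch {
  public static void main(String[] args) throws IOException {
    String text = "Order part AB-123/456 today.";
    dump(new StandardTokenizer(), text);   // [Order] [part] [AB] [123] [456] [today]
    dump(new WhitespaceTokenizer(), text); // [Order] [part] [AB-123/456] [today.]
  }

  private static void dump(Tokenizer tok, String text) throws IOException {
    tok.setReader(new StringReader(text));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.print("[" + term + "] ");
    }
    System.out.println();
    tok.end();
    tok.close();
  }
}
{code}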
bq. I generally suggest to my users to use ClassicTokenizer
Personally, I've always refrained from recommending CT since I thought ST was
supposed to replace it and that the email and URL support was considered an
excess not worth keeping. I've treated CT as if it were deprecated (which it is
not), and I never see anybody else recommending it on the user list. Also, the
fact that it can't handle slashes in product numbers is a deal killer. I'm
not sure that I would argue in favor of resurrecting CT as a first-class
recommendation, especially since it can't handle non-European languages, but...
That said, I do think it is worth separately (from this Jira) considering a
fresh, new tokenizer that starts with the goodness of ST and adds an
approximation of the behavior that people resort to WST for. Whether that can be
an option on ST or has to be a separate tokenizer would need to be debated. I'd
prefer an option on ST, either to simply allow embedded special characters or
to specify a list or regex of special characters to be allowed or excluded.
People would still need to combine NewT with WDF, but at least the tokenization
would be more explicit.
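For reference, the combination people have to wire up today looks roughly like this (a sketch only; the flag choices are illustrative, not a recommendation): WST to keep the raw token together, then WDF to generate the parts and optionally preserve the original.
{code:java}
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WstPlusWdfSketch {
  public static void main(String[] args) throws IOException {
    WhitespaceTokenizer tok = new WhitespaceTokenizer();
    tok.setReader(new StringReader("part AB-123/456"));
    int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
              | WordDelimiterFilter.GENERATE_NUMBER_PARTS
              | WordDelimiterFilter.PRESERVE_ORIGINAL; // keep "AB-123/456" as well as its parts
    TokenStream ts = new WordDelimiterFilter(tok, flags, null);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}
{code}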
Personally, I would prefer to see an option for whether to retain or strip
external punctuation vs. embedded special characters. Trailing periods, commas,
and colons, and enclosing parentheses are just the kinds of things we had to
resort to WDF for when using WST to retain embedded special characters. And if
people really want to be ambitious, a totally new tokenizer that subsumed the
good parts of WDF would make the lives of a lot of Solr users much easier.
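To sketch what I mean by the external-punctuation option (purely hypothetical code, not an existing filter and not something attached to this issue), a trivial filter behind a whitespace-style tokenizer could trim leading/trailing punctuation while leaving embedded special characters alone:
{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical: trims "foo," or "(bar)" down to "foo" and "bar" but keeps "AB-123/456" intact.
public final class TrimExternalPunctuationFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public TrimExternalPunctuationFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    char[] buf = termAtt.buffer();
    int start = 0;
    int end = termAtt.length();
    while (start < end && isExternalPunct(buf[start])) start++;
    while (end > start && isExternalPunct(buf[end - 1])) end--;
    if (start > 0) {
      System.arraycopy(buf, start, buf, 0, end - start);
    }
    // A real filter would also drop tokens that end up empty and adjust offsets.
    termAtt.setLength(end - start);
    return true;
  }

  private static boolean isExternalPunct(char c) {
    // Trailing periods, commas, colons, and enclosing parentheses/brackets.
    return c == '.' || c == ',' || c == ':' || c == ';'
        || c == '(' || c == ')' || c == '[' || c == ']';
  }
}
{code}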
> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: David Smiley
> Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
> to decide what is whitespace. Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to
> work around but why leave this trap in by default?
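A quick check of the quoted behavior (illustration only, not part of the original report): NBSP is a SPACE_SEPARATOR but is excluded by Character.isWhitespace, which is why WhitespaceTokenizer currently won't split on it.
{code:java}
public class NbspWhitespaceCheck {
  public static void main(String[] args) {
    char nbsp = '\u00A0';
    System.out.println(Character.isWhitespace(nbsp)); // false: excluded as a non-breaking space
    System.out.println(Character.isSpaceChar(nbsp));  // true: it is a SPACE_SEPARATOR
    System.out.println(Character.isWhitespace(' '));  // true: ordinary space
  }
}
{code}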