[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002248#comment-15002248 ]
Uwe Schindler commented on LUCENE-6874:
---------------------------------------
Here is the output of the Reuters test:
{noformat}
------------> Report Sum By (any) Name and Round (28 about 33 out of 34)
Operation                                                                           round  runCnt  recsPerRun       rec/s  elapsedSec  avgUsedMem  avgTotalMem
AnalyzerFactory(name:WhitespaceTokenizer,WhitespaceTokenizer(rule:java))                0       1           0        0.00        0.00   9,569,344  124,256,256
AnalyzerFactory(name:UnicodeWhitespaceTokenizer,WhitespaceTokenizer(rule:unicode))      0       1           0        0.00        0.00   9,569,344  124,256,256
Rounds_5                                                                                0       1    24493540  360,841.19       67.88  16,566,472  124,256,256
NewAnalyzer(WhitespaceTokenizer)                                                        0       1           0        0.00        0.00   9,569,344  124,256,256
[Character.isWhitespace()] WhitespaceTokenizer                                          0       1     2449354  331,038.53        7.40  22,121,256  124,256,256
Seq_20000                                                                               0       2     2449354  344,131.22       14.23  22,121,256  118,489,088
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 0       1           0        0.00        0.00  22,121,256  112,721,920
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              0       1     2449354  358,302.22        6.84  22,121,256  112,721,920
NewAnalyzer(WhitespaceTokenizer)                                                        1       1           0        0.00        0.00  12,138,024  112,721,920
[Character.isWhitespace()] WhitespaceTokenizer                                          1       1     2449354  366,724.66        6.68  22,374,536  112,721,920
Seq_20000                                                                               1       2     2449354  365,139.25       13.42  27,477,352  117,702,656
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 1       1           0        0.00        0.00  22,374,536  111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              1       1     2449354  363,567.47        6.74  32,580,168  122,683,392
NewAnalyzer(WhitespaceTokenizer)                                                        2       1           0        0.00        0.00  32,580,168  122,683,392
[Character.isWhitespace()] WhitespaceTokenizer                                          2       1     2449354  365,793.59        6.70  33,461,280  122,683,392
Seq_20000                                                                               2       2     2449354  365,112.03       13.42  33,461,280  117,178,368
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 2       1           0        0.00        0.00  33,461,280  111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              2       1     2449354  364,432.97        6.72  33,461,280  111,673,344
NewAnalyzer(WhitespaceTokenizer)                                                        3       1           0        0.00        0.00  10,836,464  111,673,344
[Character.isWhitespace()] WhitespaceTokenizer                                          3       1     2449354  367,660.47        6.66  12,451,400  111,673,344
Seq_20000                                                                               3       2     2449354  365,820.94       13.39  13,235,672  111,673,344
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 3       1           0        0.00        0.00  12,451,400  111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              3       1     2449354  363,999.69        6.73  14,019,944  111,673,344
NewAnalyzer(WhitespaceTokenizer)                                                        4       1           0        0.00        0.00  14,019,944  111,673,344
[Character.isWhitespace()] WhitespaceTokenizer                                          4       1     2449354  367,329.62        6.67  15,061,368  111,673,344
Seq_20000                                                                               4       2     2449354  365,057.59       13.42  15,813,920  111,673,344
NewAnalyzer(UnicodeWhitespaceTokenizer)                                                 4       1           0        0.00        0.00  15,061,368  111,673,344
[UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer                              4       1     2449354  362,813.50        6.75  16,566,472  111,673,344
{noformat}
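As you can see, both tokenizers run at almost the same speed: in rounds 1 to 4, WhitespaceTokenizer reaches about 366,000 to 368,000 rec/s and UnicodeWhitespaceTokenizer about 363,000 to 364,000 rec/s, a difference of roughly 1%.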
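
To put a number on "almost the same speed", the rec/s values from the report can be averaged per tokenizer. The sketch below is only an illustration: the class and method names are made up for this example, and leaving out round 0 is an assumption (its lower figures may reflect JIT warm-up, which the report itself does not state).

```java
import java.util.Arrays;

class ThroughputComparison {

    // rec/s for rounds 1 to 4, copied from the benchmark report.
    // Round 0 is left out on the assumption that it includes warm-up.
    static final double[] JAVA_WS = {366724.66, 365793.59, 367660.47, 367329.62};
    static final double[] UNICODE_WS = {363567.47, 364432.97, 363999.69, 362813.50};

    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    // Fraction by which the Unicode variant trails the
    // Character.isWhitespace() variant, based on the mean rec/s.
    static double relativeGap() {
        double javaMean = mean(JAVA_WS);
        double unicodeMean = mean(UNICODE_WS);
        return (javaMean - unicodeMean) / javaMean;
    }

    public static void main(String[] args) {
        System.out.printf("mean rec/s: %.2f vs %.2f, gap %.2f%%%n",
                mean(JAVA_WS), mean(UNICODE_WS), relativeGap() * 100);
    }
}
```

With four rounds per tokenizer this is a rough comparison, not a statistically sound one.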
> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: David Smiley
> Priority: Minor
> Attachments: LUCENE-6874-chartokenizer.patch,
> LUCENE-6874-chartokenizer.patch, LUCENE-6874-jflex.patch, LUCENE-6874.patch,
> LUCENE_6874_jflex.patch, icu-datasucker.patch, unicode-ws-tokenizer.patch,
> unicode-ws-tokenizer.patch, unicode-ws-tokenizer.patch
>
>
> WhitespaceTokenizer uses
> [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
> to decide what is whitespace. Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to
> work around but why leave this trap in by default?
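
The trap described in the issue can be reproduced with the JDK alone. The following is a minimal sketch, not Lucene code: `split` and `isWhitespaceOrNbsp` are hypothetical helpers written for this illustration, and the second predicate simply adds the three non-breaking spaces from the Javadoc excerpt; it is not the full Unicode White_Space property.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class NbspSplitDemo {

    // Hypothetical predicate: Character.isWhitespace plus the three
    // non-breaking spaces that its Javadoc explicitly excludes.
    static boolean isWhitespaceOrNbsp(int cp) {
        return Character.isWhitespace(cp)
                || cp == 0x00A0 || cp == 0x2007 || cp == 0x202F;
    }

    // Splits text into tokens at every code point the predicate accepts.
    static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator.test(cp)) {
                // A separator ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "foo\u00A0bar baz";
        // NBSP is not whitespace for Character.isWhitespace: 2 tokens.
        System.out.println(split(text, Character::isWhitespace).size());
        // With the extended predicate: [foo, bar, baz]
        System.out.println(split(text, NbspSplitDemo::isWhitespaceOrNbsp));
    }
}
```

With `Character::isWhitespace`, "foo" and "bar" stay glued together by the NBSP, which is the behavior the issue calls a trap.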