Hi David,

While I agree on the quirkiness, at least it's documented (and the method is probably kept as-is for backwards compatibility reasons); the first bullet point of the corresponding Java SE 7 page says: "It is a Unicode space character [...] but is not also a non-breaking space" [1].
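To make the difference concrete, here is a minimal sketch using only the JDK. The class name SpaceCheck and the helper isAnySpace are hypothetical names of mine, not anything from Lucene or the JDK; the helper just accepts a code point if either of the two Character methods does.

```java
// Sketch only: SpaceCheck and isAnySpace are hypothetical names.
class SpaceCheck {

    // True if the code point counts as a space under either JDK definition.
    static boolean isAnySpace(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        int nbsp = 0x00A0; // NO-BREAK SPACE

        System.out.println(Character.isSpaceChar(nbsp));  // true
        System.out.println(Character.isWhitespace(nbsp)); // false
        System.out.println(isAnySpace(nbsp));             // true
        System.out.println(isAnySpace('\t'));             // true (whitespace, not a space char)
        System.out.println(isAnySpace('a'));              // false
    }
}
```

A tokenizer that should split on both kinds of character could treat anything for which such a check returns true as a non-token character.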
You can still override the isTokenChar method of WhitespaceTokenizer or CharTokenizer in a subclass to exclude an extra set of characters from the allowed range. If you are using Google's Guava library in your project, it has a character-matching predicate class that follows the Unicode specification more closely [2]; this can also be used in isTokenChar as a replacement.

[1] http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(char)
[2] http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/base/CharMatcher.html#WHITESPACE

On Fri, Oct 30, 2015 at 9:10 PM, david.w.smi...@gmail.com <david.w.smi...@gmail.com> wrote:

> One would think that all “space characters” are by definition
> “whitespace”. Not true!:
> http://www.fileformat.info/info/unicode/char/00a0/index.htm
>
> So I’m working on an app where I can no longer use WhitespaceTokenizer
> since I need to check for isSpaceChar *OR* isWhitespace. Alternatively I
> could use MappingCharFilter, I realize.
>
> This had trickle-down effects on a search platform I’m working on that was
> triggered by a user’s search. It’s caused all sorts of head-scratching
> till we discovered what’s going on.
>
> Craziness.
>
> ~ David
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com

--
András Péteri