(opennlp) 01/01: OPENNLP-1850 Document Unicode normalization and the UAX #29 tokenizer

kristian Sat, 20 Jun 2026 05:36:41 -0700

This is an automated email from the ASF dual-hosted git repository.

krickert pushed a commit to branch OPENNLP-1850-4-docs
in repository https://gitbox.apache.org/repos/asf/opennlp.git


commit 3037db7b21d1acfe96102f5be3baeb8e22afa5e9
Author: Kristian Rickert <[email protected]>
AuthorDate: Sat Jun 20 08:06:57 2026 -0400

    OPENNLP-1850 Document Unicode normalization and the UAX #29 tokenizer
    
    Adds the normalizer manual chapter and updates the tokenizer, doccat, namefinder,
    and introduction chapters (and the master opennlp.xml) to cover the new
    normalization pipeline and word tokenizer.
---
 opennlp-docs/src/docbkx/doccat.xml       |  18 ++
 opennlp-docs/src/docbkx/introduction.xml |   3 +-
 opennlp-docs/src/docbkx/namefinder.xml   |  27 +-
 opennlp-docs/src/docbkx/normalizer.xml   | 532 +++++++++++++++++++++++++++++++
 opennlp-docs/src/docbkx/opennlp.xml      |   1 +
 opennlp-docs/src/docbkx/tokenizer.xml    |  91 +++++-
 6 files changed, 669 insertions(+), 3 deletions(-)

diff --git a/opennlp-docs/src/docbkx/doccat.xml b/opennlp-docs/src/docbkx/doccat.xml
index 7d03f1c2a..e12186ec4 100644
--- a/opennlp-docs/src/docbkx/doccat.xml
+++ b/opennlp-docs/src/docbkx/doccat.xml
@@ -171,6 +171,24 @@ String category = 
myCategorizer.getBestCategory(outcomes);]]>
                                </programlisting>
                                For additional examples, refer to the 
<code>DocumentCategorizerDLEval</code> class.
                        </para>
+                       <para>
+                               Like <code>NameFinderDL</code>, long input is 
split into overlapping chunks on the full
+                               Unicode <code>White_Space</code> set rather 
than Java's <code>\s</code>, so text copied
+                               from PDFs, the web, or multilingual sources 
tokenizes consistently. Optional
+                               preprocessing through 
<code>InferenceOptions</code> is off by default:
+                               <code>setNormalizeWhitespace(true)</code> maps 
each Unicode whitespace code point to an
+                               ASCII space, and 
<code>setNormalizeDashes(true)</code> maps Unicode dashes to the ASCII
+                               hyphen-minus. Both are one-to-one replacements 
that preserve character offsets. See
+                               <xref linkend="tools.normalizer"/> for the 
shared <code>CharClass</code> engine and the
+                               full normalization library.
+                       </para>
+                       <programlisting language="java">
+<![CDATA[InferenceOptions options = new InferenceOptions();
+options.setNormalizeWhitespace(true);
+options.setNormalizeDashes(true);
+DocumentCategorizerDL categorizer = new DocumentCategorizerDL(
+    model, vocab, categories, scoringStrategy, options);]]>
+                       </programlisting>
                </section>
        </section>
 
diff --git a/opennlp-docs/src/docbkx/introduction.xml b/opennlp-docs/src/docbkx/introduction.xml
index e7ac5c7c3..82e53cccb 100644
--- a/opennlp-docs/src/docbkx/introduction.xml
+++ b/opennlp-docs/src/docbkx/introduction.xml
@@ -303,7 +303,8 @@ Arguments description:
                 and <xref linkend="tools.doccat">Document Categorizer</xref>. 
This allows models trained by other frameworks
                 such as PyTorch and Tensorflow to be used by OpenNLP. The 
documentation for
                 each of the OpenNLP components that supports ONNX models 
describes how to
-                use ONNX models for inference.
+                use ONNX models for inference. DL inference uses Unicode-aware 
text chunking and
+                optional input normalization; see <xref 
linkend="tools.normalizer.dl"/>.
             </para>
             <note>
                 <para>
diff --git a/opennlp-docs/src/docbkx/namefinder.xml b/opennlp-docs/src/docbkx/namefinder.xml
index ff695d898..6c2c759c0 100644
--- a/opennlp-docs/src/docbkx/namefinder.xml
+++ b/opennlp-docs/src/docbkx/namefinder.xml
@@ -157,11 +157,36 @@ Span[] nameSpans = nameFinder.find(sentence);]]>
 File vocab = new File("/path/to/vocab.txt");
 Map<Integer, String> categories = new HashMap<>();
 String[] tokens = new String[]{"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
-NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, false, getIds2Labels());
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, getIds2Labels(), sentenceDetector);
 Span[] spans = nameFinderDL.find(tokens);]]>
                                        </programlisting>
                                        For additional examples, refer to the 
<code>NameFinderDLEval</code> class.
                                </para>
+                               <para>
+                                       Long input text is split into 
overlapping chunks on the full Unicode
+                                       <code>White_Space</code> set before 
WordPiece tokenization, so spacing such as a
+                                       no-break space or the CJK ideographic 
space is recognized as a delimiter. After
+                                       inference, reconstructed entity text is 
matched back to the caller's original input
+                                       with a Unicode-aware cursor scan (not a 
regular expression), so
+                                       <code>Span#getCoveredText(...)</code> 
returns the source text even when WordPiece
+                                       rejoins sub-tokens with spaces or when 
the source uses non-ASCII whitespace between
+                                       tokens.
+                               </para>
+                               <para>
+                                       Optional preprocessing of the joined 
input text is available through
+                                       <code>InferenceOptions</code> and is 
off by default:
+                                       
<code>setNormalizeWhitespace(true)</code> folds each Unicode whitespace 
character to
+                                       an ASCII space, and 
<code>setNormalizeDashes(true)</code> folds Unicode dashes to the
+                                       ASCII hyphen-minus. Both transforms are 
one code point to one character and preserve
+                                       offsets. Full details, the underlying 
<code>CharClass</code> engine, and the broader
+                                       normalization pipeline are documented 
in <xref linkend="tools.normalizer"/>.
+                               </para>
+                               <programlisting language="java">
+<![CDATA[InferenceOptions options = new InferenceOptions();
+options.setNormalizeWhitespace(true);
+options.setNormalizeDashes(true);
+NameFinderDL finder = new NameFinderDL(model, vocab, ids2Labels, options, sentenceDetector);]]>
+                               </programlisting>
                        </section>
        </section>
        </section>
diff --git a/opennlp-docs/src/docbkx/normalizer.xml b/opennlp-docs/src/docbkx/normalizer.xml
new file mode 100644
index 000000000..55376f538
--- /dev/null
+++ b/opennlp-docs/src/docbkx/normalizer.xml
@@ -0,0 +1,532 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V5.0//EN"
+"https://cdn.docbook.org/schema/5.0/dtd/docbook.dtd"[
+]>
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more 
contributor
+       license agreements. See the NOTICE file distributed with this work for 
additional
+       information regarding copyright ownership. The ASF licenses this file to
+       you under the Apache License, Version 2.0 (the "License"); you may not 
use
+       this file except in compliance with the License. You may obtain a copy 
of
+       the License at http://www.apache.org/licenses/LICENSE-2.0 Unless 
required
+       by applicable law or agreed to in writing, software distributed under 
the
+       License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR 
CONDITIONS
+       OF ANY KIND, either express or implied. See the License for the specific
+       language governing permissions and limitations under the License. -->
+
+<chapter xml:id="tools.normalizer">
+
+       <title>Text Normalization</title>
+
+       <section xml:id="tools.normalizer.introduction">
+               <title>Introduction</title>
+               <para>
+                       The package <code>opennlp.tools.util.normalizer</code> 
provides Unicode-aware text
+                       normalization for matching, search, and tokenization 
preprocessing. It cleans up the
+                       kinds of inconsistency that real text carries when it 
is copied from the web, PDFs,
+                       office documents, or multilingual sources: spacing that 
is not an ordinary space, the
+                       many dash and quotation variants, decomposed versus 
precomposed accents, non-ASCII
+                       digits, and invisible control characters.
+               </para>
+               <para>
+                       The implementation follows three principles:
+               </para>
+               <itemizedlist>
+                       <listitem>
+                               <para>
+                                       <emphasis 
role="bold">Standards-sourced.</emphasis> Membership sets come from the
+                                       Unicode Character Database (for example 
the <code>White_Space</code> and
+                                       <code>Dash</code> properties), not from 
the JVM's locale-dependent or quirky
+                                       character predicates. The library never 
relies on
+                                       <code>Character.isWhitespace</code>, 
which disagrees with the Unicode standard.
+                               </para>
+                       </listitem>
+                       <listitem>
+                               <para>
+                                       <emphasis role="bold">Cursor-based, no 
regular expressions.</emphasis> Every
+                                       operation is a single forward pass over 
the input that tests membership in O(1)
+                                       and advances by code point. This avoids 
the allocation and the catastrophic
+                                       backtracking (ReDoS) risk of regular 
expressions, and it correctly recognizes
+                                       Unicode characters that Java's 
<code>\s</code> does not.
+                               </para>
+                       </listitem>
+                       <listitem>
+                               <para>
+                                       <emphasis 
role="bold">Offset-preserving.</emphasis> The original text is always
+                                       the source of truth. Normalization 
produces a derived form for matching while the
+                                       original character offsets are kept, so 
a search hit can be reported and
+                                       highlighted against the source even 
when the normalized form has a different
+                                       length.
+                               </para>
+                       </listitem>
+               </itemizedlist>
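The disagreement between the JDK predicates and the Unicode White_Space property can be checked with the JDK alone. The class below is an illustrative sketch, not part of OpenNLP: the class and method names are made up for this example, and the regular expression is used only as a reference oracle for the White_Space property, not as a model of how the library tests membership.

```java
import java.util.regex.Pattern;

final class WhitespacePredicateDemo {
    // Reference oracle only: java.util.regex exposes the Unicode White_Space binary property.
    private static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    /** What the JVM's character predicate says. */
    static boolean jdkIsWhitespace(int cp) {
        return Character.isWhitespace(cp);
    }

    /** What Java's default (non-Unicode) \s class says. */
    static boolean matchesBackslashS(int cp) {
        return new String(Character.toChars(cp)).matches("\\s");
    }

    /** What the Unicode Character Database says. */
    static boolean hasWhiteSpaceProperty(int cp) {
        return WHITE_SPACE.matcher(new String(Character.toChars(cp))).matches();
    }

    public static void main(String[] args) {
        // no-break space, next line, ideographic space, file separator
        int[] samples = {0x00A0, 0x0085, 0x3000, 0x001C};
        for (int cp : samples) {
            System.out.printf("U+%04X isWhitespace=%b backslash-s=%b White_Space=%b%n",
                cp, jdkIsWhitespace(cp), matchesBackslashS(cp), hasWhiteSpaceProperty(cp));
        }
    }
}
```

The no-break space is the usual surprise: it carries the White_Space property, yet both JDK checks reject it, while the file separator control is accepted by Character.isWhitespace although it is not White_Space.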
+               <para>
+                       Two engines underpin everything: the 
<code>CharSequenceNormalizer</code> family offers
+                       ready-made, composable normalizers, and the 
<code>CharClass</code> engine is the low-level,
+                       configurable building block they are made of. Built on 
these are three higher-level
+                       features documented below: a layered term model that 
projects a token through a
+                       configurable stack of transforms while keeping every 
intermediate form (see
+                       <xref linkend="tools.normalizer.term"/>), per-language 
profiles that select the transforms
+                       appropriate to a language (see <xref 
linkend="tools.normalizer.language"/>), and confusable
+                       folding that reduces lookalike characters for matching 
(see
+                       <xref linkend="tools.normalizer.confusables"/>).
+               </para>
+       </section>
+
+       <section xml:id="tools.normalizer.normalizers">
+               <title>The normalizer family</title>
+               <para>
+                       Each normalizer implements the existing
+                       
<code>opennlp.tools.util.normalizer.CharSequenceNormalizer</code> interface
+                       (<code>CharSequence normalize(CharSequence)</code>) and 
is a shared, stateless singleton
+                       obtained through <code>getInstance()</code>. They can 
therefore be combined with the
+                       existing <code>AggregateCharSequenceNormalizer</code>, 
or with the
+                       <code>TextNormalizer</code> builder described below.
+               </para>
+
+               <informaltable frame="all">
+                       <tgroup cols="2">
+                               <thead>
+                                       <row>
+                                               <entry>Normalizer</entry>
+                                               <entry>Effect</entry>
+                                       </row>
+                               </thead>
+                               <tbody>
+                                       <row>
+                                               
<entry><code>WhitespaceCharSequenceNormalizer</code></entry>
+                                               <entry>Collapses each run of 
Unicode whitespace to a single ASCII space and
+                                                       trims the edges.</entry>
+                                       </row>
+                                       <row>
+                                               
<entry><code>DashCharSequenceNormalizer</code></entry>
+                                               <entry>Maps every Unicode dash 
to the ASCII hyphen-minus. The mathematical
+                                                       minus signs and the 
soft hyphen are not affected.</entry>
+                                       </row>
+                                       <row>
+                                               
<entry><code>QuoteCharSequenceNormalizer</code></entry>
+                                               <entry>Folds typographic single 
quotes and apostrophes to <code>'</code> and
+                                                       double quotes 
(including guillemets) to <code>"</code>.</entry>
+                                       </row>
+                                       <row>
+                                               
<entry><code>DigitCharSequenceNormalizer</code></entry>
+                                               <entry>Maps Unicode decimal 
digits (Arabic-Indic, Devanagari, fullwidth, ...)
+                                                       to ASCII 
<code>0</code>-<code>9</code> by their numeric value.</entry>
+                                       </row>
+                                       <row>
+                                               
<entry><code>EllipsisCharSequenceNormalizer</code></entry>
+                                               <entry>Expands the horizontal 
ellipsis to <code>...</code> and the two-dot
+                                                       leader to 
<code>..</code></entry>
+                                       </row>
+                                       <row>
+                                               
<entry><code>BulletCharSequenceNormalizer</code></entry>
+                                               <entry>Replaces unambiguous 
list bullets with a space; the Catalan middle dot
+                                                       is left alone.</entry>
+                                       </row>
+                                       <row>
+                                               
<entry><code>InvisibleCharSequenceNormalizer</code></entry>
+                                               <entry>Removes invisible format 
and bidirectional control characters (BOM,
+                                                       zero width space, bidi 
marks/overrides/isolates, ...). The zero width
+                                                       joiner and non-joiner 
and variation selectors are kept.</entry>
+                                       </row>
+                                       <row>
+                                               
<entry><code>NfcCharSequenceNormalizer</code></entry>
+                                               <entry>Applies Unicode 
Normalization Form C (canonical composition); a safe,
+                                                       lossless baseline for 
matching.</entry>
+                                       </row>
+                                       <row>
+                                               
<entry><code>NfkcCharSequenceNormalizer</code></entry>
+                                               <entry>Applies Unicode 
Normalization Form KC (compatibility composition);
+                                                       folds fullwidth forms, 
ligatures, and super/subscripts.</entry>
+                                       </row>
+                                       <row>
+                                               
<entry><code>CaseFoldCharSequenceNormalizer</code></entry>
+                                               <entry>Lowercases text for case-insensitive matching, using
+                                                       
<code>Locale.ROOT</code>.</entry>
+                                       </row>
+                                       <row>
+                                               
<entry><code>AccentFoldCharSequenceNormalizer</code></entry>
+                                               <entry>Folds diacritics in a 
script-aware way (see below).</entry>
+                                       </row>
+                                       <row>
+                                               
<entry><code>GermanUmlautCharSequenceNormalizer</code></entry>
+                                               <entry>Transliterates German 
umlauts and the eszett (a-umlaut to <code>ae</code>,
+                                                       eszett to 
<code>ss</code>; DIN 5007-2).</entry>
+                                       </row>
+                                       <row>
+                                               
<entry><code>ConfusableSkeletonCharSequenceNormalizer</code></entry>
+                                               <entry>Reduces lookalike 
characters to a confusable skeleton for matching
+                                                       (UTS #39); see 
below.</entry>
+                                       </row>
+                               </tbody>
+                       </tgroup>
+               </informaltable>
+
+               <para>
+                       A single normalizer is applied directly:
+               </para>
+               <programlisting language="java">
+<![CDATA[CharSequenceNormalizer ws = 
WhitespaceCharSequenceNormalizer.getInstance();
+String clean = ws.normalize("a 　b").toString();   // "a b"
+
+String hyphen = DashCharSequenceNormalizer.getInstance()
+    .normalize("state—of–the‐art").toString(); // "state-of-the-art"]]>
+               </programlisting>
+       </section>
+
+       <section xml:id="tools.normalizer.pipeline">
+               <title>Composing a pipeline</title>
+               <para>
+                       <code>TextNormalizer</code> is a fluent builder that composes normalization steps, in the order
+                       they are added, into a single <code>CharSequenceNormalizer</code>:
+               </para>
+               <programlisting language="java">
+<![CDATA[CharSequenceNormalizer pipeline = TextNormalizer.builder()
+    .nfc()
+    .caseFold()
+    .accentFold()
+    .build();
+
+String term = pipeline.normalize("CAFÉ").toString();   // "cafe"]]>
+               </programlisting>
+               <para>
+                       A conservative search-oriented chain (strip invisibles, 
NFC, collapse whitespace, fold
+                       quotes and dashes, case fold, then script-gated accent 
fold) is available directly:
+               </para>
+               <programlisting language="java">
+<![CDATA[CharSequenceNormalizer search = TextNormalizer.searchDefault();
+
+// byte order mark stripped, curly quotes folded, case and accent folded
+String t = search.normalize("“CafÉ”").toString();   // "\"cafe\""]]>
+               </programlisting>
+               <para>
+                       Any custom <code>CharSequenceNormalizer</code> can be 
inserted with
+                       <code>with(...)</code>. The <code>TextNormalizer</code> 
pipeline and the individual
+                       <code>CharSequenceNormalizer</code> implementations are 
not applied automatically by
+                       statistical OpenNLP components; callers compose them 
explicitly when preprocessing text
+                       for search or matching. The DL components described in 
the next section use a narrower,
+                       built-in subset of this machinery.
+               </para>
+       </section>
+
+       <section xml:id="tools.normalizer.dl">
+               <title>Use in DL components</title>
+               <para>
+                       <code>NameFinderDL</code> and 
<code>DocumentCategorizerDL</code> share Unicode-aware text
+                       handling through <code>AbstractDL</code>. Long inputs 
are split into overlapping chunks
+                       on the full Unicode <code>White_Space</code> set 
(no-break space, ideographic space, line
+                       and paragraph separators, and the other members listed 
under
+                       <xref linkend="tools.normalizer.reference"/>), not on 
Java's six-character
+                       <code>\s</code> subset. Empty tokens from leading, 
trailing, or repeated whitespace are
+                       not produced.
+               </para>
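A minimal sketch of splitting on the full White_Space set with a forward cursor, assuming nothing about the actual AbstractDL code: the class and method names here are hypothetical, and the membership test is a hand-written copy of the 25 White_Space code points rather than the library's CodePointSet. It shows why no empty tokens appear: a token span is only opened on a non-whitespace code point and only closed on the next whitespace code point.

```java
import java.util.ArrayList;
import java.util.List;

final class WhiteSpaceSplitSketch {
    /** Unicode White_Space membership, written out by hand for this sketch. */
    static boolean isWhiteSpace(int cp) {
        return (cp >= 0x0009 && cp <= 0x000D) || cp == 0x0020 || cp == 0x0085
            || cp == 0x00A0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A)
            || cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F
            || cp == 0x3000;
    }

    /** Returns [start, end) char offsets of each maximal non-whitespace run. */
    static List<int[]> tokenSpans(CharSequence text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1;                       // -1 means "not inside a token"
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            if (isWhiteSpace(cp)) {
                if (start >= 0) {             // close the open token
                    spans.add(new int[] {start, i});
                    start = -1;
                }
            } else if (start < 0) {
                start = i;                    // open a new token
            }
            i += Character.charCount(cp);     // advance by code point
        }
        if (start >= 0) {
            spans.add(new int[] {start, text.length()});
        }
        return spans;
    }

    /** Convenience view of the spans as strings. */
    static List<String> tokens(CharSequence text) {
        List<String> out = new ArrayList<>();
        for (int[] s : tokenSpans(text)) {
            out.add(text.subSequence(s[0], s[1]).toString());
        }
        return out;
    }
}
```

For comparison, `"George\u00A0Washington".split("\\s+")` returns a single element, because the no-break space is outside Java's default `\s`.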
+               <para>
+                       <code>NameFinderDL</code> additionally locates 
reconstructed entity text in the original
+                       input with a cursor-based matcher: a space in the 
reconstructed span matches zero or more
+                       Unicode whitespace code points in the source, and every 
other code point is compared
+                       case-insensitively. This replaces the previous 
regular-expression approach and correctly
+                       handles spacing copied from PDFs, the web, or non-Latin 
sources when resolving
+                       <code>Span#getCoveredText(...)</code>.
+               </para>
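The matching rule in this paragraph can be sketched in a few lines. This is an illustration under stated assumptions, not the NameFinderDL implementation: the names SpanLocatorSketch, locate, and matchAt are hypothetical, the whitespace test is written out by hand, and case-insensitive comparison is approximated with a simple per-code-point fold.

```java
final class SpanLocatorSketch {
    /** Unicode White_Space membership, written out by hand for this sketch. */
    static boolean isWhiteSpace(int cp) {
        return (cp >= 0x0009 && cp <= 0x000D) || cp == 0x0020 || cp == 0x0085
            || cp == 0x00A0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A)
            || cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F
            || cp == 0x3000;
    }

    /** Simple one-to-one case fold (assumption: good enough for illustration). */
    static int foldCase(int cp) {
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    /** Returns the end offset if needle matches source at 'from', otherwise -1. */
    static int matchAt(CharSequence source, int from, CharSequence needle) {
        int s = from;
        for (int n = 0; n < needle.length(); ) {
            int want = Character.codePointAt(needle, n);
            n += Character.charCount(want);
            if (want == ' ') {
                // A space in the reconstructed text matches zero or more source whitespace.
                while (s < source.length()) {
                    int cp = Character.codePointAt(source, s);
                    if (!isWhiteSpace(cp)) break;
                    s += Character.charCount(cp);
                }
                continue;
            }
            if (s >= source.length()) return -1;
            int have = Character.codePointAt(source, s);
            if (foldCase(have) != foldCase(want)) return -1;
            s += Character.charCount(have);
        }
        return s;
    }

    /** Returns [start, end) of the first match in source, or null if there is none. */
    static int[] locate(CharSequence source, CharSequence needle) {
        if (needle.length() == 0) return null;
        for (int start = 0; start < source.length(); ) {
            int end = matchAt(source, start, needle);
            if (end >= 0) return new int[] {start, end};
            start += Character.charCount(Character.codePointAt(source, start));
        }
        return null;
    }
}
```

Because a space may match nothing at all, a reconstructed string such as `u . s .` still resolves to the contiguous source text `U.S.`, and the returned offsets always index the caller's original input.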
+               <para>
+                       Optional input folding is controlled through 
<code>InferenceOptions</code> and is
+                       <emphasis>off by default</emphasis> so existing models 
keep their prior inputs unless
+                       you opt in:
+               </para>
+               <itemizedlist>
+                       <listitem>
+                               <para>
+                                       
<code>setNormalizeWhitespace(true)</code> maps each Unicode whitespace code 
point
+                                       to a single ASCII space before 
inference. The transform is one code point to one
+                                       space, so character offsets stay 
aligned with the input.
+                               </para>
+                       </listitem>
+                       <listitem>
+                               <para>
+                                       <code>setNormalizeDashes(true)</code> 
maps each dash in the default
+                                       <code>CharClass.dashes()</code> set to 
the ASCII hyphen-minus. Mathematical minus
+                                       signs and the soft hyphen are not 
affected unless you extend the set explicitly.
+                                       This replacement is also one code point 
to one character for Basic Multilingual
+                                       Plane dashes.
+                               </para>
+                       </listitem>
+               </itemizedlist>
+               <para>
+                       Run-collapsing normalization (for example 
<code>WhitespaceCharSequenceNormalizer</code>,
+                       which collapses whitespace <emphasis>runs</emphasis> to 
a single space) is not enabled
+                       through these flags because it would shift character 
offsets. Use the
+                       <code>CharSequenceNormalizer</code> pipeline directly 
when you need that behavior on text
+                       that does not require offset-preserving span lookup. 
See also
+                       <xref linkend="tools.namefind.api.onnx"/> and
+                       <xref linkend="tools.doccat.api.onnx"/>.
+               </para>
+               <programlisting language="java">
+<![CDATA[InferenceOptions options = new InferenceOptions();
+options.setNormalizeWhitespace(true);   // opt-in: NBSP, ideographic space, ... -> ASCII space
+options.setNormalizeDashes(true);       // opt-in: en dash, em dash, ... -> hyphen-minus
+
+NameFinderDL finder = new NameFinderDL(model, vocab, ids2Labels, options, sentenceDetector);]]>
+               </programlisting>
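Why the flags fold one-to-one instead of collapsing runs can be seen with a small JDK-only sketch. It is not the library code: the class and method names are hypothetical, and the dash test covers only U+2010 to U+2015 as an illustrative subset, not the set behind CharClass.dashes().

```java
final class OffsetPreservingFoldSketch {
    /** Illustrative subset only: hyphen, non-breaking hyphen, figure/en/em dash, horizontal bar. */
    static boolean isSampleDash(int cp) {
        return cp >= 0x2010 && cp <= 0x2015;
    }

    /** One code point in, one character out: length and offsets are unchanged. */
    static String foldOneToOne(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.appendCodePoint(isSampleDash(cp) ? '-' : cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** Each run of dashes becomes a single hyphen-minus: shorter output, shifted offsets. */
    static String collapseRuns(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inRun = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSampleDash(cp)) {
                if (!inRun) out.append('-');
                inRun = true;
            } else {
                out.appendCodePoint(cp);
                inRun = false;
            }
        }
        return out.toString();
    }
}
```

With the one-to-one fold, an offset found in the folded text can be used directly against the source; after collapsing, every offset past the first shortened run would need an explicit map.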
+       </section>
+
+       <section xml:id="tools.normalizer.accentfold">
+               <title>Diacritic folding and multilingual safety</title>
+               <para>
+                       <code>AccentFoldCharSequenceNormalizer</code> folds 
accents for search, but does so in a
+                       script-aware way that a Latin-only folding filter 
cannot. It decomposes the text, then
+                       drops nonspacing combining marks only for base 
characters whose script is configured for
+                       folding (Latin, Greek, and Cyrillic by default). 
Combining marks on other scripts are
+                       left untouched, because there they are essential 
orthography rather than decoration:
+                       dropping an Indic vowel sign or virama, an Arabic 
harakat, a Hebrew point, or a Thai
+                       vowel would change the word.
+               </para>
+               <programlisting language="java">
+<![CDATA[CharSequenceNormalizer fold = 
AccentFoldCharSequenceNormalizer.getInstance();
+
+fold.normalize("café");                 // "cafe"   (Latin accent folded)
+fold.normalize("ά");                    // "α" (Greek alpha-with-tonos -> alpha)
+fold.normalize("का");              // unchanged (Devanagari is left intact)]]>
+               </programlisting>
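The script gate described above can be approximated with java.text.Normalizer and Character.UnicodeScript. This is a rough sketch of the idea only, with hypothetical names; it does not reproduce AccentFoldCharSequenceNormalizer and, in particular, does not handle atomic letters that have no decomposition.

```java
import java.text.Normalizer;
import java.util.EnumSet;
import java.util.Set;

final class ScriptGatedAccentFoldSketch {
    /** Scripts whose nonspacing marks are dropped (mirrors the documented defaults). */
    static final Set<Character.UnicodeScript> FOLDED = EnumSet.of(
        Character.UnicodeScript.LATIN,
        Character.UnicodeScript.GREEK,
        Character.UnicodeScript.CYRILLIC);

    static String fold(String text) {
        // Decompose so accents become separate combining marks.
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(nfd.length());
        Character.UnicodeScript baseScript = null;   // script of the most recent non-Mn code point
        for (int i = 0; i < nfd.length(); ) {
            int cp = nfd.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                if (baseScript != null && FOLDED.contains(baseScript)) {
                    continue;                        // drop the mark on a foldable base
                }
            } else {
                baseScript = Character.UnicodeScript.of(cp);
            }
            out.appendCodePoint(cp);
        }
        // Recompose whatever marks were kept.
        return Normalizer.normalize(out, Normalizer.Form.NFC);
    }
}
```

The Devanagari virama (U+094D) is a nonspacing mark, so it is the gate on the base script, not the mark's category, that keeps it in place.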
+               <para>
+                       Atomic Latin letters that do not decompose are mapped 
to an ASCII approximation by
+                       default: for example the stroke letters and ligatures, 
eszett, and thorn
+                       (<code>ø -&gt; o</code>, <code>æ -&gt; ae</code>, 
<code>ß -&gt; ss</code>,
+                       <code>þ -&gt; th</code>). Both behaviors are 
configurable through the constructor:
+               </para>
+               <programlisting language="java">
+<![CDATA[// fold only Latin, and do not map the stroke letters
+CharSequenceNormalizer latinOnly = new AccentFoldCharSequenceNormalizer(
+    java.util.Set.of(Character.UnicodeScript.LATIN), false);]]>
+               </programlisting>
+               <para>
+                       Diacritic folding is a recall optimization, not a 
linguistically correct transform, so it
+                       is intended for a search or matching form rather than 
for display. Language-specific case
+                       and letter rules (for example German <code>DIN</code> 
umlaut expansion, or the Turkish
+                       dotless-i) are out of scope for the default folder and 
should be applied with an explicit
+                       locale upstream.
+               </para>
+       </section>
+
+       <section xml:id="tools.normalizer.charclass">
+               <title>The CharClass engine and code point sets</title>
+               <para>
+                       The set-based normalizers are built on 
<code>CharClass</code>, a configurable class of
+                       Unicode code points paired with a single canonical 
replacement, backed by a
+                       <code>CodePointSet</code> with O(1) membership. You 
choose both the membership and the
+                       replacement code point with <code>CharClass.of(members, 
replacement)</code>; whitespace and
+                       dashes are the two built-in presets, and any other 
class is one more configured instance:
+               </para>
+               <programlisting language="java">
+<![CDATA[CharClass ws = CharClass.whitespace();   // Unicode White_Space -> U+0020
+CharClass dash = CharClass.dashes();      // Unicode Dash (curated) -> U+002D
+
+ws.collapse("a   b");                      // "a b"          (runs -> one space)
+ws.trim("  hi  ");                         // "hi"
+String[] tokens = ws.split("one two"); // ["one", "two"]  (offset-aware via splitSpans)
+dash.normalize("a—b");                // "a-b"]]>
+               </programlisting>
+               <para>
+                       A class applies its replacement three ways, which 
differ in whether they collapse runs and
+                       whether they preserve character offsets:
+               </para>
+               <itemizedlist>
+                       <listitem>
+                               <para>
+                                       <code>normalize(text)</code> replaces 
each member one-for-one with the replacement,
+                                       so it is length- and offset-preserving; 
use it when you still need spans back into
+                                       the original text.
+                               </para>
+                       </listitem>
+                       <listitem>
+                               <para>
+                                       <code>collapse(text)</code> reduces 
each maximal run of members to a single
+                                       replacement; it changes length, so it 
is a search and match transform.
+                               </para>
+                       </listitem>
+                       <listitem>
+                               <para>
+                                       <code>collapsePreserving(text, keep, 
keepReplacement)</code> collapses runs but emits
+                                       <code>keepReplacement</code> for any 
run containing a kept code point, which is how
+                                       you squish horizontal whitespace while 
keeping line breaks.
+                               </para>
+                       </listitem>
+               </itemizedlist>
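The third mode is the least obvious of the three, so here is a JDK-only sketch of the run logic it describes. It is an assumption-laden illustration, not CharClass itself: the class name, the method name collapseKeeping, its parameter order, and the six-member set are all made up for this example.

```java
final class CollapseKeepingSketch {
    /** Small illustrative member set: space, tab, LF, CR, no-break space, ideographic space. */
    static boolean isMember(int cp) {
        return cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r'
            || cp == 0x00A0 || cp == 0x3000;
    }

    /**
     * Collapses each maximal run of members to one code point: keepReplacement
     * if the run contains 'keep', otherwise the ordinary replacement.
     */
    static String collapseKeeping(String text, int replacement, int keep, int keepReplacement) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (!isMember(cp)) {
                out.appendCodePoint(cp);
                i += Character.charCount(cp);
                continue;
            }
            // Consume the whole run, remembering whether it held a kept code point.
            boolean sawKeep = false;
            while (i < text.length()) {
                int member = text.codePointAt(i);
                if (!isMember(member)) break;
                sawKeep |= (member == keep);
                i += Character.charCount(member);
            }
            out.appendCodePoint(sawKeep ? keepReplacement : replacement);
        }
        return out.toString();
    }
}
```

Calling `collapseKeeping(text, ' ', '\n', '\n')` squeezes horizontal spacing to single spaces while any run that contained a line feed survives as exactly one line feed.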
+               <para>
+                       So the replacement is your choice and the method picks 
the behavior. Folding tabs and
+                       newlines to a single newline, for example, is one 
configured class:
+               </para>
+               <programlisting language="java">
+<![CDATA[CharClass lineFold = CharClass.of(CodePointSet.of('\n', '\t'), '\n');
+lineFold.collapse("\n\n\n\t\n");   // "\n"  (the whole run folds to one newline)
+
+CharClass ws = CharClass.whitespace();
+ws.collapsePreserving(text, CodePointSet.of('\n'), '\n');   // squish spaces, keep paragraph breaks]]>
+               </programlisting>
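+               <para>
+                       To make the three behaviors concrete without the new classes, here is a minimal JDK-only
+                       sketch. <code>FoldSketch</code> and its method signatures are hypothetical illustrations,
+                       not the OpenNLP implementation; members are modeled as a plain set of code points.
+               </para>

```java
import java.util.Set;

/** JDK-only illustration of the three fold behaviors; NOT the OpenNLP implementation. */
final class FoldSketch {

  /** One-for-one: every member becomes the replacement (offsets unchanged for BMP members). */
  static String normalize(String text, Set<Integer> members, char replacement) {
    StringBuilder out = new StringBuilder(text.length());
    text.codePoints().forEach(cp -> {
      if (members.contains(cp)) out.append(replacement); else out.appendCodePoint(cp);
    });
    return out.toString();
  }

  /** Run-collapsing: each maximal run of members becomes a single replacement. */
  static String collapse(String text, Set<Integer> members, char replacement) {
    return collapsePreserving(text, members, replacement, Set.of(), replacement);
  }

  /** Like collapse, but a run that contains a kept code point emits keepReplacement. */
  static String collapsePreserving(String text, Set<Integer> members, char replacement,
                                   Set<Integer> keep, char keepReplacement) {
    StringBuilder out = new StringBuilder(text.length());
    int i = 0;
    while (i < text.length()) {
      int cp = text.codePointAt(i);
      if (!members.contains(cp)) {
        out.appendCodePoint(cp);
        i += Character.charCount(cp);
        continue;
      }
      // Consume the whole run of members, remembering whether a kept code point occurred.
      boolean sawKept = false;
      while (i < text.length()) {
        int next = text.codePointAt(i);
        if (!members.contains(next)) break;
        sawKept |= keep.contains(next);
        i += Character.charCount(next);
      }
      out.append(sawKept ? keepReplacement : replacement);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    Set<Integer> ws = Set.of((int) ' ', (int) '\t', (int) '\n');
    Set<Integer> newline = Set.of((int) '\n');
    System.out.println(normalize("a\tb", ws, ' '));                        // a b
    System.out.println(collapse("a \t b", ws, ' '));                       // a b
    System.out.println(collapsePreserving("a  b \n\n c", ws, ' ', newline, '\n')
        .equals("a b\nc"));                                                // true
  }
}
```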
+               <para>
+                       When you need the normalized form together with a map 
back to the original, the
+                       <code>normalizeMapped</code> and 
<code>collapseMapped</code> variants return a
+                       <code>NormalizedText</code> that carries the offset map.
+               </para>
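+               <para>
+                       The idea behind an offset map can be sketched as follows. This is an assumption-laden
+                       illustration: <code>MappedCollapseSketch</code> and its <code>Mapped</code> record are
+                       hypothetical, the real <code>NormalizedText</code> may store its map differently, and
+                       Java's <code>Character.isWhitespace</code> stands in for the configured class.
+               </para>

```java
import java.util.Arrays;

/** JDK-only sketch of run-collapsing with an offset map; all names are hypothetical. */
final class MappedCollapseSketch {

  /** The collapsed text plus, for each output char, the index of the source char it starts at. */
  record Mapped(String text, int[] sourceOffsets) {}

  static Mapped collapseWhitespace(String input) {
    StringBuilder out = new StringBuilder(input.length());
    int[] offsets = new int[input.length()];
    int i = 0;
    while (i < input.length()) {
      char c = input.charAt(i);
      offsets[out.length()] = i;            // output position -> start of its source
      if (Character.isWhitespace(c)) {
        out.append(' ');
        // Skip the rest of the run; the single space maps to the run's first char.
        while (i < input.length() && Character.isWhitespace(input.charAt(i))) i++;
      } else {
        out.append(c);
        i++;
      }
    }
    return new Mapped(out.toString(), Arrays.copyOf(offsets, out.length()));
  }

  public static void main(String[] args) {
    Mapped m = collapseWhitespace("a   b\t\tc");
    System.out.println(m.text());                             // a b c
    System.out.println(Arrays.toString(m.sourceOffsets()));   // [0, 1, 4, 5, 7]
  }
}
```

+               <para>
+                       A match at position 2 of the collapsed text (the <code>b</code>) maps back to offset 4
+                       in the input, which is what highlighting needs.
+               </para>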
+               <para>
+                       A <code>CodePointSet</code> can be built explicitly, as 
a range, by union, or loaded from
+                       a user definitions file so that delimiters can be 
extended without a code change. The
+                       file is line oriented and parsed with the same cursor 
approach (no regular expression): a
+                       <code>[name]</code> line opens a section, a 
<code>#</code> begins a comment, and each
+                       remaining line is a hex code point or an inclusive 
range.
+               </para>
+               <programlisting language="none">
+<![CDATA[[whitespace]
+U+00A0          # no-break space
+U+2000-U+200A   # typographic spaces
+
+[dash]
+U+2E5D          # oblique hyphen]]>
+               </programlisting>
+               <programlisting language="java">
+<![CDATA[CodePointSet extra = CodePointSet.fromFile(path, "whitespace");
+CharClass wsPlus = CharClass.whitespace().withAdditional(extra);]]>
+               </programlisting>
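+               <para>
+                       For readers who want to see the file format spelled out, the following JDK-only sketch
+                       parses it with <code>indexOf</code> and <code>substring</code> instead of a regular
+                       expression. It is not the <code>CodePointSet.fromFile</code> implementation;
+                       <code>CodePointFileSketch</code> is hypothetical and assumes that every entry uses the
+                       <code>U+XXXX</code> form shown in the sample.
+               </para>

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

/** JDK-only sketch of the definitions-file format; NOT CodePointSet.fromFile itself. */
final class CodePointFileSketch {

  /** Parses "[name]" sections holding "U+XXXX" or "U+XXXX-U+YYYY" lines; "#" starts a comment. */
  static Map<String, TreeSet<Integer>> parse(String content) {
    Map<String, TreeSet<Integer>> sections = new LinkedHashMap<>();
    TreeSet<Integer> current = null;
    for (String raw : content.lines().toList()) {
      int hash = raw.indexOf('#');
      String line = (hash >= 0 ? raw.substring(0, hash) : raw).strip();
      if (line.isEmpty()) continue;                       // blank or comment-only line
      if (line.startsWith("[") && line.endsWith("]")) {   // section header
        String name = line.substring(1, line.length() - 1);
        current = sections.computeIfAbsent(name, k -> new TreeSet<>());
        continue;
      }
      if (current == null) {
        throw new IllegalArgumentException("entry before any section: " + line);
      }
      int dash = line.indexOf('-');
      int first = hex(dash >= 0 ? line.substring(0, dash) : line);
      int last = dash >= 0 ? hex(line.substring(dash + 1)) : first;
      if (last < first) throw new IllegalArgumentException("descending range: " + line);
      for (int cp = first; cp <= last; cp++) current.add(cp);   // inclusive range
    }
    return sections;
  }

  private static int hex(String token) {
    String t = token.strip();
    if (!t.startsWith("U+")) throw new IllegalArgumentException("expected U+XXXX: " + token);
    return Integer.parseInt(t.substring(2), 16);
  }

  public static void main(String[] args) {
    String file = String.join("\n",
        "[whitespace]",
        "U+00A0          # no-break space",
        "U+2000-U+200A   # typographic spaces",
        "",
        "[dash]",
        "U+2E5D          # oblique hyphen");
    Map<String, TreeSet<Integer>> sets = parse(file);
    System.out.println(sets.get("whitespace").size());       // 12
    System.out.println(sets.get("dash").contains(0x2E5D));   // true
  }
}
```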
+       </section>
+
+       <section xml:id="tools.normalizer.term">
+               <title>The layered term model</title>
+               <para>
+                       <code>TermAnalyzer</code> tokenizes text and gives each 
token a
+                       <emphasis>stack</emphasis> of normalization layers 
while keeping its source span. It is the
+                       offset-preserving entry point for matching and 
BM25-style search: the normalized form is
+                       what you index or query, and the span ties every layer 
back to the original text for
+                       highlighting, even when normalization changes a token's 
length. A <code>Term</code> is one
+                       token projected through an ordered chain of
+                       <code>Dimension</code>s: original, NFC, NFKC, 
whitespace, dash, case fold, accent fold,
+                       confusable fold, stem, and lemma. The order is fixed 
because the transforms do not commute
+                       (case folding then accent folding differs from the 
reverse). The original is always kept,
+                       so aggressive folding stays safe and a match on any 
layer maps back to the source through
+                       the token's <code>Span</code>.
+               </para>
+               <programlisting language="java">
+<![CDATA[TermAnalyzer analyzer = TermAnalyzer.builder()
+    .caseFold()
+    .stem(new PorterStemmer())
+    .build();
+
+Term term = analyzer.analyze("Running").get(0);
+// term.original()           -> "Running"
+// term.normalized()         -> "run"      (the final configured dimension, here STEM)
+// term.peel()               -> "running"  (the layer below the top, O(1))
+// term.at(Dimension.NFC)    -> computed lazily on first request, then cached]]>
+               </programlisting>
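+               <para>
+                       The stacking idea itself can be illustrated with the JDK alone. This sketch is not
+                       <code>TermAnalyzer</code>: <code>LayerSketch</code> is hypothetical, lower-casing with
+                       <code>Locale.ROOT</code> is only a stand-in for full case folding, and the accent fold
+                       simply drops combining marks after NFD.
+               </para>

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** JDK-only sketch of a token projected through ordered layers; NOT TermAnalyzer. */
final class LayerSketch {

  /** Derives each layer from the one below it; [start, end) still points at the original text. */
  static Map<String, String> layers(String text, int start, int end) {
    String original = text.substring(start, end);
    String nfc = Normalizer.normalize(original, Normalizer.Form.NFC);
    String caseFolded = nfc.toLowerCase(Locale.ROOT);   // stand-in for full case folding

    // Accent fold: decompose, then drop non-spacing combining marks.
    StringBuilder accentFolded = new StringBuilder();
    Normalizer.normalize(caseFolded, Normalizer.Form.NFD).codePoints()
        .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
        .forEach(accentFolded::appendCodePoint);

    Map<String, String> stack = new LinkedHashMap<>();   // insertion order = layer order
    stack.put("ORIGINAL", original);
    stack.put("NFC", nfc);
    stack.put("CASE_FOLD", caseFolded);
    stack.put("ACCENT_FOLD", accentFolded.toString());
    return stack;
  }

  public static void main(String[] args) {
    // The token "Caf\u00e9" occupies [3, 7) in the source text.
    Map<String, String> stack = layers("Le Caf\u00e9", 3, 7);
    System.out.println(stack.get("CASE_FOLD").equals("caf\u00e9"));   // true
    System.out.println(stack.get("ACCENT_FOLD"));                     // cafe
  }
}
```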
+               <para>
+                       Segmentation uses the <xref 
linkend="tools.tokenizer.uax29"/> word tokenizer, so the input
+                       does not need to be pre-tokenized. The dimensions named 
in the builder are computed eagerly;
+                       any other dimension is computed on first request, 
applied on top of the final form, and
+                       cached, so querying a configured layer or peeling the 
last one is O(1) and adding an
+                       unrequested dimension costs one transform. The 
character-level dimensions have built-in
+                       defaults; <code>STEM</code> and <code>LEMMA</code> 
require a
+                       <code>Stemmer</code> or <code>Lemmatizer</code> (and 
<code>LEMMA</code> a part-of-speech
+                       tag), and fail loudly if requested without them. An 
analyzer configured with a stemmer is
+                       not thread-safe, because the Snowball stemmers are 
stateful.
+               </para>
+               <para>
+                       Each dimension's transform is configurable on the 
builder. Beyond the no-argument methods
+                       that enable a dimension with its default, there are 
convenience methods for the common
+                       knobs, and a general <code>transform(dimension, 
normalizer)</code> escape hatch for any
+                       character-level dimension:
+               </para>
+               <programlisting language="java">
+<![CDATA[TermAnalyzer analyzer = TermAnalyzer.builder()
+    .whitespace(CharClass.of(CodePointSet.of('\n', '\t'), '\n')::collapse) // custom target/behavior
+    .caseFold(Locale.forLanguageTag("tr"))                                 // Turkish case rules
+    .accentFold(Set.of(Character.UnicodeScript.LATIN), false)              // fold only Latin
+    .maxTokenLength(255)                                                   // tokenizer chopping
+    .build();]]>
+               </programlisting>
+               <para>
+                       The whitespace and dash methods take any 
<code>CharSequenceNormalizer</code>, so a
+                       <code>CharClass</code> method reference 
(<code>::normalize</code> for one-for-one,
+                       <code>::collapse</code> for run-collapsing) selects 
both the fold target and the behavior.
+                       The case-fold method takes a <code>Locale</code> for 
language-specific rules such as the
+                       Turkish dotted/dotless i, and the accent-fold method 
takes the scripts to fold and whether
+                       to fold stroke letters.
+               </para>
+       </section>
+
+       <section xml:id="tools.normalizer.confusables">
+               <title>Confusable (homoglyph) folding</title>
+               <para>
+                       <code>Confusables</code> reduces text to its Unicode 
confusable
+                       <emphasis>skeleton</emphasis> following
+                       <link xlink:href="https://www.unicode.org/reports/tr39/">UTS #39</link>: it decomposes the
+                       text, replaces each code point with its prototype, and 
decomposes again. Two strings are
+                       confusable exactly when their skeletons are equal, 
which catches spoofing where Cyrillic or
+                       Greek letters imitate Latin ones.
+               </para>
+               <programlisting language="java">
+<![CDATA[// "paypal" with Cyrillic a's looks identical but is a different code-point sequence
+String spoofed = "p\u0430yp\u0430l";         // U+0430 is CYRILLIC SMALL LETTER A
+Confusables.confusable("paypal", spoofed);   // true
+Confusables.skeleton("paypal");              // a matching key, not readable text]]>
+               </programlisting>
+               <para>
+                       The skeleton changes length and offsets, so like accent 
folding it is a derived,
+                       matching-only form. It is also available as
+                       <code>ConfusableSkeletonCharSequenceNormalizer</code> 
and as the
+                       <code>CONFUSABLE_FOLD</code> term dimension. The 
mapping comes from the bundled Unicode
+                       security data file <code>confusables.txt</code>.
+               </para>
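+               <para>
+                       The decompose, map, decompose sequence can be sketched with a toy table. The three
+                       Cyrillic entries below are a hand-picked subset for illustration only; the real
+                       <code>Confusables</code> class reads the full <code>confusables.txt</code>, and
+                       <code>SkeletonSketch</code> is a hypothetical name.
+               </para>

```java
import java.text.Normalizer;
import java.util.Map;

/** Toy UTS #39 skeleton with a three-entry table; NOT the bundled confusables data. */
final class SkeletonSketch {

  // Hand-picked prototypes: Cyrillic letters that look like Latin a, p, and y.
  private static final Map<Integer, String> PROTOTYPES = Map.of(
      0x0430, "a",    // CYRILLIC SMALL LETTER A
      0x0440, "p",    // CYRILLIC SMALL LETTER ER
      0x0443, "y");   // CYRILLIC SMALL LETTER U

  /** NFD, replace each code point with its prototype, then NFD again. */
  static String skeleton(String text) {
    StringBuilder mapped = new StringBuilder();
    Normalizer.normalize(text, Normalizer.Form.NFD).codePoints().forEach(cp -> {
      String prototype = PROTOTYPES.get(cp);
      if (prototype != null) mapped.append(prototype); else mapped.appendCodePoint(cp);
    });
    return Normalizer.normalize(mapped, Normalizer.Form.NFD);
  }

  /** Two strings are treated as confusable when their skeletons are equal. */
  static boolean confusable(String a, String b) {
    return skeleton(a).equals(skeleton(b));
  }

  public static void main(String[] args) {
    String spoofed = "p\u0430yp\u0430l";                       // Cyrillic a in place of Latin a
    System.out.println("paypal".equals(spoofed));              // false
    System.out.println(confusable("paypal", spoofed));         // true
    System.out.println(confusable("paypal", "payroll"));       // false
  }
}
```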
+       </section>
+
+       <section xml:id="tools.normalizer.language">
+               <title>Per-language profiles</title>
+               <para>
+                       <code>NormalizationProfiles</code> selects per-language 
settings the same way OpenNLP
+                       already selects a Snowball stemmer by language: ask for 
a language, or detect it with a
+                       <code>LanguageDetector</code> when it is unspecified. 
Each
+                       <code>NormalizationProfile</code> pairs a language with 
its Snowball stemmer and the
+                       diacritic fold appropriate for that language, and 
builds a search-oriented
+                       <code>TermAnalyzer</code>.
+               </para>
+               <programlisting language="java">
+<![CDATA[NormalizationProfile german = NormalizationProfiles.forLanguage("de").orElseThrow();
+TermAnalyzer analyzer = german.searchAnalyzer();   // NFC, case fold, German fold, German stemmer
+// "Mueller" and "Müller" both reduce to the same search term
+
+// detect the language when it is not known
+NormalizationProfiles.detect(text, languageDetector).map(NormalizationProfile::searchAnalyzer);]]>
+               </programlisting>
+               <para>
+                       The diacritic fold is the generic accent fold for English and the major Romance languages,
+                       the German-specific fold (a-umlaut to <code>ae</code>, eszett to <code>ss</code>, following
+                       DIN 5007-2) for German, and none for the Nordic languages and non-Latin scripts, where
+                       folding distinct letters would be linguistically wrong. As stated in
+                       <xref linkend="tools.normalizer.accentfold"/>, this is a search-recall choice, not a claim of
+                       linguistic correctness; a caller that wants different behavior builds a
+                       <code>TermAnalyzer</code> directly.
+               </para>
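+               <para>
+                       As a rough illustration of the German-specific fold, the sketch below expands the
+                       lower-case umlauts and the eszett in the DIN 5007-2 style. It assumes the input is
+                       already NFC and case folded; <code>GermanFoldSketch</code> is hypothetical and is not
+                       the fold that the German profile actually ships.
+               </para>

```java
/** JDK-only sketch of a DIN 5007-2 style fold; NOT the profile's actual implementation. */
final class GermanFoldSketch {

  /** Expands lower-case umlauts and eszett; assumes NFC, case-folded input. */
  static String fold(String text) {
    StringBuilder out = new StringBuilder(text.length() + 4);
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      switch (c) {
        case '\u00e4' -> out.append("ae");   // a-umlaut
        case '\u00f6' -> out.append("oe");   // o-umlaut
        case '\u00fc' -> out.append("ue");   // u-umlaut
        case '\u00df' -> out.append("ss");   // eszett
        default -> out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(fold("m\u00fcller"));                          // mueller
    System.out.println(fold("m\u00fcller").equals(fold("mueller")));  // true
    System.out.println(fold("stra\u00dfe"));                          // strasse
  }
}
```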
+       </section>
+
+       <section xml:id="tools.normalizer.reference">
+               <title>Reference data</title>
+               <para>
+                       The underlying Unicode data is also available directly 
as immutable reference tables,
+                       with O(1) membership tests that match the Unicode 
standard:
+               </para>
+               <itemizedlist>
+                       <listitem>
+                               <para>
+                                       <code>UnicodeWhitespace</code> lists 
the 25 characters carrying the
+                                       <code>White_Space</code> property, plus 
the related look-alike format characters
+                                       (zero width space, byte order mark, 
...) that are <emphasis>not</emphasis>
+                                       whitespace. It exposes 
<code>isWhitespace(int)</code>,
+                                       <code>byCodePoint(int)</code>, and 
helpers for the line breaks and the
+                                       non-breaking spaces.
+                               </para>
+                       </listitem>
+                       <listitem>
+                               <para>
+                                       <code>UnicodeDash</code> lists every code point carrying the <code>Dash</code>
+                                       property and marks the mathematical minus signs, which are excluded from the
+                                       default normalization set.
+                               </para>
+                       </listitem>
+               </itemizedlist>
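+               <para>
+                       To show why a dedicated table matters, this JDK-only sketch hard-codes the 25
+                       <code>White_Space</code> code points and compares the result with
+                       <code>Character.isWhitespace</code>. <code>WhiteSpaceSketch</code> is a hypothetical
+                       stand-in, not the <code>UnicodeWhitespace</code> class.
+               </para>

```java
/** JDK-only sketch of a White_Space membership test; NOT the UnicodeWhitespace class. */
final class WhiteSpaceSketch {

  /** True for exactly the code points that carry the Unicode White_Space property. */
  static boolean isWhiteSpace(int cp) {
    return (cp >= 0x0009 && cp <= 0x000D)    // tab, LF, VT, FF, CR
        || cp == 0x0020                      // space
        || cp == 0x0085                      // next line
        || cp == 0x00A0                      // no-break space
        || cp == 0x1680                      // ogham space mark
        || (cp >= 0x2000 && cp <= 0x200A)    // en quad .. hair space
        || cp == 0x2028 || cp == 0x2029      // line and paragraph separators
        || cp == 0x202F                      // narrow no-break space
        || cp == 0x205F                      // medium mathematical space
        || cp == 0x3000;                     // ideographic space
  }

  public static void main(String[] args) {
    int count = 0;
    for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
      if (isWhiteSpace(cp)) count++;
    }
    System.out.println(count);                              // 25
    System.out.println(isWhiteSpace(0x00A0));               // true
    System.out.println(Character.isWhitespace(0x00A0));     // false: Java excludes no-break spaces
    System.out.println(isWhiteSpace(0x200B));               // false: zero width space is a format char
  }
}
```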
+       </section>
+
+</chapter>
diff --git a/opennlp-docs/src/docbkx/opennlp.xml b/opennlp-docs/src/docbkx/opennlp.xml
index 67eb1edf1..843bfbc9b 100644
--- a/opennlp-docs/src/docbkx/opennlp.xml
+++ b/opennlp-docs/src/docbkx/opennlp.xml
@@ -101,6 +101,7 @@ under the License.
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./langdetect.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./sentdetect.xml"/>
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./tokenizer.xml" />
+       <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./normalizer.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./stopword.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./namefinder.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./doccat.xml" />
diff --git a/opennlp-docs/src/docbkx/tokenizer.xml b/opennlp-docs/src/docbkx/tokenizer.xml
index b6fb7b074..7bb3356de 100644
--- a/opennlp-docs/src/docbkx/tokenizer.xml
+++ b/opennlp-docs/src/docbkx/tokenizer.xml
@@ -23,7 +23,16 @@
                        The OpenNLP Tokenizers segment an input character 
sequence into
                        tokens. Tokens are usually
                        words, punctuation, numbers, etc.
-
+               </para>
+               <para>
+                       The statistical tokenizers in this chapter assume 
conventional whitespace-separated training
+                       and test data. When input contains Unicode spacing or 
dash variants (no-break space,
+                       ideographic space, en dash, and similar characters from 
PDFs or the web), use the
+                       Unicode-aware preprocessing described in <xref 
linkend="tools.normalizer"/>. The DL
+                       components apply that machinery automatically for 
document chunking; see
+                       <xref linkend="tools.normalizer.dl"/>.
+               </para>
+               <para>
                        <screen>
 <![CDATA[Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
 Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
@@ -443,4 +452,84 @@ DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations)
                        </para>
                </section>
        </section>
+
+       <section xml:id="tools.tokenizer.uax29">
+               <title>Unicode Word Segmentation (UAX #29)</title>
+               <para>
+                       The package <code>opennlp.tools.tokenize.uax29</code> 
provides a tokenizer that follows the
+                       Unicode Text Segmentation algorithm
+                       (<link xlink:href="https://www.unicode.org/reports/tr29/">UAX #29</link>), word boundary
+                       rules WB1 through WB999. It is rule based and needs no trained model. It works directly over
+                       a <code>CharSequence</code> and reports character offsets, so the original text is
+                       preserved for downstream processing such as the 
normalization described in
+                       <xref linkend="tools.normalizer"/>. The boundary data 
comes from the bundled Unicode
+                       Character Database (currently Unicode 17.0) and the 
implementation passes the official
+                       <code>WordBreakTest</code> conformance suite for that 
release.
+               </para>
+               <section xml:id="tools.tokenizer.uax29.segmenter">
+                       <title>Word Segmenter</title>
+                       <para>
+                               <code>WordSegmenter</code> finds the word 
boundaries. It is a single forward cursor pass
+                               with constant-time property look-ups and no 
regular expression. Every segment is
+                               reported, including whitespace and punctuation 
runs, so the segments are contiguous and
+                               together cover the whole text.
+                               <programlisting language="java">
+<![CDATA[int[] boundaries = WordSegmenter.boundaries("The quick brown fox.");
+List<Span> segments = WordSegmenter.segments("The quick brown fox.");]]>
+                               </programlisting>
+                               For allocation-free processing of large inputs, 
stream the segments to a callback instead
+                               of collecting them.
+                               <programlisting language="java">
+<![CDATA[WordSegmenter.forEachSegment("The quick brown fox.", (start, end) -> {
+  // handle the segment [start, end)
+});]]>
+                               </programlisting>
+                       </para>
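+                       <para>
+                               The notion of contiguous segments that cover the whole text can be tried out with the
+                               JDK's own <code>java.text.BreakIterator</code>. This is only an approximation: the JDK
+                               word iterator is not the <code>WordSegmenter</code> described here and is not claimed to
+                               pass the conformance suite, and <code>BreakIteratorSketch</code> with its crude
+                               letter-or-digit filter is hypothetical.
+                       </para>

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** JDK stand-in for word segmentation; NOT the UAX #29 WordSegmenter from this chapter. */
final class BreakIteratorSketch {

  /** Every segment between consecutive boundaries, including spaces and punctuation. */
  static List<String> segments(String text) {
    BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
    it.setText(text);
    List<String> out = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      out.add(text.substring(start, end));
    }
    return out;
  }

  /** Crude word filter: keep segments whose first code point is a letter or digit. */
  static List<String> words(String text) {
    List<String> out = new ArrayList<>();
    for (String segment : segments(text)) {
      if (Character.isLetterOrDigit(segment.codePointAt(0))) out.add(segment);
    }
    return out;
  }

  public static void main(String[] args) {
    String text = "The quick brown fox.";
    // Segments are contiguous, so joining them reproduces the input.
    System.out.println(String.join("", segments(text)).equals(text));   // true
    System.out.println(words(text));   // [The, quick, brown, fox]
  }
}
```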
+               </section>
+               <section xml:id="tools.tokenizer.uax29.tokenizer">
+                       <title>Word Tokenizer</title>
+                       <para>
+                               <code>WordTokenizer</code> builds on the segmenter. It keeps the segments that are words
+                               (letters, digits, ideographs, kana, Hangul, Southeast Asian scripts, or emoji), drops
+                               whitespace and punctuation, and classifies each token. It implements the standard
+                               <code>Tokenizer</code> interface, so it can be used wherever a tokenizer is expected.
+                               <programlisting language="java">
+<![CDATA[Tokenizer tokenizer = new WordTokenizer();
+String[] tokens = tokenizer.tokenize("The quick brown fox.");
+Span[] spans = tokenizer.tokenizePos("The quick brown fox.");]]>
+                               </programlisting>
+                               The tokens array contains "The", "quick", 
"brown", and "fox"; the trailing period and the
+                               spaces are dropped. The 
<code>tokenizeTyped</code> method additionally returns the
+                               category of each token as a 
<code>WordType</code>.
+                               <programlisting language="java">
+<![CDATA[WordTokenizer wordTokenizer = new WordTokenizer();
+for (WordToken token : wordTokenizer.tokenizeTyped("OpenNLP 3.0")) {
+  System.out.println(token.text("OpenNLP 3.0") + " : " + token.type());
+}
+// OpenNLP : ALPHANUMERIC
+// 3.0     : NUMERIC]]>
+                               </programlisting>
+                               The categories are <code>ALPHANUMERIC</code>, 
<code>NUMERIC</code>,
+                               <code>IDEOGRAPHIC</code>, 
<code>HIRAGANA</code>, <code>KATAKANA</code>,
+                               <code>HANGUL</code>, 
<code>SOUTHEAST_ASIAN</code>, and <code>EMOJI</code>.
+                       </para>
+                       <para>
+                               A streaming overload reports each token to a 
handler with no per-token allocation, which
+                               is the fastest option when the tokens are 
consumed on the fly.
+                               <programlisting language="java">
+<![CDATA[WordTokenizer wordTokenizer = new WordTokenizer();
+wordTokenizer.tokenize("The quick brown fox.", (start, end, type) -> {
+  // handle the token [start, end) of the given WordType
+});]]>
+                               </programlisting>
+                               A token longer than the maximum token length is emitted as consecutive pieces without
+                               splitting a surrogate pair. The maximum defaults to
+                               <code>WordTokenizer.DEFAULT_MAX_TOKEN_LENGTH</code> and can be set through the
+                               constructor.
+                               <programlisting language="java">
+<![CDATA[WordTokenizer wordTokenizer = new WordTokenizer(64);]]>
+                               </programlisting>
+                       </para>
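+                       <para>
+                               One way to chop an over-long token without splitting a surrogate pair is sketched
+                               below. It assumes the maximum is counted in UTF-16 <code>char</code>s, which the text
+                               above does not state; <code>ChopSketch</code> is hypothetical and is not the
+                               <code>WordTokenizer</code> code.
+                       </para>

```java
import java.util.ArrayList;
import java.util.List;

/** JDK-only sketch of surrogate-safe chopping; NOT the WordTokenizer implementation. */
final class ChopSketch {

  /** Splits token into pieces of at most max chars, never between the halves of a surrogate pair. */
  static List<String> chop(String token, int max) {
    if (max < 2) throw new IllegalArgumentException("max must be at least 2");
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < token.length()) {
      int end = Math.min(start + max, token.length());
      if (end < token.length()
          && Character.isHighSurrogate(token.charAt(end - 1))
          && Character.isLowSurrogate(token.charAt(end))) {
        end--;   // back off so the whole pair moves to the next piece
      }
      pieces.add(token.substring(start, end));
      start = end;
    }
    return pieces;
  }

  public static void main(String[] args) {
    // U+1F600 is encoded as the surrogate pair D83D DE00 and would straddle index 3.
    List<String> pieces = chop("ab\uD83D\uDE00cd", 3);
    System.out.println(pieces.size());                                    // 3
    System.out.println(pieces.get(0));                                    // ab
    System.out.println(pieces.get(1).equals("\uD83D\uDE00c"));            // true
  }
}
```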
+               </section>
+       </section>
 </chapter>
