(opennlp) 05/05: OPENNLP-1850 - Document text normalization in the manual

kristian Thu, 18 Jun 2026 22:14:12 -0700

This is an automated email from the ASF dual-hosted git repository.

krickert pushed a commit to branch OPENNLP-1850_Whitespace-UTF-Normalizae
in repository https://gitbox.apache.org/repos/asf/opennlp.git


commit 858fb7f571cecb57eea870e5f74494bdcbe90bc8
Author: Kristian Rickert <[email protected]>
AuthorDate: Thu Jun 18 23:12:04 2026 -0400

    OPENNLP-1850 - Document text normalization in the manual
    
    Add a Text Normalization chapter to the developer manual covering the
    normalizer family, the TextNormalizer pipeline, script-gated diacritic folding
    and its multilingual safety, the CharClass engine and user-defined code point
    sets, offset-preserving analysis, and the Unicode reference data.
---
 opennlp-docs/src/docbkx/normalizer.xml | 317 +++++++++++++++++++++++++++++++++
 opennlp-docs/src/docbkx/opennlp.xml    |   1 +
 2 files changed, 318 insertions(+)

diff --git a/opennlp-docs/src/docbkx/normalizer.xml b/opennlp-docs/src/docbkx/normalizer.xml
new file mode 100644
index 000000000..d14177db1
--- /dev/null
+++ b/opennlp-docs/src/docbkx/normalizer.xml
@@ -0,0 +1,317 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V5.0//EN"
+"https://cdn.docbook.org/schema/5.0/dtd/docbook.dtd"[
+]>
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
+       license agreements. See the NOTICE file distributed with this work for additional
+       information regarding copyright ownership. The ASF licenses this file to
+       you under the Apache License, Version 2.0 (the "License"); you may not use
+       this file except in compliance with the License. You may obtain a copy of
+       the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
+       by applicable law or agreed to in writing, software distributed under the
+       License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
+       OF ANY KIND, either express or implied. See the License for the specific
+       language governing permissions and limitations under the License. -->
+
+<chapter xml:id="tools.normalizer">
+
+       <title>Text Normalization</title>
+
+       <section xml:id="tools.normalizer.introduction">
+               <title>Introduction</title>
+               <para>
+                       The package <code>opennlp.tools.util.normalizer</code> provides Unicode-aware text
+                       normalization for matching, search, and tokenization preprocessing. It cleans up the
+                       kinds of inconsistency that real text carries when it is copied from the web, PDFs,
+                       office documents, or multilingual sources: spacing that is not an ordinary space, the
+                       many dash and quotation variants, decomposed versus precomposed accents, non-ASCII
+                       digits, and invisible control characters.
+               </para>
+               <para>
+                       The implementation follows three principles:
+               </para>
+               <itemizedlist>
+                       <listitem>
+                               <para>
+                                       <emphasis role="bold">Standards-sourced.</emphasis> Membership sets come from the
+                                       Unicode Character Database (for example the <code>White_Space</code> and
+                                       <code>Dash</code> properties), not from the JVM's locale-dependent or quirky
+                                       character predicates. The library never relies on
+                                       <code>Character.isWhitespace</code>, which disagrees with the Unicode standard:
+                                       it rejects the no-break spaces (U+00A0, U+2007, U+202F) and accepts the
+                                       separator controls U+001C to U+001F, which do not carry
+                                       <code>White_Space</code>.
+                               </para>
+                       </listitem>
+                       <listitem>
+                               <para>
+                                       <emphasis role="bold">Cursor-based, no regular expressions.</emphasis> Every
+                                       operation is a single forward pass over the input that tests membership in O(1)
+                                       and advances by code point. This avoids the allocation and the catastrophic
+                                       backtracking (ReDoS) risk of regular expressions, and it correctly recognizes
+                                       Unicode whitespace that Java's <code>\s</code> does not match by default.
+                               </para>
+                       </listitem>
+                       <listitem>
+                               <para>
+                                       <emphasis role="bold">Offset-preserving.</emphasis> The original text is always
+                                       the source of truth. Normalization produces a derived form for matching while the
+                                       original character offsets are kept, so a search hit can be reported and
+                                       highlighted against the source even when the normalized form has a different
+                                       length.
+                               </para>
+                       </listitem>
+               </itemizedlist>
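+               <para>
+                       As a rough illustration of the first two principles, the hypothetical class below
+                       hard-codes the 25 <code>White_Space</code> code points in a <code>BitSet</code> and
+                       collapses whitespace in one forward pass by code point. It is only a sketch: the name
+                       <code>WhiteSpaceSketch</code> and its methods are assumptions made for this example
+                       and are not part of the library.
+               </para>
+               <programlisting language="java">
+<![CDATA[
```java
import java.util.BitSet;

/** Hypothetical sketch; not the OpenNLP implementation. */
final class WhiteSpaceSketch {

  // The 25 code points that carry the Unicode White_Space property.
  private static final BitSet WHITE_SPACE = new BitSet();

  static {
    WHITE_SPACE.set(0x0009, 0x000D + 1); // TAB, LF, VT, FF, CR
    WHITE_SPACE.set(0x0020);             // SPACE
    WHITE_SPACE.set(0x0085);             // NEXT LINE
    WHITE_SPACE.set(0x00A0);             // NO-BREAK SPACE
    WHITE_SPACE.set(0x1680);             // OGHAM SPACE MARK
    WHITE_SPACE.set(0x2000, 0x200A + 1); // EN QUAD .. HAIR SPACE
    WHITE_SPACE.set(0x2028);             // LINE SEPARATOR
    WHITE_SPACE.set(0x2029);             // PARAGRAPH SEPARATOR
    WHITE_SPACE.set(0x202F);             // NARROW NO-BREAK SPACE
    WHITE_SPACE.set(0x205F);             // MEDIUM MATHEMATICAL SPACE
    WHITE_SPACE.set(0x3000);             // IDEOGRAPHIC SPACE
  }

  /** Constant-time membership test against the fixed set above. */
  static boolean isWhiteSpace(int codePoint) {
    return WHITE_SPACE.get(codePoint);
  }

  /** Collapses each whitespace run to one ASCII space and trims both ends. */
  static String collapse(CharSequence text) {
    StringBuilder out = new StringBuilder(text.length());
    boolean pendingSpace = false;
    for (int i = 0; i < text.length(); ) {
      int cp = Character.codePointAt(text, i);
      i += Character.charCount(cp);
      if (isWhiteSpace(cp)) {
        // Remember the run, but only if some content was already written.
        pendingSpace = out.length() > 0;
      } else {
        if (pendingSpace) {
          out.append(' ');
          pendingSpace = false;
        }
        out.appendCodePoint(cp);
      }
    }
    // A pending space at the end is never written, which trims the tail.
    return out.toString();
  }

  public static void main(String[] args) {
    // The JVM predicate and the Unicode property disagree on both of these.
    System.out.println(Character.isWhitespace(0x00A0) + " vs " + isWhiteSpace(0x00A0)); // false vs true
    System.out.println(Character.isWhitespace(0x001C) + " vs " + isWhiteSpace(0x001C)); // true vs false
    System.out.println("[" + collapse(" a \u3000b ") + "]"); // [a b]
  }
}
```
+]]>
+               </programlisting>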
+               <para>
+                       There are two layers. The <code>CharSequenceNormalizer</code> family offers ready-made,
+                       composable normalizers; the <code>CharClass</code> engine is the low-level, configurable
+                       building block they are made of.
+               </para>
+       </section>
+
+       <section xml:id="tools.normalizer.normalizers">
+               <title>The normalizer family</title>
+               <para>
+                       Each normalizer implements the existing
+                       <code>opennlp.tools.util.normalizer.CharSequenceNormalizer</code> interface
+                       (<code>CharSequence normalize(CharSequence)</code>) and is a shared, stateless singleton
+                       obtained through <code>getInstance()</code>. They can therefore be combined with the
+                       existing <code>AggregateCharSequenceNormalizer</code>, or with the
+                       <code>TextNormalizer</code> builder described below.
+               </para>
+
+               <informaltable frame="all">
+                       <tgroup cols="2">
+                               <thead>
+                                       <row>
+                                               <entry>Normalizer</entry>
+                                               <entry>Effect</entry>
+                                       </row>
+                               </thead>
+                               <tbody>
+                                       <row>
+                                               <entry><code>WhitespaceCharSequenceNormalizer</code></entry>
+                                               <entry>Collapses each run of Unicode whitespace to a single ASCII space and
+                                                       trims the edges.</entry>
+                                       </row>
+                                       <row>
+                                               <entry><code>DashCharSequenceNormalizer</code></entry>
+                                               <entry>Maps every Unicode dash to the ASCII hyphen-minus. The mathematical
+                                                       minus signs and the soft hyphen are not affected.</entry>
+                                       </row>
+                                       <row>
+                                               <entry><code>QuoteCharSequenceNormalizer</code></entry>
+                                               <entry>Folds typographic single quotes and apostrophes to <code>'</code> and
+                                                       double quotes (including guillemets) to <code>"</code>.</entry>
+                                       </row>
+                                       <row>
+                                               <entry><code>DigitCharSequenceNormalizer</code></entry>
+                                               <entry>Maps Unicode decimal digits (Arabic-Indic, Devanagari, fullwidth, ...)
+                                                       to ASCII <code>0</code>-<code>9</code> by their numeric value.</entry>
+                                       </row>
+                                       <row>
+                                               <entry><code>EllipsisCharSequenceNormalizer</code></entry>
+                                               <entry>Expands the horizontal ellipsis to <code>...</code> and the two-dot
+                                                       leader to <code>..</code>.</entry>
+                                       </row>
+                                       <row>
+                                               <entry><code>BulletCharSequenceNormalizer</code></entry>
+                                               <entry>Replaces unambiguous list bullets with a space; the Catalan middle dot
+                                                       is left alone.</entry>
+                                       </row>
+                                       <row>
+                                               <entry><code>InvisibleCharSequenceNormalizer</code></entry>
+                                               <entry>Removes invisible format and bidirectional control characters (BOM,
+                                                       zero width space, bidi marks/overrides/isolates, ...). The zero width
+                                                       joiner and non-joiner and variation selectors are kept.</entry>
+                                       </row>
+                                       <row>
+                                               <entry><code>NfcCharSequenceNormalizer</code></entry>
+                                               <entry>Applies Unicode Normalization Form C (canonical composition); a safe,
+                                                       lossless baseline for matching.</entry>
+                                       </row>
+                                       <row>
+                                               <entry><code>NfkcCharSequenceNormalizer</code></entry>
+                                               <entry>Applies Unicode Normalization Form KC (compatibility composition);
+                                                       folds fullwidth forms, ligatures, and super/subscripts.</entry>
+                                       </row>
+                                       <row>
+                                               <entry><code>CaseFoldCharSequenceNormalizer</code></entry>
+                                               <entry>Lowercases text for case-insensitive matching, using
+                                                       <code>Locale.ROOT</code>.</entry>
+                                       </row>
+                                       <row>
+                                               <entry><code>AccentFoldCharSequenceNormalizer</code></entry>
+                                               <entry>Folds diacritics in a script-aware way (see below).</entry>
+                                       </row>
+                               </tbody>
+                       </tgroup>
+               </informaltable>
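+               <para>
+                       To make the "by their numeric value" mapping in the digit row concrete, the following
+                       hypothetical sketch folds decimal digits with plain JDK calls. It is an assumption-laden
+                       illustration, not the code of <code>DigitCharSequenceNormalizer</code>: for brevity it
+                       trusts the JVM's own Unicode tables, which the library may well avoid, and the name
+                       <code>DigitFoldSketch</code> is invented for this example.
+               </para>
+               <programlisting language="java">
+<![CDATA[
```java
/** Hypothetical sketch; not the OpenNLP implementation. */
final class DigitFoldSketch {

  /** Replaces every Unicode decimal digit (category Nd) with its ASCII digit. */
  static String foldDigits(CharSequence text) {
    StringBuilder out = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); ) {
      int cp = Character.codePointAt(text, i);
      i += Character.charCount(cp);
      if (Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER) {
        // Character.digit yields the numeric value 0..9 for any Nd code point.
        out.append((char) ('0' + Character.digit(cp, 10)));
      } else {
        out.appendCodePoint(cp);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Arabic-Indic three, Devanagari five, fullwidth seven.
    System.out.println(foldDigits("\u0663\u096B\uFF17")); // 357
  }
}
```
+]]>
+               </programlisting>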
+
+               <para>
+                       A single normalizer is applied directly:
+               </para>
+               <programlisting language="java">
+<![CDATA[CharSequenceNormalizer ws = WhitespaceCharSequenceNormalizer.getInstance();
+String clean = ws.normalize("a 　b").toString();   // "a b"
+
+String hyphen = DashCharSequenceNormalizer.getInstance()
+    .normalize("state—of–the‐art").toString(); // "state-of-the-art"]]>
+               </programlisting>
+       </section>
+
+       <section xml:id="tools.normalizer.pipeline">
+               <title>Composing a pipeline</title>
+               <para>
+                       <code>TextNormalizer</code> is a fluent builder that composes the individual
+                       normalization steps, in the order they are added, into a single
+                       <code>CharSequenceNormalizer</code>:
+               </para>
+               <programlisting language="java">
+<![CDATA[CharSequenceNormalizer pipeline = TextNormalizer.builder()
+    .nfc()
+    .caseFold()
+    .accentFold()
+    .build();
+
+String term = pipeline.normalize("CAFÉ").toString();   // "cafe"]]>
+               </programlisting>
+               <para>
+                       A conservative search-oriented chain (strip invisibles, NFC, collapse whitespace, fold
+                       quotes and dashes, case fold, then script-gated accent fold) is available directly:
+               </para>
+               <programlisting language="java">
+<![CDATA[CharSequenceNormalizer search = TextNormalizer.searchDefault();
+
+// byte order mark stripped, curly quotes folded, case and accent folded
+String t = search.normalize("“CafÉ”").toString();   // "\"cafe\""]]>
+               </programlisting>
+               <para>
+                       Any custom <code>CharSequenceNormalizer</code> can be inserted with
+                       <code>with(...)</code>. None of these normalizers is applied automatically by any OpenNLP
+                       component; normalization is always an explicit, opt-in choice.
+               </para>
+       </section>
+
+       <section xml:id="tools.normalizer.accentfold">
+               <title>Diacritic folding and multilingual safety</title>
+               <para>
+                       <code>AccentFoldCharSequenceNormalizer</code> folds accents for search, but does so in a
+                       script-aware way that a Latin-only folding filter cannot. It decomposes the text, then
+                       drops nonspacing combining marks only for base characters whose script is configured for
+                       folding (Latin, Greek, and Cyrillic by default). Combining marks on other scripts are
+                       left untouched, because there they are essential orthography rather than decoration:
+                       dropping an Indic vowel sign or virama, an Arabic harakat, a Hebrew point, or a Thai
+                       vowel would change the word.
+               </para>
+               <programlisting language="java">
+<![CDATA[CharSequenceNormalizer fold = AccentFoldCharSequenceNormalizer.getInstance();
+
+fold.normalize("café");   // "cafe"    (Latin accent folded)
+fold.normalize("ά");      // "α"       (Greek alpha with tonos -> alpha)
+fold.normalize("का");     // unchanged (Devanagari is left intact)]]>
+               </programlisting>
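+               <para>
+                       The script gate described above can be approximated with the JDK alone. The sketch
+                       below is a hypothetical reconstruction from that description, not the actual
+                       <code>AccentFoldCharSequenceNormalizer</code> code: it decomposes to NFD, remembers the
+                       script of the most recent base character, drops nonspacing marks only when that script
+                       is in the folded set, and recomposes to NFC. The class name and the final NFC step are
+                       assumptions, and the atomic-letter mapping (<code>ø</code>, <code>ß</code>, ...) is
+                       deliberately left out.
+               </para>
+               <programlisting language="java">
+<![CDATA[
```java
import java.lang.Character.UnicodeScript;
import java.text.Normalizer;
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch; not the OpenNLP implementation. */
final class ScriptGatedFoldSketch {

  // Scripts whose nonspacing marks are treated as removable decoration.
  private static final Set<UnicodeScript> FOLDED =
      EnumSet.of(UnicodeScript.LATIN, UnicodeScript.GREEK, UnicodeScript.CYRILLIC);

  static String fold(CharSequence text) {
    String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
    StringBuilder out = new StringBuilder(nfd.length());
    boolean dropMarks = false; // decided by the most recent base character
    for (int i = 0; i < nfd.length(); ) {
      int cp = nfd.codePointAt(i);
      i += Character.charCount(cp);
      if (Character.getType(cp) == Character.NON_SPACING_MARK) {
        if (dropMarks) {
          continue; // mark sits on a Latin, Greek, or Cyrillic base
        }
      } else {
        dropMarks = FOLDED.contains(UnicodeScript.of(cp));
      }
      out.appendCodePoint(cp);
    }
    // Recompose whatever was kept so untouched scripts come back unchanged.
    return Normalizer.normalize(out, Normalizer.Form.NFC);
  }

  public static void main(String[] args) {
    System.out.println(fold("caf\u00E9"));                           // cafe
    System.out.println(fold("\u03AC").equals("\u03B1"));             // true
    System.out.println(fold("\u0915\u0941").equals("\u0915\u0941")); // true
  }
}
```
+]]>
+               </programlisting>
+               <para>
+                       In this sketch the Devanagari vowel sign U+0941 is a nonspacing mark just like the
+                       Latin acute accent, so it is the script of the base character, not the category of
+                       the mark, that decides whether it survives.
+               </para>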
+               <para>
+                       Atomic Latin letters that do not decompose are mapped to an ASCII approximation by
+                       default: for example the stroke letters and ligatures, eszett, and thorn
+                       (<code>ø -&gt; o</code>, <code>æ -&gt; ae</code>, <code>ß -&gt; ss</code>,
+                       <code>þ -&gt; th</code>). Both behaviors are configurable through the constructor:
+               </para>
+               <programlisting language="java">
+<![CDATA[// fold only Latin, and do not map the stroke letters
+CharSequenceNormalizer latinOnly = new AccentFoldCharSequenceNormalizer(
+    java.util.Set.of(Character.UnicodeScript.LATIN), false);]]>
+               </programlisting>
+               <para>
+                       Diacritic folding is a recall optimization, not a linguistically correct transform, so it
+                       is intended for a search or matching form rather than for display. Language-specific case
+                       and letter rules (for example German <code>DIN</code> umlaut expansion, or the Turkish
+                       dotless-i) are out of scope for the default folder and should be applied with an explicit
+                       locale upstream.
+               </para>
+       </section>
+
+       <section xml:id="tools.normalizer.charclass">
+               <title>The CharClass engine and code point sets</title>
+               <para>
+                       The set-based normalizers are built on <code>CharClass</code>, a configurable class of
+                       Unicode code points paired with a single canonical replacement, backed by a
+                       <code>CodePointSet</code> with O(1) membership. Whitespace and dashes are the two built-in
+                       presets, and any other class is one more configured instance:
+               </para>
+               <programlisting language="java">
+<![CDATA[CharClass ws = CharClass.whitespace();   // Unicode White_Space -> U+0020
+CharClass dash = CharClass.dashes();      // Unicode Dash (curated) -> U+002D
+
+ws.collapse("a   b");                      // "a b"  (runs -> one space)
+ws.trim("  hi  ");                         // "hi"
+String[] tokens = ws.split("one two");     // ["one", "two"]  (offset-aware via splitSpans)
+dash.normalize("a—b");                     // "a-b"]]>
+               </programlisting>
+               <para>
+                       A <code>CodePointSet</code> can be built explicitly, as a range, by union, or loaded from
+                       a user definitions file so that delimiters can be extended without a code change. The
+                       file is line oriented and parsed with the same cursor approach (no regular expression): a
+                       <code>[name]</code> line opens a section, a <code>#</code> begins a comment, and each
+                       remaining line is a hex code point or an inclusive range.
+               </para>
+               <programlisting language="none">
+<![CDATA[[whitespace]
+U+00A0          # no-break space
+U+2000-U+200A   # typographic spaces
+
+[dash]
+U+2E5D          # oblique hyphen]]>
+               </programlisting>
+               <programlisting language="java">
+<![CDATA[CodePointSet extra = CodePointSet.fromFile(path, "whitespace");
+CharClass wsPlus = CharClass.whitespace().withAdditional(extra);]]>
+               </programlisting>
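+               <para>
+                       For readers who want to see how such a file can be read without regular expressions,
+                       here is a minimal stand-alone sketch. It is not the parser behind
+                       <code>CodePointSet.fromFile</code>: the class name, the use of <code>BitSet</code>, the
+                       error handling, and the tolerance for surrounding blanks are all assumptions. It uses
+                       only <code>indexOf</code> and <code>substring</code> to walk the text.
+               </para>
+               <programlisting language="java">
+<![CDATA[
```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not the OpenNLP implementation. */
final class DefinitionsParserSketch {

  /** Parses the text of a definitions file into one code point set per section. */
  static Map<String, BitSet> parse(String text) {
    Map<String, BitSet> sections = new LinkedHashMap<>();
    BitSet current = null;
    int pos = 0;
    while (pos < text.length()) {
      // Cut the next line with indexOf instead of a regex-based split.
      int eol = text.indexOf('\n', pos);
      if (eol < 0) {
        eol = text.length();
      }
      String line = text.substring(pos, eol);
      pos = eol + 1;

      int hash = line.indexOf('#');
      if (hash >= 0) {
        line = line.substring(0, hash); // strip the trailing comment
      }
      line = line.trim();
      if (line.isEmpty()) {
        continue;
      }
      if (line.charAt(0) == '[' && line.charAt(line.length() - 1) == ']') {
        String name = line.substring(1, line.length() - 1).trim();
        current = sections.computeIfAbsent(name, k -> new BitSet());
        continue;
      }
      if (current == null) {
        throw new IllegalArgumentException("entry before any [section]: " + line);
      }
      // Either "U+XXXX" or the inclusive range "U+XXXX-U+YYYY".
      int dash = line.indexOf('-');
      int first = codePoint(dash < 0 ? line : line.substring(0, dash));
      int last = dash < 0 ? first : codePoint(line.substring(dash + 1));
      if (last < first) {
        throw new IllegalArgumentException("descending range: " + line);
      }
      current.set(first, last + 1);
    }
    return sections;
  }

  private static int codePoint(String token) {
    String t = token.trim();
    if (!t.startsWith("U+")) {
      throw new IllegalArgumentException("expected U+XXXX but found: " + token);
    }
    int cp = Integer.parseInt(t.substring(2), 16);
    if (!Character.isValidCodePoint(cp)) {
      throw new IllegalArgumentException("not a code point: " + token);
    }
    return cp;
  }

  public static void main(String[] args) {
    String defs = "[whitespace]\nU+00A0  # no-break space\nU+2000-U+200A\n\n[dash]\nU+2E5D\n";
    Map<String, BitSet> sets = parse(defs);
    System.out.println(sets.get("whitespace").cardinality()); // 12
    System.out.println(sets.get("dash").get(0x2E5D));         // true
  }
}
```
+]]>
+               </programlisting>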
+       </section>
+
+       <section xml:id="tools.normalizer.analyzer">
+               <title>Offset-preserving analysis for search</title>
+               <para>
+                       <code>TextAnalyzer</code> tokenizes text and normalizes each token while keeping every
+                       token's source span. This is the building block for BM25-style matching: the normalized
+                       term is what you index or query, and the <code>Span</code> ties it back to the original
+                       text for highlighting, even when normalization changes a token's length.
+               </para>
+               <programlisting language="java">
+<![CDATA[CharSequenceNormalizer perToken = TextNormalizer.builder().caseFold().accentFold().build();
+TextAnalyzer analyzer = TextAnalyzer.whitespace(perToken);
+
+for (AnalyzedToken token : analyzer.analyze("Café au lait")) {
+  // token.span()       -> character span in the original text
+  // token.original()   -> the raw token, e.g. "Café"
+  // token.normalized() -> the search term, e.g. "cafe"
+}
+
+List<String> terms = analyzer.terms("Café au lait");   // ["cafe", "au", "lait"]]]>
+               </programlisting>
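+               <para>
+                       The idea of keeping source spans while the term changes length can be shown without
+                       the library. The following is a simplified, hypothetical sketch: the
+                       <code>SpanTokenSketch</code> class, its <code>Token</code> type, the split on ASCII
+                       spaces only, and the non-script-gated mark stripping are all assumptions chosen to
+                       keep the example short, and none of them describes <code>TextAnalyzer</code> itself.
+               </para>
+               <programlisting language="java">
+<![CDATA[
```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch; not the OpenNLP implementation. */
final class SpanTokenSketch {

  /** A token with its [start, end) span in the source and its search form. */
  static final class Token {
    final int start;
    final int end;
    final String original;
    final String normalized;

    Token(int start, int end, String original, String normalized) {
      this.start = start;
      this.end = end;
      this.original = original;
      this.normalized = normalized;
    }
  }

  /** Splits on ASCII spaces and normalizes each token, never the whole text. */
  static List<Token> analyze(String text) {
    List<Token> tokens = new ArrayList<>();
    int i = 0;
    while (i < text.length()) {
      while (i < text.length() && text.charAt(i) == ' ') {
        i++; // skip the delimiter run
      }
      int start = i;
      while (i < text.length() && text.charAt(i) != ' ') {
        i++;
      }
      if (i > start) {
        String original = text.substring(start, i);
        tokens.add(new Token(start, i, original, normalize(original)));
      }
    }
    return tokens;
  }

  /** Simplified term form: lowercase, decompose, drop nonspacing marks. */
  static String normalize(String token) {
    String nfd = Normalizer.normalize(token.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
    StringBuilder out = new StringBuilder(nfd.length());
    for (int i = 0; i < nfd.length(); ) {
      int cp = nfd.codePointAt(i);
      i += Character.charCount(cp);
      if (Character.getType(cp) != Character.NON_SPACING_MARK) {
        out.appendCodePoint(cp);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // The first token is written with a combining acute accent (5 chars).
    for (Token t : analyze("Cafe\u0301 au lait")) {
      System.out.println(t.start + "-" + t.end + " " + t.normalized);
    }
  }
}
```
+]]>
+               </programlisting>
+               <para>
+                       Because the span is recorded before the token is normalized, the first token in this
+                       sketch still covers five characters of the source although its term has only four.
+               </para>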
+       </section>
+
+       <section xml:id="tools.normalizer.reference">
+               <title>Reference data</title>
+               <para>
+                       The underlying Unicode data is also available directly as immutable reference tables,
+                       with O(1) membership tests that match the Unicode standard:
+               </para>
+               <itemizedlist>
+                       <listitem>
+                               <para>
+                                       <code>UnicodeWhitespace</code> lists the 25 characters carrying the
+                                       <code>White_Space</code> property, plus the related look-alike format characters
+                                       (zero width space, byte order mark, ...) that are <emphasis>not</emphasis>
+                                       whitespace. It exposes <code>isWhitespace(int)</code>,
+                                       <code>byCodePoint(int)</code>, and helpers for the line breaks and the
+                                       non-breaking spaces.
+                               </para>
+                       </listitem>
+                       <listitem>
+                               <para>
+                                       <code>UnicodeDash</code> lists every code point carrying the <code>Dash</code>
+                                       property, distinguishing the mathematical minus signs that are excluded from the
+                                       default normalization set.
+                               </para>
+                       </listitem>
+               </itemizedlist>
+       </section>
+
+</chapter>
diff --git a/opennlp-docs/src/docbkx/opennlp.xml b/opennlp-docs/src/docbkx/opennlp.xml
index 67eb1edf1..843bfbc9b 100644
--- a/opennlp-docs/src/docbkx/opennlp.xml
+++ b/opennlp-docs/src/docbkx/opennlp.xml
@@ -101,6 +101,7 @@ under the License.
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./langdetect.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./sentdetect.xml"/>
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./tokenizer.xml" />
+       <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./normalizer.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./stopword.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./namefinder.xml" />
        <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./doccat.xml" />
