This is an automated email from the ASF dual-hosted git repository.

krickert pushed a commit to branch OPENNLP-1850-4-docs
in repository https://gitbox.apache.org/repos/asf/opennlp.git
commit 3037db7b21d1acfe96102f5be3baeb8e22afa5e9
Author: Kristian Rickert <[email protected]>
AuthorDate: Sat Jun 20 08:06:57 2026 -0400

    OPENNLP-1850 Document Unicode normalization and the UAX #29 tokenizer

    Adds the normalizer manual chapter and updates the tokenizer, doccat,
    namefinder, and introduction chapters (and the master opennlp.xml) to
    cover the new normalization pipeline and word tokenizer.
---
 opennlp-docs/src/docbkx/doccat.xml       |  18 ++
 opennlp-docs/src/docbkx/introduction.xml |   3 +-
 opennlp-docs/src/docbkx/namefinder.xml   |  27 +-
 opennlp-docs/src/docbkx/normalizer.xml   | 532 +++++++++++++++++++++++++++++++
 opennlp-docs/src/docbkx/opennlp.xml      |   1 +
 opennlp-docs/src/docbkx/tokenizer.xml    |  91 +++++-
 6 files changed, 669 insertions(+), 3 deletions(-)

diff --git a/opennlp-docs/src/docbkx/doccat.xml b/opennlp-docs/src/docbkx/doccat.xml
index 7d03f1c2a..e12186ec4 100644
--- a/opennlp-docs/src/docbkx/doccat.xml
+++ b/opennlp-docs/src/docbkx/doccat.xml
@@ -171,6 +171,24 @@ String category = myCategorizer.getBestCategory(outcomes);]]>
  </programlisting>
  For additional examples, refer to the <code>DocumentCategorizerDLEval</code> class.
  </para>
+ <para>
+ Like <code>NameFinderDL</code>, long input is split into overlapping chunks on the full
+ Unicode <code>White_Space</code> set rather than Java's <code>\s</code>, so text copied
+ from PDFs, the web, or multilingual sources tokenizes consistently. Optional
+ preprocessing through <code>InferenceOptions</code> is off by default:
+ <code>setNormalizeWhitespace(true)</code> maps each Unicode whitespace code point to an
+ ASCII space, and <code>setNormalizeDashes(true)</code> maps Unicode dashes to the ASCII
+ hyphen-minus. Both are one-to-one replacements that preserve character offsets. See
+ <xref linkend="tools.normalizer"/> for the shared <code>CharClass</code> engine and the
+ full normalization library.
+ </para>
+ <programlisting language="java">
+<![CDATA[InferenceOptions options = new InferenceOptions();
+options.setNormalizeWhitespace(true);
+options.setNormalizeDashes(true);
+DocumentCategorizerDL categorizer = new DocumentCategorizerDL(
+ model, vocab, categories, scoringStrategy, options);]]>
+ </programlisting>
  </section>
  </section>

diff --git a/opennlp-docs/src/docbkx/introduction.xml b/opennlp-docs/src/docbkx/introduction.xml
index e7ac5c7c3..82e53cccb 100644
--- a/opennlp-docs/src/docbkx/introduction.xml
+++ b/opennlp-docs/src/docbkx/introduction.xml
@@ -303,7 +303,8 @@ Arguments description:
  and <xref linkend="tools.doccat">Document Categorizer</xref>. This allows models trained by other frameworks such
  as PyTorch and Tensorflow to be used by OpenNLP. The documentation for each
  of the OpenNLP components that supports ONNX models describes how to
- use ONNX models for inference.
+ use ONNX models for inference. DL inference uses Unicode-aware text chunking and
+ optional input normalization; see <xref linkend="tools.normalizer.dl"/>.
  </para>
  <note>
  <para>
diff --git a/opennlp-docs/src/docbkx/namefinder.xml b/opennlp-docs/src/docbkx/namefinder.xml
index ff695d898..6c2c759c0 100644
--- a/opennlp-docs/src/docbkx/namefinder.xml
+++ b/opennlp-docs/src/docbkx/namefinder.xml
@@ -157,11 +157,36 @@ Span[] nameSpans = nameFinder.find(sentence);]]>
 File vocab = new File("/path/to/vocab.txt");
 Map<Integer, String> categories = new HashMap<>();
 String[] tokens = new String[]{"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
-NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, false, getIds2Labels());
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, getIds2Labels(), sentenceDetector);
 Span[] spans = nameFinderDL.find(tokens);]]>
  </programlisting>
  For additional examples, refer to the <code>NameFinderDLEval</code> class.
  </para>
+ <para>
+ Long input text is split into overlapping chunks on the full Unicode
+ <code>White_Space</code> set before WordPiece tokenization, so spacing such as a
+ no-break space or the CJK ideographic space is recognized as a delimiter. After
+ inference, reconstructed entity text is matched back to the caller's original input
+ with a Unicode-aware cursor scan (not a regular expression), so
+ <code>Span#getCoveredText(...)</code> returns the source text even when WordPiece
+ rejoins sub-tokens with spaces or when the source uses non-ASCII whitespace between
+ tokens.
+ </para>
+ <para>
+ Optional preprocessing of the joined input text is available through
+ <code>InferenceOptions</code> and is off by default:
+ <code>setNormalizeWhitespace(true)</code> folds each Unicode whitespace character to
+ an ASCII space, and <code>setNormalizeDashes(true)</code> folds Unicode dashes to the
+ ASCII hyphen-minus. Both transforms are one code point to one character and preserve
+ offsets. Full details, the underlying <code>CharClass</code> engine, and the broader
+ normalization pipeline are documented in <xref linkend="tools.normalizer"/>.
+ </para>
+ <programlisting language="java">
+<![CDATA[InferenceOptions options = new InferenceOptions();
+options.setNormalizeWhitespace(true);
+options.setNormalizeDashes(true);
+NameFinderDL finder = new NameFinderDL(model, vocab, ids2Labels, options, sentenceDetector);]]>
+ </programlisting>
  </section>
  </section>
  </section>
diff --git a/opennlp-docs/src/docbkx/normalizer.xml b/opennlp-docs/src/docbkx/normalizer.xml
new file mode 100644
index 000000000..55376f538
--- /dev/null
+++ b/opennlp-docs/src/docbkx/normalizer.xml
@@ -0,0 +1,532 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V5.0//EN"
+"https://cdn.docbook.org/schema/5.0/dtd/docbook.dtd"[
+]>
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
+ license agreements. See the NOTICE file distributed with this work for additional
+ information regarding copyright ownership. The ASF licenses this file to
+ you under the Apache License, Version 2.0 (the "License"); you may not use
+ this file except in compliance with the License. You may obtain a copy of
+ the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
+ by applicable law or agreed to in writing, software distributed under the
+ License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
+ OF ANY KIND, either express or implied. See the License for the specific
+ language governing permissions and limitations under the License. -->
+
+<chapter xml:id="tools.normalizer">
+
+ <title>Text Normalization</title>
+
+ <section xml:id="tools.normalizer.introduction">
+ <title>Introduction</title>
+ <para>
+ The package <code>opennlp.tools.util.normalizer</code> provides Unicode-aware text
+ normalization for matching, search, and tokenization preprocessing. It cleans up the
+ kinds of inconsistency that real text carries when it is copied from the web, PDFs,
+ office documents, or multilingual sources: spacing that is not an ordinary space, the
+ many dash and quotation variants, decomposed versus precomposed accents, non-ASCII
+ digits, and invisible control characters.
+ </para>
+ <para>
+ The implementation follows three principles:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <emphasis role="bold">Standards-sourced.</emphasis> Membership sets come from the
+ Unicode Character Database (for example the <code>White_Space</code> and
+ <code>Dash</code> properties), not from the JVM's locale-dependent or quirky
+ character predicates. The library never relies on
+ <code>Character.isWhitespace</code>, which disagrees with the Unicode standard.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis role="bold">Cursor-based, no regular expressions.</emphasis> Every
+ operation is a single forward pass over the input that tests membership in O(1)
+ and advances by code point. This avoids the allocation and the catastrophic
+ backtracking (ReDoS) risk of regular expressions, and it correctly recognizes
+ Unicode characters that Java's <code>\s</code> does not.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis role="bold">Offset-preserving.</emphasis> The original text is always
+ the source of truth. Normalization produces a derived form for matching while the
+ original character offsets are kept, so a search hit can be reported and
+ highlighted against the source even when the normalized form has a different
+ length.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ Two engines underpin everything: the <code>CharSequenceNormalizer</code> family offers
+ ready-made, composable normalizers, and the <code>CharClass</code> engine is the low-level,
+ configurable building block they are made of. Built on these are three higher-level
+ features documented below: a layered term model that projects a token through a
+ configurable stack of transforms while keeping every intermediate form (see
+ <xref linkend="tools.normalizer.term"/>), per-language profiles that select the transforms
+ appropriate to a language (see <xref linkend="tools.normalizer.language"/>), and confusable
+ folding that reduces lookalike characters for matching (see
+ <xref linkend="tools.normalizer.confusables"/>).
+ </para>
+ </section>
+
+ <section xml:id="tools.normalizer.normalizers">
+ <title>The normalizer family</title>
+ <para>
+ Each normalizer implements the existing
+ <code>opennlp.tools.util.normalizer.CharSequenceNormalizer</code> interface
+ (<code>CharSequence normalize(CharSequence)</code>) and is a shared, stateless singleton
+ obtained through <code>getInstance()</code>. They can therefore be combined with the
+ existing <code>AggregateCharSequenceNormalizer</code>, or with the
+ <code>TextNormalizer</code> builder described below.
+ </para>
+
+ <informaltable frame="all">
+ <tgroup cols="2">
+ <thead>
+ <row>
+ <entry>Normalizer</entry>
+ <entry>Effect</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><code>WhitespaceCharSequenceNormalizer</code></entry>
+ <entry>Collapses each run of Unicode whitespace to a single ASCII space and
+ trims the edges.</entry>
+ </row>
+ <row>
+ <entry><code>DashCharSequenceNormalizer</code></entry>
+ <entry>Maps every Unicode dash to the ASCII hyphen-minus. The mathematical
+ minus signs and the soft hyphen are not affected.</entry>
+ </row>
+ <row>
+ <entry><code>QuoteCharSequenceNormalizer</code></entry>
+ <entry>Folds typographic single quotes and apostrophes to <code>'</code> and
+ double quotes (including guillemets) to <code>"</code>.</entry>
+ </row>
+ <row>
+ <entry><code>DigitCharSequenceNormalizer</code></entry>
+ <entry>Maps Unicode decimal digits (Arabic-Indic, Devanagari, fullwidth, ...)
+ to ASCII <code>0</code>-<code>9</code> by their numeric value.</entry>
+ </row>
+ <row>
+ <entry><code>EllipsisCharSequenceNormalizer</code></entry>
+ <entry>Expands the horizontal ellipsis to <code>...</code> and the two-dot
+ leader to <code>..</code></entry>
+ </row>
+ <row>
+ <entry><code>BulletCharSequenceNormalizer</code></entry>
+ <entry>Replaces unambiguous list bullets with a space; the Catalan middle dot
+ is left alone.</entry>
+ </row>
+ <row>
+ <entry><code>InvisibleCharSequenceNormalizer</code></entry>
+ <entry>Removes invisible format and bidirectional control characters (BOM,
+ zero width space, bidi marks/overrides/isolates, ...). The zero width
+ joiner and non-joiner and variation selectors are kept.</entry>
+ </row>
+ <row>
+ <entry><code>NfcCharSequenceNormalizer</code></entry>
+ <entry>Applies Unicode Normalization Form C (canonical composition); a safe,
+ lossless baseline for matching.</entry>
+ </row>
+ <row>
+ <entry><code>NfkcCharSequenceNormalizer</code></entry>
+ <entry>Applies Unicode Normalization Form KC (compatibility composition);
+ folds fullwidth forms, ligatures, and super/subscripts.</entry>
+ </row>
+ <row>
+ <entry><code>CaseFoldCharSequenceNormalizer</code></entry>
+ <entry>Lower cases for case-insensitive matching, using
+ <code>Locale.ROOT</code>.</entry>
+ </row>
+ <row>
+ <entry><code>AccentFoldCharSequenceNormalizer</code></entry>
+ <entry>Folds diacritics in a script-aware way (see below).</entry>
+ </row>
+ <row>
+ <entry><code>GermanUmlautCharSequenceNormalizer</code></entry>
+ <entry>Transliterates German umlauts and the eszett (a-umlaut to <code>ae</code>,
+ eszett to <code>ss</code>; DIN 5007-2).</entry>
+ </row>
+ <row>
+ <entry><code>ConfusableSkeletonCharSequenceNormalizer</code></entry>
+ <entry>Reduces lookalike characters to a confusable skeleton for matching
+ (UTS #39); see below.</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
+
+ <para>
+ A single normalizer is applied directly:
+ </para>
+ <programlisting language="java">
+<![CDATA[CharSequenceNormalizer ws = WhitespaceCharSequenceNormalizer.getInstance();
+String clean = ws.normalize("a b").toString(); // "a b"
+
+String hyphen = DashCharSequenceNormalizer.getInstance()
+ .normalize("state—of–the‐art").toString(); // "state-of-the-art"]]>
+ </programlisting>
+ </section>
+
+ <section xml:id="tools.normalizer.pipeline">
+ <title>Composing a pipeline</title>
+ <para>
+ <code>TextNormalizer</code> is a fluent builder that composes the rungs, in the order
+ they are added, into a single <code>CharSequenceNormalizer</code>:
+ </para>
+ <programlisting language="java">
+<![CDATA[CharSequenceNormalizer pipeline = TextNormalizer.builder()
+ .nfc()
+ .caseFold()
+ .accentFold()
+ .build();
+
+String term = pipeline.normalize("CAFÉ").toString(); // "cafe"]]>
+ </programlisting>
+ <para>
+ A conservative search-oriented chain (strip invisibles, NFC, collapse whitespace, fold
+ quotes and dashes, case fold, then script-gated accent fold) is available directly:
+ </para>
+ <programlisting language="java">
+<![CDATA[CharSequenceNormalizer search = TextNormalizer.searchDefault();
+
+// byte order mark stripped, curly quotes folded, case and accent folded
+String t = search.normalize("“CafÉ”").toString(); // "\"cafe\""]]>
+ </programlisting>
+ <para>
+ Any custom <code>CharSequenceNormalizer</code> can be inserted with
+ <code>with(...)</code>. The <code>TextNormalizer</code> pipeline and the individual
+ <code>CharSequenceNormalizer</code> implementations are not applied automatically by
+ statistical OpenNLP components; callers compose them explicitly when preprocessing text
+ for search or matching. The DL components described in the next section use a narrower,
+ built-in subset of this machinery.
+ </para>
+ </section>
+
+ <section xml:id="tools.normalizer.dl">
+ <title>Use in DL components</title>
+ <para>
+ <code>NameFinderDL</code> and <code>DocumentCategorizerDL</code> share Unicode-aware text
+ handling through <code>AbstractDL</code>. Long inputs are split into overlapping chunks
+ on the full Unicode <code>White_Space</code> set (no-break space, ideographic space, line
+ and paragraph separators, and the other members listed under
+ <xref linkend="tools.normalizer.reference"/>), not on Java's six-character
+ <code>\s</code> subset. Empty tokens from leading, trailing, or repeated whitespace are
+ not produced.
+ </para>
+ <para>
+ <code>NameFinderDL</code> additionally locates reconstructed entity text in the original
+ input with a cursor-based matcher: a space in the reconstructed span matches zero or more
+ Unicode whitespace code points in the source, and every other code point is compared
+ case-insensitively. This replaces the previous regular-expression approach and correctly
+ handles spacing copied from PDFs, the web, or non-Latin sources when resolving
+ <code>Span#getCoveredText(...)</code>.
+ </para>
+ <para>
+ Optional input folding is controlled through <code>InferenceOptions</code> and is
+ <emphasis>off by default</emphasis> so existing models keep their prior inputs unless
+ you opt in:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <code>setNormalizeWhitespace(true)</code> maps each Unicode whitespace code point
+ to a single ASCII space before inference. The transform is one code point to one
+ space, so character offsets stay aligned with the input.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <code>setNormalizeDashes(true)</code> maps each dash in the default
+ <code>CharClass.dashes()</code> set to the ASCII hyphen-minus. Mathematical minus
+ signs and the soft hyphen are not affected unless you extend the set explicitly.
+ This replacement is also one code point to one character for Basic Multilingual
+ Plane dashes.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ Run-collapsing normalization (for example <code>WhitespaceCharSequenceNormalizer</code>,
+ which collapses whitespace <emphasis>runs</emphasis> to a single space) is not enabled
+ through these flags because it would shift character offsets. Use the
+ <code>CharSequenceNormalizer</code> pipeline directly when you need that behavior on text
+ that does not require offset-preserving span lookup. See also
+ <xref linkend="tools.namefind.api.onnx"/> and
+ <xref linkend="tools.doccat.api.onnx"/>.
+ </para>
+ <programlisting language="java">
+<![CDATA[InferenceOptions options = new InferenceOptions();
+options.setNormalizeWhitespace(true); // opt-in: NBSP, ideographic space, ... -> ASCII space
+options.setNormalizeDashes(true); // opt-in: en dash, em dash, ... -> hyphen-minus
+
+NameFinderDL finder = new NameFinderDL(model, vocab, ids2Labels, options, sentenceDetector);]]>
+ </programlisting>
+ </section>
+
+ <section xml:id="tools.normalizer.accentfold">
+ <title>Diacritic folding and multilingual safety</title>
+ <para>
+ <code>AccentFoldCharSequenceNormalizer</code> folds accents for search, but does so in a
+ script-aware way that a Latin-only folding filter cannot. It decomposes the text, then
+ drops nonspacing combining marks only for base characters whose script is configured for
+ folding (Latin, Greek, and Cyrillic by default). Combining marks on other scripts are
+ left untouched, because there they are essential orthography rather than decoration:
+ dropping an Indic vowel sign or virama, an Arabic harakat, a Hebrew point, or a Thai
+ vowel would change the word.
+ </para>
+ <programlisting language="java">
+<![CDATA[CharSequenceNormalizer fold = AccentFoldCharSequenceNormalizer.getInstance();
+
+fold.normalize("café"); // "cafe" (Latin accent folded)
+fold.normalize("ά"); // "α" (Greek alpha-with-tonos -> alpha)
+fold.normalize("का"); // unchanged (Devanagari is left intact)]]>
+ </programlisting>
+ <para>
+ Atomic Latin letters that do not decompose are mapped to an ASCII approximation by
+ default: for example the stroke letters and ligatures, eszett, and thorn
+ (<code>ø -> o</code>, <code>æ -> ae</code>, <code>ß -> ss</code>,
+ <code>þ -> th</code>). Both behaviors are configurable through the constructor:
+ </para>
+ <programlisting language="java">
+<![CDATA[// fold only Latin, and do not map the stroke letters
+CharSequenceNormalizer latinOnly = new AccentFoldCharSequenceNormalizer(
+ java.util.Set.of(Character.UnicodeScript.LATIN), false);]]>
+ </programlisting>
+ <para>
+ Diacritic folding is a recall optimization, not a linguistically correct transform, so it
+ is intended for a search or matching form rather than for display. Language-specific case
+ and letter rules (for example German <code>DIN</code> umlaut expansion, or the Turkish
+ dotless-i) are out of scope for the default folder and should be applied with an explicit
+ locale upstream.
+ </para>
+ </section>
+
+ <section xml:id="tools.normalizer.charclass">
+ <title>The CharClass engine and code point sets</title>
+ <para>
+ The set-based normalizers are built on <code>CharClass</code>, a configurable class of
+ Unicode code points paired with a single canonical replacement, backed by a
+ <code>CodePointSet</code> with O(1) membership. You choose both the membership and the
+ replacement code point with <code>CharClass.of(members, replacement)</code>; whitespace and
+ dashes are the two built-in presets, and any other class is one more configured instance:
+ </para>
+ <programlisting language="java">
+<![CDATA[CharClass ws = CharClass.whitespace(); // Unicode White_Space -> U+0020
+CharClass dash = CharClass.dashes(); // Unicode Dash (curated) -> U+002D
+
+ws.collapse("a b"); // "a b" (runs -> one space)
+ws.trim(" hi "); // "hi"
+String[] tokens = ws.split("one two"); // ["one", "two"] (offset-aware via splitSpans)
+dash.normalize("a—b"); // "a-b"]]>
+ </programlisting>
+ <para>
+ A class applies its replacement three ways, which differ in whether they collapse runs and
+ whether they preserve character offsets:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <code>normalize(text)</code> replaces each member one-for-one with the replacement,
+ so it is length- and offset-preserving; use it when you still need spans back into
+ the original text.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <code>collapse(text)</code> reduces each maximal run of members to a single
+ replacement; it changes length, so it is a search and match transform.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <code>collapsePreserving(text, keep, keepReplacement)</code> collapses runs but emits
+ <code>keepReplacement</code> for any run containing a kept code point, which is how
+ you squish horizontal whitespace while keeping line breaks.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ So the replacement is your choice and the method picks the behavior. Folding tabs and
+ newlines to a single newline, for example, is one configured class:
+ </para>
+ <programlisting language="java">
+<![CDATA[CharClass lineFold = CharClass.of(CodePointSet.of('\n', '\t'), '\n');
+lineFold.collapse("\n\n\n\t\n"); // "\n" (the whole run folds to one newline)
+
+CharClass ws = CharClass.whitespace();
+ws.collapsePreserving(text, CodePointSet.of('\n'), '\n'); // squish spaces, keep paragraph breaks]]>
+ </programlisting>
+ <para>
+ When you need the normalized form together with a map back to the original, the
+ <code>normalizeMapped</code> and <code>collapseMapped</code> variants return a
+ <code>NormalizedText</code> that carries the offset map.
+ </para>
+ <para>
+ A <code>CodePointSet</code> can be built explicitly, as a range, by union, or loaded from
+ a user definitions file so that delimiters can be extended without a code change. The
+ file is line oriented and parsed with the same cursor approach (no regular expression): a
+ <code>[name]</code> line opens a section, a <code>#</code> begins a comment, and each
+ remaining line is a hex code point or an inclusive range.
+ </para>
+ <programlisting language="none">
+<![CDATA[[whitespace]
+U+00A0 # no-break space
+U+2000-U+200A # typographic spaces
+
+[dash]
+U+2E5D # oblique hyphen]]>
+ </programlisting>
+ <programlisting language="java">
+<![CDATA[CodePointSet extra = CodePointSet.fromFile(path, "whitespace");
+CharClass wsPlus = CharClass.whitespace().withAdditional(extra);]]>
+ </programlisting>
+ </section>
+
+ <section xml:id="tools.normalizer.term">
+ <title>The layered term model</title>
+ <para>
+ <code>TermAnalyzer</code> tokenizes text and gives each token a
+ <emphasis>stack</emphasis> of normalization layers while keeping its source span. It is the
+ offset-preserving entry point for matching and BM25-style search: the normalized form is
+ what you index or query, and the span ties every layer back to the original text for
+ highlighting, even when normalization changes a token's length. A <code>Term</code> is one
+ token projected through an ordered chain of
+ <code>Dimension</code>s: original, NFC, NFKC, whitespace, dash, case fold, accent fold,
+ confusable fold, stem, and lemma. The order is fixed because the transforms do not commute
+ (case folding then accent folding differs from the reverse). The original is always kept,
+ so aggressive folding stays safe and a match on any layer maps back to the source through
+ the token's <code>Span</code>.
+ </para>
+ <programlisting language="java">
+<![CDATA[TermAnalyzer analyzer = TermAnalyzer.builder()
+ .caseFold()
+ .stem(new PorterStemmer())
+ .build();
+
+Term term = analyzer.analyze("Running").get(0);
+// term.original() -> "Running"
+// term.normalized() -> "run" (the final configured dimension, here STEM)
+// term.peel() -> "running" (the layer below the top, O(1))
+// term.at(Dimension.NFC) -> computed lazily on first request, then cached]]>
+ </programlisting>
+ <para>
+ Segmentation uses the <xref linkend="tools.tokenizer.uax29"/> word tokenizer, so the input
+ does not need to be pre-tokenized. The dimensions named in the builder are computed eagerly;
+ any other dimension is computed on first request, applied on top of the final form, and
+ cached, so querying a configured layer or peeling the last one is O(1) and adding an
+ unrequested dimension costs one transform. The character-level dimensions have built-in
+ defaults; <code>STEM</code> and <code>LEMMA</code> require a
+ <code>Stemmer</code> or <code>Lemmatizer</code> (and <code>LEMMA</code> a part-of-speech
+ tag), and fail loudly if requested without them. An analyzer configured with a stemmer is
+ not thread-safe, because the Snowball stemmers are stateful.
+ </para>
+ <para>
+ Each dimension's transform is configurable on the builder. Beyond the no-argument methods
+ that enable a dimension with its default, there are convenience methods for the common
+ knobs, and a general <code>transform(dimension, normalizer)</code> escape hatch for any
+ character-level dimension:
+ </para>
+ <programlisting language="java">
+<![CDATA[TermAnalyzer analyzer = TermAnalyzer.builder()
+ .whitespace(CharClass.of(CodePointSet.of('\n', '\t'), '\n')::collapse) // custom target/behavior
+ .caseFold(Locale.forLanguageTag("tr")) // Turkish case rules
+ .accentFold(Set.of(Character.UnicodeScript.LATIN), false) // fold only Latin
+ .maxTokenLength(255) // tokenizer chopping
+ .build();]]>
+ </programlisting>
+ <para>
+ The whitespace and dash methods take any <code>CharSequenceNormalizer</code>, so a
+ <code>CharClass</code> method reference (<code>::normalize</code> for one-for-one,
+ <code>::collapse</code> for run-collapsing) selects both the fold target and the behavior.
+ The case-fold method takes a <code>Locale</code> for language-specific rules such as the
+ Turkish dotted/dotless i, and the accent-fold method takes the scripts to fold and whether
+ to fold stroke letters.
+ </para>
+ </section>
+
+ <section xml:id="tools.normalizer.confusables">
+ <title>Confusable (homoglyph) folding</title>
+ <para>
+ <code>Confusables</code> reduces text to its Unicode confusable
+ <emphasis>skeleton</emphasis> following
+ <link xlink:href="https://www.unicode.org/reports/tr39/">UTS #39</link>: it decomposes the
+ text, replaces each code point with its prototype, and decomposes again. Two strings are
+ confusable exactly when their skeletons are equal, which catches spoofing where Cyrillic or
+ Greek letters imitate Latin ones.
+ </para>
+ <programlisting language="java">
+<![CDATA[// "paypal" with Cyrillic a's looks identical but is a different code-point sequence
+Confusables.confusable("paypal", spoofed); // true
+Confusables.skeleton("paypal"); // a matching key, not readable text]]>
+ </programlisting>
+ <para>
+ The skeleton changes length and offsets, so like accent folding it is a derived,
+ matching-only form. It is also available as
+ <code>ConfusableSkeletonCharSequenceNormalizer</code> and as the
+ <code>CONFUSABLE_FOLD</code> term dimension. The mapping comes from the bundled Unicode
+ security data file <code>confusables.txt</code>.
+ </para>
+ </section>
+
+ <section xml:id="tools.normalizer.language">
+ <title>Per-language profiles</title>
+ <para>
+ <code>NormalizationProfiles</code> selects per-language settings the same way OpenNLP
+ already selects a Snowball stemmer by language: ask for a language, or detect it with a
+ <code>LanguageDetector</code> when it is unspecified. Each
+ <code>NormalizationProfile</code> pairs a language with its Snowball stemmer and the
+ diacritic fold appropriate for that language, and builds a search-oriented
+ <code>TermAnalyzer</code>.
+ </para>
+ <programlisting language="java">
+<![CDATA[NormalizationProfile german = NormalizationProfiles.forLanguage("de").orElseThrow();
+TermAnalyzer analyzer = german.searchAnalyzer(); // NFC, case fold, German fold, German stemmer
+// "Mueller" and "Müller" both reduce to the same search term
+
+// detect the language when it is not known
+NormalizationProfiles.detect(text, languageDetector).map(NormalizationProfile::searchAnalyzer);]]>
+ </programlisting>
+ <para>
+ The diacritic fold is the generic accent fold for English and the major Romance languages,
+ the German-specific fold (a-umlaut to <code>ae</code>, eszett to <code>ss</code>, following
+ DIN 5007-2) for German, and none for the Nordic languages and non-Latin scripts, where
+ folding distinct letters is language-wrong. As stated in
+ <xref linkend="tools.normalizer.accentfold"/>, this is a search-recall choice, not
+ linguistic correctness; a caller that wants different behavior builds a
+ <code>TermAnalyzer</code> directly.
+ </para>
+ </section>
+
+ <section xml:id="tools.normalizer.reference">
+ <title>Reference data</title>
+ <para>
+ The underlying Unicode data is also available directly as immutable reference tables,
+ with O(1) membership tests that match the Unicode standard:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <code>UnicodeWhitespace</code> lists the 25 characters carrying the
+ <code>White_Space</code> property, plus the related look-alike format characters
+ (zero width space, byte order mark, ...) that are <emphasis>not</emphasis>
+ whitespace. It exposes <code>isWhitespace(int)</code>,
+ <code>byCodePoint(int)</code>, and helpers for the line breaks and the
+ non-breaking spaces.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <code>UnicodeDash</code> lists every code point carrying the <code>Dash</code>
+ property, distinguishing the mathematical minus signs that are excluded from the
+ default normalization set.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </section>
+
+</chapter>
diff --git a/opennlp-docs/src/docbkx/opennlp.xml b/opennlp-docs/src/docbkx/opennlp.xml
index 67eb1edf1..843bfbc9b 100644
--- a/opennlp-docs/src/docbkx/opennlp.xml
+++ b/opennlp-docs/src/docbkx/opennlp.xml
@@ -101,6 +101,7 @@ under the License.
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./langdetect.xml" />
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./sentdetect.xml"/>
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./tokenizer.xml" />
+ <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./normalizer.xml" />
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./stopword.xml" />
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./namefinder.xml" />
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./doccat.xml" />
diff --git a/opennlp-docs/src/docbkx/tokenizer.xml b/opennlp-docs/src/docbkx/tokenizer.xml
index b6fb7b074..7bb3356de 100644
--- a/opennlp-docs/src/docbkx/tokenizer.xml
+++ b/opennlp-docs/src/docbkx/tokenizer.xml
@@ -23,7 +23,16 @@
  The OpenNLP Tokenizers segment an input character sequence into tokens.
  Tokens are usually words, punctuation, numbers, etc.
-
+ </para>
+ <para>
+ The statistical tokenizers in this chapter assume conventional whitespace-separated training
+ and test data. When input contains Unicode spacing or dash variants (no-break space,
+ ideographic space, en dash, and similar characters from PDFs or the web), use the
+ Unicode-aware preprocessing described in <xref linkend="tools.normalizer"/>. The DL
+ components apply that machinery automatically for document chunking; see
+ <xref linkend="tools.normalizer.dl"/>.
+ </para>
+ <para>
  <screen>
 <![CDATA[Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
 Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
@@ -443,4 +452,84 @@ DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations)
  </para>
  </section>
  </section>
+
+ <section xml:id="tools.tokenizer.uax29">
+ <title>Unicode Word Segmentation (UAX #29)</title>
+ <para>
+ The package <code>opennlp.tools.tokenize.uax29</code> provides a tokenizer that follows the
+ Unicode Text Segmentation algorithm
+ (<link xlink:href="https://www.unicode.org/reports/tr29/">UAX #29</link>), word boundary
+ rules WB1 through WB999. It is rule based and needs no trained model, it works directly over
+ a <code>CharSequence</code>, and it reports character offsets so the original text is
+ preserved for downstream processing such as the normalization described in
+ <xref linkend="tools.normalizer"/>. The boundary data comes from the bundled Unicode
+ Character Database (currently Unicode 17.0) and the implementation passes the official
+ <code>WordBreakTest</code> conformance suite for that release.
+ </para>
+ <section xml:id="tools.tokenizer.uax29.segmenter">
+ <title>Word Segmenter</title>
+ <para>
+ <code>WordSegmenter</code> finds the word boundaries. It is a single forward cursor pass
+ with constant-time property look-ups and no regular expression. Every segment is
+ reported, including whitespace and punctuation runs, so the segments are contiguous and
+ together cover the whole text.
+ <programlisting language="java">
+<![CDATA[int[] boundaries = WordSegmenter.boundaries("The quick brown fox.");
+List<Span> segments = WordSegmenter.segments("The quick brown fox.");]]>
+ </programlisting>
+ For allocation-free processing of large inputs, stream the segments to a callback instead
+ of collecting them.
+ <programlisting language="java">
+<![CDATA[WordSegmenter.forEachSegment("The quick brown fox.", (start, end) -> {
+  // handle the segment [start, end)
+});]]>
+ </programlisting>
+ </para>
+ </section>
+ <section xml:id="tools.tokenizer.uax29.tokenizer">
+ <title>Word Tokenizer</title>
+ <para>
+ <code>WordTokenizer</code> builds on the segmenter. It keeps the segments that are words
+ (letters, digits, ideographs, kana, Hangul, a Southeast Asian script, or emoji), drops
+ whitespace and punctuation, and classifies each token. It implements the standard
+ <code>Tokenizer</code> interface, so it can be used wherever a tokenizer is expected.
+ <programlisting language="java">
+<![CDATA[Tokenizer tokenizer = new WordTokenizer();
+String[] tokens = tokenizer.tokenize("The quick brown fox.");
+Span[] spans = tokenizer.tokenizePos("The quick brown fox.");]]>
+ </programlisting>
+ The tokens array contains "The", "quick", "brown", and "fox"; the trailing period and the
+ spaces are dropped. The <code>tokenizeTyped</code> method additionally returns the
+ category of each token as a <code>WordType</code>.
+ <programlisting language="java">
+<![CDATA[WordTokenizer wordTokenizer = new WordTokenizer();
+for (WordToken token : wordTokenizer.tokenizeTyped("OpenNLP 3.0")) {
+  System.out.println(token.text("OpenNLP 3.0") + " : " + token.type());
+}
+// OpenNLP : ALPHANUMERIC
+// 3.0 : NUMERIC]]>
+ </programlisting>
+ The categories are <code>ALPHANUMERIC</code>, <code>NUMERIC</code>,
+ <code>IDEOGRAPHIC</code>, <code>HIRAGANA</code>, <code>KATAKANA</code>,
+ <code>HANGUL</code>, <code>SOUTHEAST_ASIAN</code>, and <code>EMOJI</code>.
+ </para>
+ <para>
+ A streaming overload reports each token to a handler with no per-token allocation, which
+ is the fastest option when the tokens are consumed on the fly.
+ <programlisting language="java">
+<![CDATA[WordTokenizer wordTokenizer = new WordTokenizer();
+wordTokenizer.tokenize("The quick brown fox.", (start, end, type) -> {
+  // handle the token [start, end) of the given WordType
+});]]>
+ </programlisting>
+ A token longer than the maximum token length is emitted as consecutive pieces without
+ splitting a surrogate pair. The maximum defaults to
+ <code>WordTokenizer.DEFAULT_MAX_TOKEN_LENGTH</code> and can be set through the
+ constructor.
+ <programlisting language="java">
+<![CDATA[WordTokenizer wordTokenizer = new WordTokenizer(64);]]>
+ </programlisting>
+ </para>
+ </section>
+ </section>
 </chapter>
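
Note on the maximum-token-length paragraph in the tokenizer.xml hunk above: the following is a minimal sketch, not part of the commit, of how an over-long token could be emitted as consecutive pieces without splitting a surrogate pair. It uses only the JDK. The class MaxLengthSplitSketch and its split method are hypothetical names, not OpenNLP API. It also assumes the maximum is counted in UTF-16 chars, which the patch does not state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the WordTokenizer implementation.
final class MaxLengthSplitSketch {

    /**
     * Splits the range [start, end) of text into consecutive pieces of at most
     * maxLength chars (assumption: length is measured in UTF-16 chars), never
     * cutting between the two halves of a surrogate pair.
     */
    static List<int[]> split(CharSequence text, int start, int end, int maxLength) {
        if (maxLength < 2) {
            // A surrogate pair needs two chars, so a smaller limit could not make progress.
            throw new IllegalArgumentException("maxLength must be at least 2");
        }
        List<int[]> pieces = new ArrayList<>();
        int pos = start;
        while (pos < end) {
            int cut = Math.min(pos + maxLength, end);
            // If the cut falls between a high and a low surrogate, back up by one char
            // so the whole pair moves to the next piece.
            if (cut < end
                    && Character.isHighSurrogate(text.charAt(cut - 1))
                    && Character.isLowSurrogate(text.charAt(cut))) {
                cut--;
            }
            pieces.add(new int[] {pos, cut});
            pos = cut;
        }
        return pieces;
    }

    public static void main(String[] args) {
        // "ab", then U+1F600 as a surrogate pair, then "cd": six chars in total.
        String text = "ab\uD83D\uDE00cd";
        for (int[] piece : split(text, 0, text.length(), 3)) {
            System.out.println(piece[0] + "-" + piece[1]);
        }
        // prints 0-2, 2-5 and 5-6, one per line
    }
}
```

With a limit of 3, the first piece stops after "ab" rather than after the high surrogate, so the emoji stays intact in the second piece. The pieces remain contiguous and cover the whole range.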
