(opennlp) branch main updated: OPENNLP-1846: Fix NameFinderDL only worked with Person, expand to all types (#1086)

mawiesne Fri, 19 Jun 2026 04:13:59 -0700

This is an automated email from the ASF dual-hosted git repository.

mawiesne pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/opennlp.git



The following commit(s) were added to refs/heads/main by this push:
     new 4dcf7eff4 OPENNLP-1846: Fix NameFinderDL only worked with Person, expand to all types (#1086)
4dcf7eff4 is described below

commit 4dcf7eff4cee296d9fdd9bee28479b8716942bfa
Author: Kristian Rickert <[email protected]>
AuthorDate: Fri Jun 19 07:13:37 2026 -0400

    OPENNLP-1846: Fix NameFinderDL only worked with Person, expand to all types (#1086)
    
    NameFinderDL only decoded B-PER/I-PER and put the matched text in
    Span.getType() instead of the entity label. Decode the BIO sequence
    generically and harden it:
    
    - Any B-<TYPE> begins a span whose type is the label minus the B- prefix
      (B-ORG -> ORG), extending while the following labels are I-<same type>.
      Span.getType() now reports the entity label (PER, ORG, LOC, ...) and
      ids2Labels fully drives recognition for any BIO-tagged model.
    - isBeginLabel() requires a non-empty type after "B-", so a malformed "B-"
      label no longer starts an empty-type span. An argmax index with no entry
      in ids2Labels fails loudly instead of being silently skipped.
    - Span.getProb() is now a numerically stable softmax over the token's label
      scores (bounded to [0,1]) instead of the raw max logit; handles +Inf,
      all-(-Inf) and NaN edge cases.
    - find() inference is fail-loud and consistent with the sibling
      DocumentCategorizerDL: failures surface as IllegalStateException (cause
      preserved) and an unexpected/empty model-output shape is its own loud
      failure, rather than a bare RuntimeException or raw ClassCastException.
    - Floor the character-search cursor at each sentence's start (via
      sentPosDetect) and thread it forward across that sentence's chunks, so a
      repeated entity surface form is located at its own occurrence instead of
      being re-matched against an earlier one -- which previously emitted
      duplicate or mis-located spans for multi-sentence/multi-chunk input.
    - Span text reconstruction matches the source with flexible whitespace
      (\s*), so entities whose wordpiece tokenization splits internal
      punctuation or "&" apart (U.S.A, AT&T) are still located instead of
      silently dropped.
    - Remove the now-unused SpanEnd record.
    - Extract decodeSpans()/predictLabel()/findEntityEnd()/buildSpanText() and
      expose labelProbability()/maxIndex() for unit testing without an ONNX
      model; add NameFinderDLTest coverage for entity types, bounded and
      edge-case probabilities, malformed begin labels, wordpiece
      reconstruction, internal-punctuation and case-insensitive matching,
      missing labels, and cursor-threaded span location.
    - Reconcile the OPENNLP-1844 concurrency/snapshot eval tests with the new
      all-types output (the George-Washington input now yields PER + LOC) and
      assert span types and covered text.
    
    Keep unmapped label ids graceful, bound decoded span lookup to the current
    sentence, add diagnostics for unlocated decoded spans, and tighten
    exception types/messages plus helper documentation.
    
    Make the public no-space token constants immutable (Set.of instead of
    mutable arrays) while keeping them public for third-party use.
    
    Fail loud on an unmapped model output index: predictLabel now throws an
    IllegalStateException naming the index instead of degrading the token to
    "O", and the constructors document that ids2Labels must be exhaustive over
    the model's output indices. Also document the IllegalArgumentException that
    find() can raise on a vocabulary/model mismatch.
    
    Add edge-case decoding tests: token/score count mismatch, orphan I- labels,
    adjacent entities of different types, multi-token minimum-probability
    semantics, repeated entities at distinct offsets within one call, regex
    metacharacters in span text, and search-start clamping past end of text.
---
 opennlp-core/opennlp-ml/opennlp-dl/README.md       |  16 +-
 .../src/main/java/opennlp/dl/SpanEnd.java          |  27 --
 .../java/opennlp/dl/namefinder/NameFinderDL.java   | 535 +++++++++++++++------
 .../opennlp/dl/namefinder/NameFinderDLTest.java    | 368 ++++++++++++++
 .../opennlp/dl/namefinder/NameFinderDLEval.java    |  60 ++-
 5 files changed, 812 insertions(+), 194 deletions(-)
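
Reviewer note (illustrative, not part of the patch): the generic BIO decoding and the softmax-based span confidence described in the commit message can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the committed `NameFinderDL` code. The class `BioDecodingSketch`, the record `Entity`, and the methods `argmax`, `softmaxAt`, and `decode` are hypothetical names chosen for illustration. The sketch works on token indices only (no WordPiece reconstruction or character offsets) and assumes finite scores, so the +Inf, all-(-Inf), and NaN handling mentioned in the commit message is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical, not OpenNLP API. */
final class BioDecodingSketch {

  /** One decoded entity: inclusive token range, type without BIO prefix, confidence. */
  record Entity(int start, int end, String type, double probability) { }

  /** Returns the index of the largest score; the first index wins on ties. */
  static int argmax(float[] scores) {
    int best = 0;
    for (int i = 1; i < scores.length; i++) {
      if (scores[i] > scores[best]) {
        best = i;
      }
    }
    return best;
  }

  /** Softmax probability of scores[index]; subtracting the max keeps exp() from overflowing. */
  static double softmaxAt(float[] scores, int index) {
    double max = Double.NEGATIVE_INFINITY;
    for (float s : scores) {
      max = Math.max(max, s);
    }
    double denominator = 0;
    for (float s : scores) {
      denominator += Math.exp(s - max);
    }
    return Math.exp(scores[index] - max) / denominator;
  }

  /** Groups B-TYPE followed by I-TYPE tokens; confidence is the weakest token probability. */
  static List<Entity> decode(float[][] scores, Map<Integer, String> labels) {
    final List<Entity> entities = new ArrayList<>();
    for (int i = 0; i < scores.length; i++) {
      final int id = argmax(scores[i]);
      final String label = labels.get(id);
      if (label == null) {
        // Assumption mirroring the commit message: an unmapped index is a configuration error.
        throw new IllegalStateException("No label for model output index " + id);
      }
      if (!label.startsWith("B-") || label.length() == 2) {
        continue; // not a well-formed begin label
      }
      final String type = label.substring(2);
      double probability = softmaxAt(scores[i], id);
      int end = i;
      while (end + 1 < scores.length) {
        final int nextId = argmax(scores[end + 1]);
        if (!("I-" + type).equals(labels.get(nextId))) {
          break; // a different type, another B- label, or O ends the entity
        }
        end++;
        probability = Math.min(probability, softmaxAt(scores[end], nextId));
      }
      entities.add(new Entity(i, end, type, probability));
      i = end; // resume after the entity
    }
    return entities;
  }

  public static void main(String[] args) {
    final Map<Integer, String> labels = Map.of(0, "O", 1, "B-PER", 2, "I-PER", 3, "B-LOC");
    final float[][] scores = {
        {5f, 0f, 0f, 0f}, {0f, 5f, 0f, 0f}, {0f, 0f, 5f, 0f}, {0f, 0f, 0f, 5f}};
    decode(scores, labels).forEach(System.out::println);
  }
}
```

With the sample scores in `main`, the sketch should yield one `PER` entity covering tokens 1 to 2 and one `LOC` entity at token 3. Taking the minimum token probability follows the "weakest continuation" wording in the patch's Javadoc.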

diff --git a/opennlp-core/opennlp-ml/opennlp-dl/README.md b/opennlp-core/opennlp-ml/opennlp-dl/README.md
index 912cd983d..04a7715d4 100644
--- a/opennlp-core/opennlp-ml/opennlp-dl/README.md
+++ b/opennlp-core/opennlp-ml/opennlp-dl/README.md
@@ -8,7 +8,21 @@ Models used in the tests are available in the [opennlp evaluation test data](htt
 
 ## NameFinderDL
 
-Export a Huggingface NER model to ONNX, e.g.:
+`NameFinderDL` runs ONNX token-classification models that use BIO labels. Any
+label in the form `B-<TYPE>` starts an entity and subsequent `I-<TYPE>` labels
+continue that entity. The text after the prefix is reported as the OpenNLP span
+type, for example `B-PER` and `I-PER` produce spans with type `PER`.
+
+The finder uses BERT basic tokenization followed by WordPiece tokenization and
+then maps the reconstructed WordPiece text back to the caller's original input
+so returned spans can be used with `Span#getCoveredText(...)`. Span probabilities
+are normalized from the model logits and are reported in the range `(0, 1]`.
+
+Named entity models are commonly cased, so lower casing is disabled by default.
+Set `InferenceOptions#setLowerCase(true)` only for models trained with uncased
+input.
+
+Export a Hugging Face NER model to ONNX, e.g.:
 
 ```bash
 python -m transformers.onnx --model=dslim/bert-base-NER --feature token-classification exported
diff --git a/opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/SpanEnd.java b/opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/SpanEnd.java
deleted file mode 100644
index 2c91c1928..000000000
--- a/opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/SpanEnd.java
+++ /dev/null
@@ -1,27 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package opennlp.dl;
-
-public record SpanEnd(int index, int characterEnd) {
-
-  @Override
-  public String toString() {
-    return "index: " + index + "; character end: " + characterEnd;
-  }
-
-}
diff --git a/opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java b/opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java
index 3445969e8..e5b5c89b5 100644
--- a/opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java
+++ b/opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java
@@ -20,22 +20,25 @@ package opennlp.dl.namefinder;
 import java.io.File;
 import java.io.IOException;
 import java.nio.LongBuffer;
+import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.HashMap;
 import java.util.LinkedList;
 import java.util.List;
 import java.util.Map;
 import java.util.Objects;
+import java.util.Set;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 
 import ai.onnxruntime.OnnxTensor;
 import ai.onnxruntime.OrtException;
 import ai.onnxruntime.OrtSession;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 import opennlp.dl.AbstractDL;
 import opennlp.dl.InferenceOptions;
-import opennlp.dl.SpanEnd;
 import opennlp.dl.Tokens;
 import opennlp.tools.commons.ThreadSafe;
 import opennlp.tools.namefind.TokenNameFinder;
@@ -67,14 +70,31 @@ import opennlp.tools.util.Span;
 @ThreadSafe
 public class NameFinderDL extends AbstractDL implements TokenNameFinder {
 
+  /** Example person labels; retained for reference. Decoding handles any B-/I- type. */
   public static final String I_PER = "I-PER";
   public static final String B_PER = "B-PER";
   public static final String SEPARATOR = "[SEP]";
+  public static final String CLS_TOKEN = "[CLS]";
+
+  /** Prefix used by BIO labels for the first token in an entity span. */
+  public static final String PREFIX_BEGIN = "B-";
+
+  /** Prefix used by BIO labels for continuation tokens in an entity span. */
+  public static final String PREFIX_INSIDE = "I-";
+
+  /** Tokens that attach directly to the preceding token when span text is reconstructed. */
+  public static final Set<String> NO_SPACE_BEFORE_TOKENS =
+      Set.of(".", ",", ":", ";", "!", "?", ")", "]", "}", "%", "'", "-", "/");
+
+  /** Tokens after which the following token attaches directly when span text is reconstructed. */
+  public static final Set<String> NO_SPACE_AFTER_TOKENS =
+      Set.of("(", "[", "{", "$", "'", "-", "/");
 
   /** NER models are commonly cased, so lower casing is off by default. */
   private static final boolean LOWER_CASE_DEFAULT = false;
 
   private static final String CHARS_TO_REPLACE = "##";
+  private static final Logger logger = LoggerFactory.getLogger(NameFinderDL.class);
 
   private final SentenceDetector sentenceDetector;
   private final Map<Integer, String> ids2Labels;
@@ -90,7 +110,9 @@ public class NameFinderDL extends AbstractDL implements TokenNameFinder {
    * 
    * @param model The ONNX model file.
    * @param vocabulary The model file's vocabulary file.
-   * @param ids2Labels The mapping of ids to labels.
+   * @param ids2Labels The mapping of model output indices to BIO labels. This must be exhaustive
+   *     over the model's output indices; a token whose predicted index is unmapped raises an
+   *     {@link IllegalStateException} during {@link #find(String[])}.
    * @param sentenceDetector The {@link SentenceDetector} to be used.
    *
    * @throws OrtException Thrown if the {@code model} cannot be loaded.
@@ -108,7 +130,9 @@ public class NameFinderDL extends AbstractDL implements TokenNameFinder {
    *
    * @param model The ONNX model file.
    * @param vocabulary The model file's vocabulary file.
-   * @param ids2Labels The mapping of ids to labels.
+   * @param ids2Labels The mapping of model output indices to BIO labels. This must be exhaustive
+   *     over the model's output indices; a token whose predicted index is unmapped raises an
+   *     {@link IllegalStateException} during {@link #find(String[])}.
    * @param inferenceOptions {@link InferenceOptions} to control the inference.
    * @param sentenceDetector The {@link SentenceDetector} to be used.
    *
@@ -141,249 +165,448 @@ public class NameFinderDL extends AbstractDL implements TokenNameFinder {
     return inferenceOptions;
   }
 
+  /**
+   * {@inheritDoc}
+   *
+   * <p>This method joins the provided tokens with spaces, sentence-splits the joined text,
+   * runs each sentence through the ONNX token-classification model, decodes BIO labels into
+   * {@link Span spans}, and resolves those spans back to character offsets in the joined text.</p>
+   *
+   * @throws IllegalStateException Thrown if inference fails, if the model output shape is not
+   *     the expected {@code float[batch][token][label]} form, if the model output contains
+   *     no usable label score for a token, or if the model's predicted index for a token is not
+   *     present in the configured label map.
+   * @throws IllegalArgumentException Thrown if a token produced for the input is not present in
+   *     the vocabulary, which indicates the vocabulary file does not match the model.
+   */
   @Override
   public Span[] find(String[] input) {
 
-    final List<Span> spans = new LinkedList<>();
+    final List<Span> spans = new ArrayList<>();
 
     // Join the tokens here because they will be tokenized using Wordpiece during inference.
     final String text = String.join(" ", input);
 
-    final String[] sentences = sentenceDetector.sentDetect(text);
+    // sentPosDetect (not sentDetect) so each sentence's offset in the full text is known.
+    final Span[] sentenceSpans = sentenceDetector.sentPosDetect(text);
 
-    for (String sentence : sentences) {
+    for (final Span sentenceSpan : sentenceSpans) {
+
+      // Floor the character cursor at this sentence's start, then thread it forward across the
+      // sentence's chunks so a repeated surface form is located at its next occurrence. Flooring
+      // per sentence keeps an entity from being matched against an identical surface form in an
+      // earlier sentence -- even one that produced no spans, which would otherwise leave the
+      // cursor behind and mis-locate the match.
+      int searchStart = sentenceSpan.getStart();
 
       // The WordPiece tokenized text. This changes the spacing in the text.
-      final List<Tokens> wordpieceTokens = tokenize(sentence);
+      final List<Tokens> wordpieceTokens = tokenize(sentenceSpan.getCoveredText(text).toString());
 
       for (final Tokens tokens : wordpieceTokens) {
+        final List<Span> decoded =
+            decodeSpans(text, tokens.tokens(), infer(tokens), ids2Labels, searchStart,
+                sentenceSpan.getEnd());
+        spans.addAll(decoded);
+        if (!decoded.isEmpty()) {
+          searchStart = decoded.get(decoded.size() - 1).getEnd();
+        }
+      }
 
-        try {
-
-          // The inputs to the ONNX model.
-          final Map<String, OnnxTensor> inputs = new HashMap<>();
-
-          final float[][][] v;
-          try {
-            inputs.put(INPUT_IDS, OnnxTensor.createTensor(env, LongBuffer.wrap(tokens.ids()),
-                new long[] {1, tokens.ids().length}));
-
-            if (includeAttentionMask) {
-              inputs.put(ATTENTION_MASK, OnnxTensor.createTensor(env,
-                  LongBuffer.wrap(tokens.mask()), new long[] {1, tokens.mask().length}));
-            }
-
-            if (includeTokenTypeIds) {
-              inputs.put(TOKEN_TYPE_IDS, OnnxTensor.createTensor(env,
-                  LongBuffer.wrap(tokens.types()), new long[] {1, tokens.types().length}));
-            }
-
-            // The outputs from the model.
-            try (OrtSession.Result result = session.run(inputs)) {
-              // getValue() copies the tensor into Java arrays, so the result can be closed safely.
-              v = (float[][][]) result.get(0).getValue();
-            }
-          } finally {
-            inputs.values().forEach(OnnxTensor::close);
-          }
-
-          // Find consecutive B-PER and I-PER labels and combine the spans where necessary.
-          // There are also B-LOC and I-LOC tags for locations that might be useful at some point.
-
-          // Keep track of where the last span was so when there are multiple/duplicate
-          // spans we can get the next one instead of the first one each time.
-          int characterStart = 0;
+    }
 
-          final String[] toks = tokens.tokens();
+    return spans.toArray(new Span[0]);
 
-          // We are looping over the vector for each word,
-          // finding the index of the array that has the maximum value,
-          // and then finding the token classification that corresponds to that index.
-          for (int x = 0; x < v[0].length; x++) {
+  }
 
-            final float[] arr = v[0][x];
-            final int maxIndex = maxIndex(arr);
-            final String label = ids2Labels.get(maxIndex);
+  /**
+   * Runs the model on one token window and returns the per-token label score rows. A failure
+   * executing the model (an {@link OrtException} or any runtime fault) is surfaced as an
+   * {@link IllegalStateException} (cause preserved); an unexpected output shape is its own loud
+   * failure. This mirrors the fail-loud contract of the sibling {@code DocumentCategorizerDL}.
+   *
+   * @param tokens The tokens for one chunk to run inference on.
+   * @return The {@code [token][label]} score matrix for the chunk.
+   */
+  private float[][] infer(final Tokens tokens) {
 
-            // TODO: Need to make sure this value is between 0 and 1?
-            // Can we do thresholding without it between 0 and 1?
-            final double confidence = arr[maxIndex]; // / 10;
+    final Map<String, OnnxTensor> inputs = new HashMap<>();
+    final Object output;
+    try {
+      inputs.put(INPUT_IDS, OnnxTensor.createTensor(env, LongBuffer.wrap(tokens.ids()),
+          new long[] {1, tokens.ids().length}));
 
-            // Is this is the start of a person entity.
-            if (B_PER.equals(label)) {
+      if (includeAttentionMask) {
+        inputs.put(ATTENTION_MASK, OnnxTensor.createTensor(env,
+            LongBuffer.wrap(tokens.mask()), new long[] {1, tokens.mask().length}));
+      }
 
-              String spanText;
+      if (includeTokenTypeIds) {
+        inputs.put(TOKEN_TYPE_IDS, OnnxTensor.createTensor(env,
+            LongBuffer.wrap(tokens.types()), new long[] {1, tokens.types().length}));
+      }
 
-              // Find the end index of the span in the array (where the label is not I-PER).
-              final SpanEnd spanEnd = findSpanEnd(v, x, ids2Labels, toks);
+      // getValue() copies the tensor into Java arrays, so the result can be closed safely.
+      try (OrtSession.Result result = session.run(inputs)) {
+        output = result.get(0).getValue();
+      }
+    } catch (OrtException ex) {
+      throw new IllegalStateException(
+          "Unable to perform name finder inference: " + ex.getMessage(), ex);
+    } catch (RuntimeException ex) {
+      throw new IllegalStateException(
+          "Unexpected runtime failure during name finder inference: " + ex.getMessage(), ex);
+    } finally {
+      inputs.values().forEach(OnnxTensor::close);
+    }
 
-              // If the end is -1 it means this is a single-span token.
-              // If the end is != -1 it means this is a multi-span token.
-              if (spanEnd.index() != -1) {
+    // The model returns one score row per token, batched: float[batch][token][label]. Any other
+    // shape (or an empty batch) is a model-contract violation, surfaced on its own rather than as
+    // "inference failed".
+    if (output instanceof float[][][] v) {
+      if (v.length == 0) {
+        throw new IllegalStateException("Model output batch must contain at least one entry.");
+      }
+      return v[0];
+    }
+    throw new IllegalStateException("Unexpected model output type: "
+        + (output == null ? "null" : output.getClass().getName()));
+  }
 
-                final StringBuilder sb = new StringBuilder();
+  @Override
+  public void clearAdaptiveData() {
+    // No use in this implementation.
+  }
 
-                // We have to concatenate the tokens.
-                // Add each token in the array and separate them with a space.
-                // We'll separate each with a single space because later we'll find the original span
-                // in the text and ignore spacing between individual tokens in findByRegex().
-                int end = spanEnd.index();
-                for (int i = x; i <= end; i++) {
+  /**
+   * Decodes {@link Span spans} beginning the character search at the start of {@code text}. Equivalent to
+   * {@link #decodeSpans(String, String[], float[][], Map, int)} with {@code searchStart == 0}.
+   *
+   * @param text The original text passed to the model.
+   * @param tokens The WordPiece tokens produced for the text.
+   * @param tokenLabelScores The per-token label scores returned by the model.
+   * @param id2Labels The mapping from model output indexes to BIO labels.
+   * @return The decoded {@link Span spans}.
+   */
+  static List<Span> decodeSpans(String text, String[] tokens, float[][] tokenLabelScores,
+                                Map<Integer, String> id2Labels) {
+    return decodeSpans(text, tokens, tokenLabelScores, id2Labels, 0);
+  }
 
-                  // If the next token starts with ##, combine it with this token.
-                  if (toks[i + 1].startsWith(CHARS_TO_REPLACE)) {
+  /**
+   * Converts model token classifications into character {@link Span spans} in the original input text.
+   *
+   * <p>The ONNX model returns one score vector for each WordPiece token. This method applies
+   * BIO decoding, reconstructs WordPiece fragments, and then resolves the reconstructed text
+   * against the original sentence so that {@link Span#getCoveredText(CharSequence)} works with
+   * the caller's input.</p>
+   *
+   * @param text The original text passed to the model.
+   * @param tokens The WordPiece tokens produced for the text.
+   * @param tokenLabelScores The per-token label scores returned by the model.
+   * @param id2Labels The mapping from model output indexes to BIO labels.
+   * @param searchStart The character offset in {@code text} to begin locating spans from. Threading
+   *     a monotonic cursor across the chunks and sentences of a single {@link #find(String[])} call
+   *     keeps a repeated entity surface form from being emitted twice at the same first occurrence.
+   * @return The decoded {@link Span spans}.
+   */
+  static List<Span> decodeSpans(String text, String[] tokens, float[][] tokenLabelScores,
+                                Map<Integer, String> id2Labels, int searchStart) {
+    return decodeSpans(text, tokens, tokenLabelScores, id2Labels, searchStart, text.length());
+  }
 
-                    sb.append(toks[i]).append(toks[i + 1].replace(CHARS_TO_REPLACE, ""));
+  /**
+   * Converts model token classifications into character {@link Span spans} within a bounded
+   * region of the original input text.
+   *
+   * @param text The original text passed to the model.
+   * @param tokens The WordPiece tokens produced for the text.
+   * @param tokenLabelScores The per-token label scores returned by the model.
+   * @param id2Labels The mapping from model output indexes to BIO labels.
+   * @param searchStart The first character offset in {@code text} to search.
+   * @param searchEnd The exclusive upper bound for locating reconstructed spans. During
+   *     {@link #find(String[])}, this is the current sentence end so an entity from one sentence
+   *     cannot be resolved to an identical surface form in a later sentence.
+   * @return The decoded {@link Span spans}.
+   */
+  static List<Span> decodeSpans(String text, String[] tokens, float[][] tokenLabelScores,
+                                Map<Integer, String> id2Labels, int searchStart, int searchEnd) {
 
-                    // Append a space unless the next (next) token starts with ##.
-                    if (!toks[i + 2].startsWith(CHARS_TO_REPLACE)) {
-                      sb.append(" ");
-                    }
+    if (tokens.length != tokenLabelScores.length) {
+      throw new IllegalArgumentException("The number of tokens (" + tokens.length
+          + ") must match the number of model output rows (" + tokenLabelScores.length + ").");
+    }
 
-                    // Skip the next token since we just included it in this iteration.
-                    i++;
+    final List<Span> spans = new ArrayList<>();
 
-                  } else {
+    int characterStart = searchStart;
 
-                    sb.append(toks[i].replace(CHARS_TO_REPLACE, ""));
+    for (int x = 0; x < tokenLabelScores.length; x++) {
+      final LabelPrediction prediction = predictLabel(tokenLabelScores[x], id2Labels);
+      if (!isBeginLabel(prediction.label())) {
+        continue;
+      }
 
-                    // Append a space unless the next token is a period.
-                    if (!".".equals(toks[i + 1])) {
-                      sb.append(" ");
-                    }
+      final String entityType = prediction.label().substring(PREFIX_BEGIN.length());
+      final EntityPrediction entity = findEntityEnd(tokenLabelScores, x, id2Labels,
+          entityType, prediction.probability());
+      final String spanText = buildSpanText(tokens, x, entity.endIndex());
 
-                  }
+      if (spanText.isBlank()) {
+        x = entity.endIndex();
+        continue;
+      }
 
-                }
+      final SpanMatch match = findByRegex(text, spanText, characterStart, searchEnd);
+      if (match.start() != -1) {
+        spans.add(new Span(match.start(), match.end(), entityType, entity.probability()));
+        characterStart = match.end();
+      } else {
+        logger.debug("Unable to locate decoded {} span '{}' in source text region [{}, {}).",
+            entityType, spanText, characterStart, searchEnd);
+      }
 
-                // This is the text of the span. We use the whole original input text and not one
-                // of the splits. This gives us accurate character positions.
-                spanText = findByRegex(text, sb.toString().trim()).trim();
+      x = entity.endIndex();
+    }
 
-              } else {
+    return spans;
 
-                // This is a single-token span so there is nothing else to do except grab the token.
-                spanText = toks[x];
+  }
 
-              }
+  /**
+   * Finds the final token index and confidence for one BIO entity that starts at {@code startIndex}.
+   *
+   * <p>The span continues while subsequent predictions are {@code I-<same type>}. The returned
+   * probability is the minimum token probability across the entity, so a multi-token span reflects
+   * its weakest continuation.</p>
+   *
+   * @param tokenLabelScores The per-token label scores returned by the model.
+   * @param startIndex The token index where the entity begins.
+   * @param id2Labels The mapping from model output indexes to BIO labels.
+   * @param entityType The entity type without its BIO prefix, for example {@code PER}.
+   * @param startProbability The normalized probability of the begin label.
+   * @return The last token index and probability for the entity.
+   */
+  private static EntityPrediction findEntityEnd(float[][] tokenLabelScores, int startIndex,
+                                                Map<Integer, String> id2Labels,
+                                                String entityType,
+                                                double startProbability) {
+
+    final String insideLabel = PREFIX_INSIDE + entityType;
+    int endIndex = startIndex;
+    double probability = startProbability;
+
+    for (int x = startIndex + 1; x < tokenLabelScores.length; x++) {
+      final LabelPrediction prediction = predictLabel(tokenLabelScores[x], id2Labels);
+      if (!insideLabel.equals(prediction.label())) {
+        break;
+      }
+      endIndex = x;
+      probability = Math.min(probability, prediction.probability());
+    }
 
-              if (!SEPARATOR.equals(spanText)) {
+    return new EntityPrediction(endIndex, probability);
 
-                spanText = spanText.replace(CHARS_TO_REPLACE, "");
+  }
 
-                // This ignores other potential matches in the same sentence
-                // by only taking the first occurrence.
-                characterStart = text.indexOf(spanText, characterStart);
+  /**
+   * Returns whether a label is a well-formed BIO begin label.
+   *
+   * @param label The label to inspect.
+   * @return {@code true} for {@code B-<TYPE>} labels with a non-empty type.
+   */
+  private static boolean isBeginLabel(String label) {
+    return label.startsWith(PREFIX_BEGIN) && label.length() > PREFIX_BEGIN.length();
+  }
 
-                // TODO: This check should not be needed because the span was found.
-                // If we aren't finding it now it's because there's a whitespace difference.
-                if (characterStart != -1) {
+  /**
+   * Picks the predicted BIO label for one token.
+   *
+   * @param scores The model scores for one token.
+   * @param id2Labels The mapping from model output indexes to BIO labels.
+   * @return The predicted label and its normalized probability.
+   * @throws IllegalStateException Thrown if the model's argmax index is absent from
+   *     {@code id2Labels}, which means the label map is not exhaustive over the model's output
+   *     indices and the model/label-map pair is misconfigured.
+   */
+  private static LabelPrediction predictLabel(float[] scores, Map<Integer, String> id2Labels) {
 
-                  final int characterEnd = characterStart + spanText.length();
+    final int labelIndex = maxIndex(scores);
+    final String label = id2Labels.get(labelIndex);
+    if (label == null) {
+      throw new IllegalStateException("Model output index " + labelIndex
+          + " has no configured label; ids2Labels must map every model output index.");
+    }
 
-                  spans.add(new Span(characterStart, characterEnd, spanText, confidence));
+    return new LabelPrediction(label, labelProbability(scores, labelIndex));
 
-                  // OP-1: Only increment characterStart by one.
-                  characterStart++;
+  }
 
-                }
+  /**
+   * Normalizes model scores into a probability for one label index using a numerically stable
+   * softmax.
+   *
+   * @param scores The raw model scores for one token.
+   * @param labelIndex The label index whose probability should be returned.
+   * @return The normalized probability in {@code [0, 1]}.
+   */
+  static double labelProbability(float[] scores, int labelIndex) {
 
-              }
+    int positiveInfinityCount = 0;
+    double max = Float.NEGATIVE_INFINITY;
 
-            }
+    for (float score : scores) {
+      if (score == Float.POSITIVE_INFINITY) {
+        positiveInfinityCount++;
+      } else if (!Float.isNaN(score) && score > max) {
+        max = score;
+      }
+    }
 
-          }
+    if (positiveInfinityCount > 0) {
+      // From decodeSpans, labelIndex is always the argmax, so when any +Inf is present the chosen
+      // score is +Inf and this returns 1/(number of +Inf). The 0d arm covers a direct caller
+      // asking for a non-+Inf label's probability while a +Inf label exists (exercised by tests).
+      return scores[labelIndex] == Float.POSITIVE_INFINITY ? 1d / positiveInfinityCount : 0d;
+    }
 
-        } catch (OrtException ex) {
-          throw new RuntimeException("Error performing namefinder inference: " + ex.getMessage(), ex);
-        }
+    if (max == Float.NEGATIVE_INFINITY) {
+      return 1d / scores.length;
+    }
 
+    double denominator = 0;
+    for (float score : scores) {
+      if (!Float.isNaN(score)) {
+        denominator += Math.exp(score - max);
       }
-
     }
 
-    return spans.toArray(new Span[0]);
+    return Math.exp(scores[labelIndex] - max) / denominator;
 
   }
 
-  @Override
-  public void clearAdaptiveData() {
-    // No use in this implementation.
-  }
-
-  private SpanEnd findSpanEnd(float[][][] v, int startIndex, Map<Integer, String> id2Labels,
-                              String[] tokens) {
-
-    // -1 means there is no follow-up token, so it is a single-token span.
-    int index = -1;
-    int characterEnd = 0;
-
-    // Starts at the span start in the vector.
-    // Looks at the next token to see if it is an I-PER.
-    // Go until the next token is something other than I-PER.
-    // When the next token is not I-PER, return the previous index.
-
-    for (int x = startIndex + 1; x < v[0].length; x++) {
+  /**
+   * Reconstructs source-like text from a span of WordPiece tokens.
+   *
+   * <p>Special BERT tokens are skipped, {@code ##} continuations are merged into the preceding
+   * surface form, and simple punctuation spacing is normalized so the result can be located in
+   * the caller's original text.</p>
+   *
+   * @param tokens The WordPiece token sequence.
+   * @param startIndex The first token index to include.
+   * @param endIndex The last token index to include.
+   * @return The reconstructed span text.
+   */
+  static String buildSpanText(String[] tokens, int startIndex, int endIndex) {
 
-      // Get the next item.
-      final float[] arr = v[0][x];
+    final StringBuilder span = new StringBuilder();
+    String previousToken = null;
 
-      // See if the next token has an I-PER label.
-      final String nextTokenClassification = id2Labels.get(maxIndex(arr));
+    for (int x = startIndex; x <= endIndex && x < tokens.length; x++) {
+      final String token = tokens[x];
+      if (CLS_TOKEN.equals(token) || SEPARATOR.equals(token)) {
+        continue;
+      }
 
-      if (!I_PER.equals(nextTokenClassification)) {
-        index = x - 1;
-        break;
+      final boolean subword = token.startsWith(CHARS_TO_REPLACE);
+      final String surface = subword ? token.substring(CHARS_TO_REPLACE.length()) : token;
+      if (surface.isEmpty()) {
+        continue;
       }
 
+      if (span.length() > 0 && !subword && shouldInsertSpace(previousToken, surface)) {
+        span.append(' ');
+      }
+      span.append(surface);
+      previousToken = surface;
     }
 
-    // Find where the span ends based on the tokens.
-    for (int x = 1; x <= index && x < tokens.length; x++) {
-      characterEnd += tokens[x].length();
-    }
+    return span.toString();
 
-    // Account for the number of spaces (that is the number of tokens).
-    // (One space per token.)
-    characterEnd += index - 1;
+  }
 
-    return new SpanEnd(index, characterEnd);
+  private static boolean shouldInsertSpace(String previousToken, String token) {
+    return previousToken != null && !hasNoSpaceBefore(token) && !hasNoSpaceAfter(previousToken);
+  }
 
+  private static boolean hasNoSpaceBefore(String token) {
+    return NO_SPACE_BEFORE_TOKENS.contains(token);
   }
 
-  private int maxIndex(float[] arr) {
+  private static boolean hasNoSpaceAfter(String token) {
+    return NO_SPACE_AFTER_TOKENS.contains(token);
+  }
+
+  /**
+   * Returns the index of the largest non-NaN score.
+   *
+   * @param arr The score array to inspect.
+   * @return The index of the maximum non-NaN value.
+   * @throws IllegalStateException Thrown if the model output contains no non-NaN score.
+   */
+  private static int maxIndex(float[] arr) {
 
     double max = Float.NEGATIVE_INFINITY;
     int index = -1;
 
     for (int x = 0; x < arr.length; x++) {
-      if (arr[x] > max) {
+      if (!Float.isNaN(arr[x]) && (index == -1 || arr[x] > max)) {
         index = x;
         max = arr[x];
       }
     }
 
+    if (index == -1) {
+      throw new IllegalStateException(
+          "Model output scores must contain at least one non-NaN value.");
+    }
+
     return index;
 
   }
 
-  private static String findByRegex(String text, String span) {
+  /**
+   * Locates reconstructed span text in a bounded region of the original input text.
+   *
+   * @param text The original text.
+   * @param span The reconstructed span text.
+   * @param searchStart The first character offset to search from.
+   * @param searchEnd The exclusive upper bound of the region to search.
+   * @return The matched character offsets, or {@code (-1, -1)} when the reconstructed text
+   *     cannot be found in the requested region.
+   */
+  private static SpanMatch findByRegex(String text, String span, int searchStart, int searchEnd) {
 
-    final String regex = span
-        .replaceAll(" ", "\\\\s+")
-        .replaceAll("\\)", "\\\\)")
-        .replaceAll("\\(", "\\\\(");
+    // Reconstructed span text normalizes whitespace, so match flexibly: a space in the span may
+    // map to any run of whitespace OR none in the source (e.g. punctuation/'&' inside "U.S.A",
+    // "AT&T" that wordpiece tokenization split apart). Use \s* rather than \s+ so such entities
+    // are still located instead of being silently dropped.
+    final String regex = Pattern.quote(span).replace(" ", "\\E\\s*\\Q");
 
     final Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
     final Matcher matcher = pattern.matcher(text);
+    final int regionStart = Math.min(Math.max(searchStart, 0), text.length());
+    final int regionEnd = Math.min(Math.max(searchEnd, regionStart), text.length());
+    matcher.region(regionStart, regionEnd);
 
     if (matcher.find()) {
-      return matcher.group(0);
+      return new SpanMatch(matcher.start(), matcher.end());
     }
 
-    // For some reason the regex match wasn't found. Just return the original span.
-    return span;
+    return new SpanMatch(-1, -1);
+
+  }
+
+  private record LabelPrediction(String label, double probability) {
+  }
 
+  private record EntityPrediction(int endIndex, double probability) {
+  }
+
+  /**
+   * Character offsets for a matched span. {@code (-1, -1)} means the reconstructed entity text
+   * could not be located in the searched source-text region.
+   */
+  private record SpanMatch(int start, int end) {
   }
 
   private List<Tokens> tokenize(final String text) {
diff --git a/opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/namefinder/NameFinderDLTest.java b/opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/namefinder/NameFinderDLTest.java
index 87fe18c9b..c0a8aede2 100644
--- a/opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/namefinder/NameFinderDLTest.java
+++ b/opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/namefinder/NameFinderDLTest.java
@@ -18,18 +18,31 @@
 package opennlp.dl.namefinder;
 
 import java.util.HashMap;
+import java.util.List;
 import java.util.Map;
 
 import org.junit.jupiter.api.Test;
 
 import opennlp.tools.tokenize.WordpieceTokenizer;
+import opennlp.tools.util.Span;
 
 import static org.junit.jupiter.api.Assertions.assertArrayEquals;
+import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertThrows;
 import static org.junit.jupiter.api.Assertions.assertTrue;
 
 public class NameFinderDLTest {
 
+  private static final Map<Integer, String> ID_TO_LABELS = Map.of(
+      0, "O",
+      1, "B-PER",
+      2, "I-PER",
+      3, "B-LOC",
+      4, "I-LOC",
+      5, "B-ORG",
+      6, "I-ORG",
+      7, "B-");
+
   private static Map<String, Integer> vocab() {
     final Map<String, Integer> vocab = new HashMap<>();
     vocab.put(WordpieceTokenizer.BERT_CLS_TOKEN, 0);
@@ -57,4 +70,359 @@ public class NameFinderDLTest {
     assertTrue(e.getMessage().contains("missing"),
         "the error message should name the missing token: " + e.getMessage());
   }
+
+  @Test
+  void testDecodeSpansUsesBioEntityTypesAndBoundedProbabilities() {
+    final String text = "Alice visited New York City.";
+    final String[] tokens = {"[CLS]", "Alice", "visited", "New", "York", "City", ".", "[SEP]"};
+    final float[][] scores = {
+        scoresFor(0), scoresFor(1), scoresFor(0), scoresFor(3), scoresFor(4), scoresFor(4),
+        scoresFor(0), scoresFor(0)
+    };
+
+    final List<Span> spans = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS);
+
+    assertEquals(2, spans.size());
+
+    final Span person = spans.get(0);
+    assertEquals("PER", person.getType());
+    assertEquals("Alice", person.getCoveredText(text));
+    assertProbability(person);
+
+    final Span location = spans.get(1);
+    assertEquals("LOC", location.getType());
+    assertEquals("New York City", location.getCoveredText(text));
+    assertProbability(location);
+  }
+
+  @Test
+  void testDecodeSpansReconstructsWordpiecesAndEscapedPunctuation() {
+    final String text = "Acme (UK) hired Sarah Connor.";
+    final String[] tokens = {"[CLS]", "Acme", "(", "UK", ")", "hired", "Sarah", "Con",
+        "##nor", ".", "[SEP]"};
+    final float[][] scores = {
+        scoresFor(0), scoresFor(5), scoresFor(6), scoresFor(6), scoresFor(6), scoresFor(0),
+        scoresFor(1), scoresFor(2), scoresFor(2), scoresFor(0), scoresFor(0)
+    };
+
+    final List<Span> spans = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS);
+
+    assertEquals(2, spans.size());
+    assertEquals("ORG", spans.get(0).getType());
+    assertEquals("Acme (UK)", spans.get(0).getCoveredText(text));
+    assertEquals("PER", spans.get(1).getType());
+    assertEquals("Sarah Connor", spans.get(1).getCoveredText(text));
+  }
+
+  @Test
+  void testDecodeSpansIgnoresMalformedBeginLabels() {
+    final String text = "Alice visited.";
+    final String[] tokens = {"[CLS]", "Alice", "visited", ".", "[SEP]"};
+    final float[][] scores = {
+        scoresFor(0), scoresFor(7), scoresFor(0), scoresFor(0), scoresFor(0)
+    };
+
+    final List<Span> spans = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS);
+
+    assertTrue(spans.isEmpty());
+  }
+
+  @Test
+  void testDecodeSpansRejectsMissingPredictedLabels() {
+    final String text = "Alice visited.";
+    final String[] tokens = {"[CLS]", "Alice", "visited", ".", "[SEP]"};
+    final float[][] scores = {
+        scoresFor(0), scoresFor(1), scoresFor(0), scoresFor(0), scoresFor(0)
+    };
+    final Map<Integer, String> incompleteLabels = Map.of(0, "O");
+
+    final IllegalStateException e = assertThrows(IllegalStateException.class, () ->
+        NameFinderDL.decodeSpans(text, tokens, scores, incompleteLabels));
+
+    assertTrue(e.getMessage().contains("1"),
+        "the error message should name the missing label id: " + e.getMessage());
+  }
+
+  @Test
+  void testDecodeSpansSearchStartLocatesNextOccurrence() {
+    // "Paris" appears twice. Threading the cursor past the first occurrence (as find() does
+    // across chunks/sentences) locates the second one instead of re-emitting the first, so a
+    // repeated entity is not duplicated at the same offset.
+    final String text = "Paris and Paris";
+    final String[] tokens = {"[CLS]", "Paris", "[SEP]"};
+    final float[][] scores = {scoresFor(0), scoresFor(3), scoresFor(0)};
+
+    final List<Span> first = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS, 0);
+    assertEquals(1, first.size());
+    assertEquals(0, first.get(0).getStart());
+    assertEquals(5, first.get(0).getEnd());
+
+    final List<Span> next = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS,
+        first.get(0).getEnd());
+    assertEquals(1, next.size());
+    assertEquals(10, next.get(0).getStart());
+    assertEquals(15, next.get(0).getEnd());
+    assertEquals("Paris", next.get(0).getCoveredText(text));
+  }
+
+  @Test
+  void testDecodeSpansLocatesEntityWithInternalPunctuation() {
+    // WordPiece splits "AT&T" into separate AT / & / T tokens, so the reconstructed span text
+    // ("AT & T") must still be located in the contiguous source. Regression guard for the
+    // flexible-whitespace (\s*) matching in findByRegex.
+    final String text = "Buy AT&T stock";
+    final String[] tokens = {"[CLS]", "Buy", "AT", "&", "T", "stock", "[SEP]"};
+    final float[][] scores = {
+        scoresFor(0), scoresFor(0), scoresFor(5), scoresFor(6), scoresFor(6),
+        scoresFor(0), scoresFor(0)
+    };
+
+    final List<Span> spans = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS);
+
+    assertEquals(1, spans.size());
+    assertEquals("ORG", spans.get(0).getType());
+    assertEquals("AT&T", spans.get(0).getCoveredText(text));
+  }
+
+  @Test
+  void testDecodeSpansDoesNotMatchBeyondSearchEnd() {
+    final String text = "London was quiet. Later Paris was loud.";
+    final String[] tokens = {"[CLS]", "Paris", "[SEP]"};
+    final float[][] scores = {scoresFor(0), scoresFor(3), scoresFor(0)};
+
+    final List<Span> spans = NameFinderDL.decodeSpans(
+        text, tokens, scores, ID_TO_LABELS, 0, text.indexOf(" Later"));
+
+    assertTrue(spans.isEmpty());
+  }
+
+  @Test
+  void testDecodeSpansMatchesSourceCaseInsensitively() {
+    // The reconstructed span text may differ in case from the source (e.g. an uncased model);
+    // findByRegex matches case-insensitively, so the span is still located at the source offsets.
+    final String text = "Visit PARIS today";
+    final String[] tokens = {"[CLS]", "Visit", "paris", "today", "[SEP]"};
+    final float[][] scores = {
+        scoresFor(0), scoresFor(0), scoresFor(3), scoresFor(0), scoresFor(0)
+    };
+
+    final List<Span> spans = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS);
+
+    assertEquals(1, spans.size());
+    assertEquals("LOC", spans.get(0).getType());
+    assertEquals("PARIS", spans.get(0).getCoveredText(text));
+  }
+
+  @Test
+  void testDecodeSpansSkipsNaNAndPicksLargestFinite() {
+    final String text = "Alice visited.";
+    final String[] tokens = {"[CLS]", "Alice", "visited", ".", "[SEP]"};
+    final float[][] scores = {
+        scoresFor(0), scoresWithNaN(1), scoresFor(0), scoresFor(0), scoresFor(0)
+    };
+
+    final List<Span> spans = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS);
+
+    assertEquals(1, spans.size());
+    assertEquals("Alice", spans.get(0).getCoveredText(text));
+  }
+
+  @Test
+  void testDecodeSpansRejectsAllNaNOrEmptyScores() {
+    final String text = "Alice visited.";
+    final String[] tokens = {"[CLS]", "Alice", "visited", ".", "[SEP]"};
+
+    assertThrows(IllegalStateException.class, () -> NameFinderDL.decodeSpans(text, tokens,
+        new float[][] {scoresFor(0), new float[] {Float.NaN, Float.NaN}, scoresFor(0),
+            scoresFor(0), scoresFor(0)}, ID_TO_LABELS));
+    assertThrows(IllegalStateException.class, () -> NameFinderDL.decodeSpans(text, tokens,
+        new float[][] {scoresFor(0), new float[0], scoresFor(0), scoresFor(0), scoresFor(0)},
+        ID_TO_LABELS));
+  }
+
+  @Test
+  void testDecodeSpansRejectsTokenScoreCountMismatch() {
+    // Fewer score rows than tokens is a model/tokenizer contract violation; the message must name
+    // both counts so the mismatch is debuggable.
+    final String text = "Alice visited.";
+    final String[] tokens = {"[CLS]", "Alice", "visited", ".", "[SEP]"};
+    final float[][] scores = {scoresFor(0), scoresFor(1)};
+
+    final IllegalArgumentException e = assertThrows(IllegalArgumentException.class, () ->
+        NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS));
+
+    assertTrue(e.getMessage().contains("5") && e.getMessage().contains("2"),
+        "the error message should name both counts: " + e.getMessage());
+  }
+
+  @Test
+  void testDecodeSpansIgnoresInsideLabelWithoutBegin() {
+    // An I-LOC with no preceding B-LOC is not a valid span start and must not emit an entity.
+    final String text = "Visit Paris today";
+    final String[] tokens = {"[CLS]", "Visit", "Paris", "today", "[SEP]"};
+    final float[][] scores = {
+        scoresFor(0), scoresFor(0), scoresFor(4), scoresFor(0), scoresFor(0)
+    };
+
+    final List<Span> spans = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS);
+
+    assertTrue(spans.isEmpty());
+  }
+
+  @Test
+  void testDecodeSpansSeparatesAdjacentEntitiesOfDifferentTypes() {
+    // B-PER directly followed by B-LOC must yield two distinct single-token spans, not one merged
+    // span: findEntityEnd stops at the type change and the outer loop resumes at the next begin.
+    final String text = "Alice Paris";
+    final String[] tokens = {"[CLS]", "Alice", "Paris", "[SEP]"};
+    final float[][] scores = {scoresFor(0), scoresFor(1), scoresFor(3), scoresFor(0)};
+
+    final List<Span> spans = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS);
+
+    assertEquals(2, spans.size());
+    assertEquals("PER", spans.get(0).getType());
+    assertEquals("Alice", spans.get(0).getCoveredText(text));
+    assertEquals("LOC", spans.get(1).getType());
+    assertEquals("Paris", spans.get(1).getCoveredText(text));
+  }
+
+  @Test
+  void testMultiTokenSpanProbabilityIsWeakestTokenProbability() {
+    // The probability of a multi-token entity is the minimum across its tokens, so a confident
+    // begin followed by a weak continuation reports the weak continuation's probability.
+    final String text = "New York";
+    final String[] tokens = {"[CLS]", "New", "York", "[SEP]"};
+    final float[] strongBegin = scoresFor(3);
+    final float[] weakInside = weakScoresFor(4);
+    final float[][] scores = {scoresFor(0), strongBegin, weakInside, scoresFor(0)};
+
+    final List<Span> spans = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS);
+
+    assertEquals(1, spans.size());
+    assertEquals("New York", spans.get(0).getCoveredText(text));
+    assertEquals(NameFinderDL.labelProbability(weakInside, 4), spans.get(0).getProb(), 1e-9);
+    assertTrue(spans.get(0).getProb() < NameFinderDL.labelProbability(strongBegin, 3),
+        "multi-token span should reflect its weakest continuation");
+  }
+
+  @Test
+  void testDecodeSpansEmitsRepeatedEntityAtDistinctOffsets() {
+    // Two identical surface forms within a single call must resolve to distinct, non-overlapping
+    // spans via the internal monotonic cursor rather than both matching the first occurrence.
+    final String text = "Paris and Paris";
+    final String[] tokens = {"[CLS]", "Paris", "and", "Paris", "[SEP]"};
+    final float[][] scores = {
+        scoresFor(0), scoresFor(3), scoresFor(0), scoresFor(3), scoresFor(0)
+    };
+
+    final List<Span> spans = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS);
+
+    assertEquals(2, spans.size());
+    assertEquals(0, spans.get(0).getStart());
+    assertEquals(5, spans.get(0).getEnd());
+    assertEquals(10, spans.get(1).getStart());
+    assertEquals(15, spans.get(1).getEnd());
+  }
+
+  @Test
+  void testDecodeSpansLocatesEntityWithRegexMetacharacters() {
+    // WordPiece splits "C++" into C / + / + tokens, so the reconstructed span text contains regex
+    // metacharacters. Pattern.quote must treat them literally (not as quantifiers) for the entity
+    // to be located in the source.
+    final String text = "Love C++ today";
+    final String[] tokens = {"[CLS]", "Love", "C", "+", "+", "today", "[SEP]"};
+    final float[][] scores = {
+        scoresFor(0), scoresFor(0), scoresFor(5), scoresFor(6), scoresFor(6),
+        scoresFor(0), scoresFor(0)
+    };
+
+    final List<Span> spans = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS);
+
+    assertEquals(1, spans.size());
+    assertEquals("ORG", spans.get(0).getType());
+    assertEquals("C++", spans.get(0).getCoveredText(text));
+  }
+
+  @Test
+  void testDecodeSpansClampsSearchStartBeyondText() {
+    // A searchStart past the end of the text must clamp to an empty region and yield no match
+    // rather than throwing an out-of-bounds error.
+    final String text = "Paris";
+    final String[] tokens = {"[CLS]", "Paris", "[SEP]"};
+    final float[][] scores = {scoresFor(0), scoresFor(3), scoresFor(0)};
+
+    final List<Span> spans = NameFinderDL.decodeSpans(text, tokens, scores, ID_TO_LABELS, 999);
+
+    assertTrue(spans.isEmpty());
+  }
+
+  @Test
+  void testLabelProbabilityIsBoundedStableSoftmax() {
+    // Reference (numpy): softmax([1,2,3])[2] = 0.66524096.
+    final double p = NameFinderDL.labelProbability(new float[] {1f, 2f, 3f}, 2);
+    assertEquals(0.66524096, p, 1e-6);
+    assertBounded(p);
+  }
+
+  @Test
+  void testLabelProbabilityHandlesPositiveInfinity() {
+    // Two +Inf logits split the mass; a finite logit alongside them gets zero.
+    final float[] scores = {Float.POSITIVE_INFINITY, 0f, Float.POSITIVE_INFINITY};
+    assertEquals(0.5, NameFinderDL.labelProbability(scores, 0), 1e-9);
+    assertEquals(0.0, NameFinderDL.labelProbability(scores, 1), 1e-9);
+    assertBounded(NameFinderDL.labelProbability(scores, 0));
+  }
+
+  @Test
+  void testLabelProbabilityHandlesAllNegativeInfinity() {
+    // No finite score: fall back to a uniform distribution rather than producing NaN.
+    final double p = NameFinderDL.labelProbability(
+        new float[] {Float.NEGATIVE_INFINITY, Float.NEGATIVE_INFINITY}, 0);
+    assertEquals(0.5, p, 1e-9);
+    assertBounded(p);
+  }
+
+  @Test
+  void testLabelProbabilityIgnoresNaNInDenominator() {
+    // A NaN logit must not poison the normalization of the finite ones.
+    final double p = NameFinderDL.labelProbability(new float[] {0f, Float.NaN, 0f}, 0);
+    assertEquals(0.5, p, 1e-9);
+    assertBounded(p);
+  }
+
+  private static float[] scoresFor(int labelIndex) {
+    final float[] scores = new float[ID_TO_LABELS.size()];
+    for (int i = 0; i < scores.length; i++) {
+      scores[i] = -5;
+    }
+    scores[labelIndex] = 5;
+    return scores;
+  }
+
+  private static float[] scoresWithNaN(int labelIndex) {
+    final float[] scores = scoresFor(labelIndex);
+    scores[0] = Float.NaN;
+    return scores;
+  }
+
+  // Lower-margin scores than scoresFor, so the chosen label's softmax probability is well below 1
+  // and a multi-token span's minimum-probability behavior is observable.
+  private static float[] weakScoresFor(int labelIndex) {
+    final float[] scores = new float[ID_TO_LABELS.size()];
+    for (int i = 0; i < scores.length; i++) {
+      scores[i] = -1;
+    }
+    scores[labelIndex] = 1;
+    return scores;
+  }
+
+  private static void assertProbability(Span span) {
+    assertTrue(span.getProb() > 0 && span.getProb() <= 1,
+        "span probability should be normalized to (0, 1]: " + span.getProb());
+  }
+
+  private static void assertBounded(double probability) {
+    assertTrue(probability >= 0 && probability <= 1,
+        "probability must be within [0, 1]: " + probability);
+  }
 }
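
Editorial aside, not part of the commit: the labelProbability tests above rely on a softmax that subtracts the largest score before exponentiating. The sketch below shows only that shift and why it matters. The class and method names are hypothetical, and it deliberately leaves out the +Inf, all-(-Inf) and NaN handling that the committed method adds.

```java
// Hypothetical illustration; these names do not exist in OpenNLP.
class StableSoftmaxSketch {

  // Softmax probability of scores[index], shifted by the maximum score so that
  // every exponent is <= 0 and Math.exp cannot overflow.
  static double softmaxAt(float[] scores, int index) {
    double max = Double.NEGATIVE_INFINITY;
    for (float s : scores) {
      if (s > max) {
        max = s;
      }
    }
    double denominator = 0;
    for (float s : scores) {
      denominator += Math.exp(s - max);
    }
    return Math.exp(scores[index] - max) / denominator;
  }

  // Unshifted variant, shown only for contrast: large logits overflow to
  // Infinity and the division yields NaN.
  static double naiveSoftmaxAt(float[] scores, int index) {
    double denominator = 0;
    for (float s : scores) {
      denominator += Math.exp(s);
    }
    return Math.exp(scores[index]) / denominator;
  }

  public static void main(String[] args) {
    final float[] large = {1000f, 1001f, 1002f};
    System.out.println(naiveSoftmaxAt(large, 2)); // prints NaN
    System.out.println(softmaxAt(large, 2));      // approximately 0.665
  }
}
```

Shifting by the maximum leaves the result mathematically unchanged, because the common factor exp(-max) cancels between numerator and denominator.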
diff --git a/opennlp-eval-tests/src/test/java/opennlp/dl/namefinder/NameFinderDLEval.java b/opennlp-eval-tests/src/test/java/opennlp/dl/namefinder/NameFinderDLEval.java
index 553c31590..e19742b96 100644
--- a/opennlp-eval-tests/src/test/java/opennlp/dl/namefinder/NameFinderDLEval.java
+++ b/opennlp-eval-tests/src/test/java/opennlp/dl/namefinder/NameFinderDLEval.java
@@ -69,12 +69,23 @@ public class NameFinderDLEval extends AbstractEvalTest {
         logger.debug(span.toString());
       }
 
-      Assertions.assertEquals(1, spans.length);
+      final String text = String.join(" ", tokens);
+
+      // The model emits a PER and a LOC entity; the person-only decoder previously dropped
+      // the location. Span types are the entity labels (PER/LOC), not the matched text.
+      Assertions.assertEquals(2, spans.length);
+
+      Assertions.assertEquals("PER", spans[0].getType());
       Assertions.assertEquals(0, spans[0].getStart());
       Assertions.assertEquals(17, spans[0].getEnd());
-      Assertions.assertEquals(8.251646041870117, spans[0].getProb(), 0.00001);
-      Assertions.assertEquals("George Washington",
-          spans[0].getCoveredText(String.join(" ", tokens)));
+      Assertions.assertEquals("George Washington", spans[0].getCoveredText(text));
+      Assertions.assertTrue(spans[0].getProb() > 0 && spans[0].getProb() <= 1);
+
+      Assertions.assertEquals("LOC", spans[1].getType());
+      Assertions.assertEquals(39, spans[1].getStart());
+      Assertions.assertEquals(52, spans[1].getEnd());
+      Assertions.assertEquals("United States", spans[1].getCoveredText(text));
+      Assertions.assertTrue(spans[1].getProb() > 0 && spans[1].getProb() <= 1);
     }
 
   }
@@ -113,10 +124,16 @@ public class NameFinderDLEval extends AbstractEvalTest {
             startGate.await();
             for (int i = 0; i < iterationsPerThread; i++) {
               final Span[] spans = nameFinderDL.find(tokens);
-              if (spans.length != 1
+              // The all-entity decoder yields both the PER and the LOC span for this input.
+              if (spans.length != 2
                   || spans[0].getStart() != 0
                   || spans[0].getEnd() != 17
-                  || !"George Washington".equals(spans[0].getCoveredText(text))) {
+                  || !"PER".equals(spans[0].getType())
+                  || !"George Washington".equals(spans[0].getCoveredText(text))
+                  || spans[1].getStart() != 39
+                  || spans[1].getEnd() != 52
+                  || !"LOC".equals(spans[1].getType())
+                  || !"United States".equals(spans[1].getCoveredText(text))) {
                 return false;
               }
             }
@@ -151,6 +168,7 @@ public class NameFinderDLEval extends AbstractEvalTest {
 
     final String[] tokens = new String[]
         {"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
+    final String text = String.join(" ", tokens);
 
     // Explicitly construct the detector inside the test to make the precondition visible.
     final SentenceDetectorME detector = new SentenceDetectorME("en");
@@ -171,9 +189,16 @@ public class NameFinderDLEval extends AbstractEvalTest {
             startGate.await();
             for (int i = 0; i < iterationsPerThread; i++) {
               final Span[] spans = nameFinderDL.find(tokens);
-              if (spans.length != 1
+              // The all-entity decoder yields both the PER and the LOC span for this input.
+              if (spans.length != 2
                   || spans[0].getStart() != 0
-                  || spans[0].getEnd() != 17) {
+                  || spans[0].getEnd() != 17
+                  || !"PER".equals(spans[0].getType())
+                  || !"George Washington".equals(spans[0].getCoveredText(text))
+                  || spans[1].getStart() != 39
+                  || spans[1].getEnd() != 52
+                  || !"LOC".equals(spans[1].getType())
+                  || !"United States".equals(spans[1].getCoveredText(text))) {
                 return false;
               }
             }
@@ -213,8 +238,9 @@ public class NameFinderDLEval extends AbstractEvalTest {
         options, sentenceDetector)) {
 
       final Span[] baseline = nameFinderDL.find(tokens);
-      Assertions.assertEquals(1, baseline.length);
+      Assertions.assertEquals(2, baseline.length);
       Assertions.assertEquals("George Washington", baseline[0].getCoveredText(text));
+      Assertions.assertEquals("United States", baseline[1].getCoveredText(text));
 
       // Mutate the options in ways that would change inference if they were read live:
       // a split size of 1 would chunk the input one word at a time.
@@ -224,11 +250,14 @@ public class NameFinderDLEval extends AbstractEvalTest {
       options.setSplitOverlapSize(0);
 
       final Span[] afterMutation = nameFinderDL.find(tokens);
-      Assertions.assertEquals(1, afterMutation.length,
+      Assertions.assertEquals(2, afterMutation.length,
           "mutating InferenceOptions after construction must not affect a built instance");
       Assertions.assertEquals(0, afterMutation[0].getStart());
       Assertions.assertEquals(17, afterMutation[0].getEnd());
       Assertions.assertEquals("George Washington", afterMutation[0].getCoveredText(text));
+      Assertions.assertEquals(39, afterMutation[1].getStart());
+      Assertions.assertEquals(52, afterMutation[1].getEnd());
+      Assertions.assertEquals("United States", afterMutation[1].getCoveredText(text));
     }
 
   }
@@ -253,8 +282,11 @@ public class NameFinderDLEval extends AbstractEvalTest {
       }
 
       Assertions.assertEquals(1, spans.length);
+      Assertions.assertEquals("PER", spans[0].getType());
       Assertions.assertEquals(13, spans[0].getStart());
       Assertions.assertEquals(30, spans[0].getEnd());
+      Assertions.assertEquals("George Washington",
+          spans[0].getCoveredText(String.join(" ", tokens)));
     }
   }
 
@@ -278,8 +310,10 @@ public class NameFinderDLEval extends AbstractEvalTest {
       }
 
       Assertions.assertEquals(1, spans.length);
+      Assertions.assertEquals("PER", spans[0].getType());
       Assertions.assertEquals(13, spans[0].getStart());
       Assertions.assertEquals(19, spans[0].getEnd());
+      Assertions.assertEquals("George", spans[0].getCoveredText(String.join(" ", tokens)));
     }
   }
 
@@ -342,11 +376,17 @@ public class NameFinderDLEval extends AbstractEvalTest {
         logger.debug(span.toString());
       }
 
+      final String text = String.join(" ", tokens);
+
       Assertions.assertEquals(2, spans.length);
+      Assertions.assertEquals("PER", spans[0].getType());
       Assertions.assertEquals(0, spans[0].getStart());
       Assertions.assertEquals(17, spans[0].getEnd());
+      Assertions.assertEquals("George Washington", spans[0].getCoveredText(text));
+      Assertions.assertEquals("PER", spans[1].getType());
       Assertions.assertEquals(22, spans[1].getStart());
       Assertions.assertEquals(37, spans[1].getEnd());
+      Assertions.assertEquals("Abraham Lincoln", spans[1].getCoveredText(text));
 
     }

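Editorial aside, not part of the commit: the findByRegex change in this diff quotes the reconstructed span with Pattern.quote and then lets each space match zero or more whitespace characters, inside a clamped search region. The sketch below reproduces that idea in isolation so it can be run on its own. The class name, the locate helper and its int[] return value are hypothetical, and the spaced inputs ("AT & T", "C + +") are assumed reconstructions rather than output of the real buildSpanText.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; these names do not exist in OpenNLP.
class FlexibleSpanMatchSketch {

  // Returns {start, end} of the first match inside [from, to), or {-1, -1}.
  static int[] locate(String text, String span, int from, int to) {
    // Pattern.quote wraps the span in \Q...\E. Each space closes the quote,
    // permits any run of whitespace (including none), and reopens the quote.
    final String regex = Pattern.quote(span).replace(" ", "\\E\\s*\\Q");
    final Matcher matcher =
        Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);

    // Clamp the region so out-of-range bounds give an empty search, not an exception.
    final int start = Math.min(Math.max(from, 0), text.length());
    final int end = Math.min(Math.max(to, start), text.length());
    matcher.region(start, end);

    return matcher.find()
        ? new int[] {matcher.start(), matcher.end()}
        : new int[] {-1, -1};
  }

  public static void main(String[] args) {
    final String text = "Buy AT&T stock, love C++ today";
    final int[] org = locate(text, "AT & T", 0, text.length());
    System.out.println(text.substring(org[0], org[1])); // prints AT&T
    final int[] lang = locate(text, "C + +", 0, text.length());
    System.out.println(text.substring(lang[0], lang[1])); // prints C++
  }
}
```

Because only the spaces leave the quoted section, characters such as "+", "(" and "&" are always matched literally, which is the property the C++ and AT&T tests guard.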