This is an automated email from the ASF dual-hosted git repository.
krickert pushed a commit to branch OPENNLP-1833-grpc-expansion
in repository https://gitbox.apache.org/repos/asf/opennlp-sandbox.git
The following commit(s) were added to refs/heads/OPENNLP-1833-grpc-expansion by
this push:
new cd64ff61 OPENNLP-1833: Add embeddings, segmentation and semantic
chunking to the gRPC server
cd64ff61 is described below
commit cd64ff6171381607864dcb91ac3c9429d9d9e943
Author: Kristian Rickert <[email protected]>
AuthorDate: Wed Jun 10 08:38:57 2026 -0400
OPENNLP-1833: Add embeddings, segmentation and semantic chunking to the
gRPC server
Adds an EmbeddingProvider abstraction with ONNX Runtime CPU and CUDA
implementations behind a strict factory (model.embedder.backend=onnx|cuda).
Embedding models are declared per id in the server config, loaded eagerly,
and the dimension is read from the ONNX session metadata. The embedder uses
the standard single-segment BERT encoding, maps OOV tokens to the unknown id,
truncates at 512 wordpieces, only sends inputs the model declares, and closes
all native resources deterministically.
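
The encoding step described above can be sketched roughly as follows. This is
a minimal illustration, not the code from OnnxSentenceEmbedder: the class
EncodingSketch, the Encoded record, the toy vocabulary and the choice to count
[CLS] and [SEP] against the 512 limit are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class EncodingSketch {

  /** Hypothetical holder for the three standard BERT inputs. */
  record Encoded(long[] inputIds, long[] attentionMask, long[] tokenTypeIds) {}

  static Encoded encode(List<String> wordpieces, Map<String, Integer> vocab, int maxLen) {
    final int unk = vocab.get("[UNK]");
    // Assumption: the limit covers the whole sequence, so two slots are reserved.
    final int keep = Math.min(wordpieces.size(), maxLen - 2);
    final List<Integer> ids = new ArrayList<>();
    ids.add(vocab.get("[CLS]"));
    for (int i = 0; i < keep; i++) {
      // Out-of-vocabulary wordpieces fall back to the unknown id.
      ids.add(vocab.getOrDefault(wordpieces.get(i), unk));
    }
    ids.add(vocab.get("[SEP]"));

    final long[] inputIds = new long[ids.size()];
    final long[] mask = new long[ids.size()];
    final long[] types = new long[ids.size()]; // all zeros: single segment
    for (int i = 0; i < inputIds.length; i++) {
      inputIds[i] = ids.get(i);
      mask[i] = 1L;
    }
    return new Encoded(inputIds, mask, types);
  }

  public static void main(String[] args) {
    final Map<String, Integer> vocab =
        Map.of("[CLS]", 101, "[SEP]", 102, "[UNK]", 100, "hello", 7592);
    final Encoded e = encode(List.of("hello", "zzz"), vocab, 512);
    System.out.println(Arrays.toString(e.inputIds())); // prints [101, 7592, 100, 102]
  }
}
```

A real embedder would only pass the inputs the model declares (for example,
omitting the token type ids when the ONNX graph has no such input).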
Chunking adds sentence, token-window and semantic algorithms, wired through
chunk_embed_configs and PIPELINE_STEP_CHUNK with per-chunk embeddings and
group statistics. Semantic chunking places boundaries on consecutive-sentence
cosine similarity with percentile or fixed thresholds and min/max size
constraints.
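
As a rough illustration of the boundary rule, the sketch below computes the
cosine similarity of each consecutive pair of sentence vectors and starts a
new chunk wherever it falls below a fixed threshold. The class BoundarySketch,
the toy two-dimensional vectors and the handling of zero vectors are
assumptions for the example; percentile thresholds and the min/max size pass
are left out.

```java
import java.util.ArrayList;
import java.util.List;

public class BoundarySketch {

  static float cosine(float[] a, float[] b) {
    double dot = 0;
    double normA = 0;
    double normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) {
      return 0f; // assumption: a zero vector is treated as dissimilar
    }
    return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
  }

  /** Returns the index of the first sentence of every chunk. */
  static List<Integer> chunkStarts(float[][] embeddings, float threshold) {
    final List<Integer> starts = new ArrayList<>();
    starts.add(0);
    for (int i = 0; i + 1 < embeddings.length; i++) {
      if (cosine(embeddings[i], embeddings[i + 1]) < threshold) {
        starts.add(i + 1);
      }
    }
    return starts;
  }

  public static void main(String[] args) {
    final float[][] sentences = {
        {1.0f, 0.0f}, {0.9f, 0.1f}, // two sentences on one topic
        {0.0f, 1.0f}, {0.1f, 0.9f}  // two sentences on another topic
    };
    System.out.println(chunkStarts(sentences, 0.5f)); // prints [0, 2]
  }
}
```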
The GPU build (-Dgpu) swaps onnxruntime for onnxruntime_gpu so the CPU and
CUDA runtimes never coexist on the classpath. Covered by unit tests for the
providers, factory, chunkers and analyzer paths; README updated.
---
opennlp-grpc/README.md | 96 +++++++-
opennlp-grpc/opennlp-grpc-service/pom.xml | 60 +++++
.../opennlp/grpc/chunk/ChunkEmbedProcessor.java | 256 ++++++++++++++++++++
.../opennlp/grpc/chunk/SegmentationChunker.java | 169 +++++++++++++
.../apache/opennlp/grpc/chunk/SemanticChunker.java | 219 +++++++++++++++++
.../embedding/AbstractOnnxEmbeddingProvider.java | 269 +++++++++++++++++++++
.../grpc/embedding/CudaEmbeddingProvider.java | 51 ++++
.../opennlp/grpc/embedding/EmbeddingProvider.java | 98 ++++++++
.../grpc/embedding/EmbeddingProviderFactory.java | 63 +++++
.../embedding/OnnxRuntimeEmbeddingProvider.java | 46 ++++
.../grpc/embedding/OnnxSentenceEmbedder.java | 211 ++++++++++++++++
.../opennlp/grpc/model/ModelBundleCache.java | 46 +++-
.../grpc/processor/BasicDocumentAnalyzer.java | 255 ++++++++++++++++---
.../opennlp/grpc/processor/PipelineStepPolicy.java | 4 +-
.../opennlp/grpc/server/OpenNlpGrpcServer.java | 5 +-
.../chunk/ChunkEmbedProcessorSemanticTest.java | 80 ++++++
.../grpc/chunk/SegmentationChunkerTest.java | 104 ++++++++
.../opennlp/grpc/chunk/SemanticChunkerTest.java | 151 ++++++++++++
.../embedding/EmbeddingProviderFactoryTest.java | 64 +++++
.../grpc/embedding/StubEmbeddingProvider.java | 89 +++++++
.../BasicDocumentAnalyzerChunkEmbedTest.java | 106 ++++++++
.../BasicDocumentAnalyzerEmbeddingTest.java | 91 +++++++
.../processor/BasicDocumentAnalyzerPolicyTest.java | 36 ++-
.../BasicDocumentAnalyzerSemanticChunkTest.java | 88 +++++++
24 files changed, 2614 insertions(+), 43 deletions(-)
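
Reading aid for the token-window algorithm added in SegmentationChunker in
the diff that follows: the window advances by chunk_size - chunk_overlap
tokens and stops once a window reaches the last token. The sketch below
reproduces only that index arithmetic; the class TokenWindowSketch and the
use of plain token indices instead of character spans are assumptions for
illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenWindowSketch {

  /** Returns inclusive {firstToken, lastToken} index pairs. */
  static List<int[]> windows(int tokenCount, int chunkSize, int chunkOverlap) {
    if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
      throw new IllegalArgumentException("need chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
    }
    final List<int[]> out = new ArrayList<>();
    final int step = chunkSize - chunkOverlap;
    for (int start = 0; start < tokenCount; start += step) {
      final int end = Math.min(start + chunkSize, tokenCount) - 1;
      out.add(new int[] {start, end});
      if (end == tokenCount - 1) {
        break; // the last token is covered; avoid a trailing overlap-only window
      }
    }
    return out;
  }

  public static void main(String[] args) {
    final StringBuilder sb = new StringBuilder();
    for (int[] w : windows(10, 4, 1)) {
      sb.append(w[0]).append('-').append(w[1]).append(' ');
    }
    System.out.println(sb.toString().trim()); // prints 0-3 3-6 6-9
  }
}
```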
diff --git a/opennlp-grpc/README.md b/opennlp-grpc/README.md
index cae50bef..34b32d67 100644
--- a/opennlp-grpc/README.md
+++ b/opennlp-grpc/README.md
@@ -60,10 +60,98 @@ By default no configuration is required: the server loads the bundled English
sentence-detector and tokenizer from the classpath.
> v1 note: this minimal slice implements sentence detection, tokenization,
-> probability reporting, `max_text_length`, offset encoding selection, and the
-> default `en-basic` model bundle. Unsupported backends, ONNX embedding model
-> selection, non-default bundles, and chunk/embed configs are rejected explicitly
-> instead of being silently ignored.
+> sentence-level embeddings (when ONNX models are configured), segmentation chunking
+> (`sentence` and `token` algorithms via `chunk_embed_configs` or `PIPELINE_STEP_CHUNK`),
+> probability reporting, `max_text_length`, offset encoding selection, and the default
+> `en-basic` model bundle. Semantic chunking (`algorithm: semantic`) and CPU/GPU ONNX
+> embeddings are supported when models are configured.
+> OpenVINO, classic syntactic `ChunkerME`, non-default bundles, and per-entry chunk
+> profiles are rejected explicitly instead of being silently ignored.
+
+### Embedding models (optional)
+
+Register ONNX sentence-transformer models in the server config:
+
+```ini
+model.embedder.default_id=sentence-transformers
+model.embedder.sentence-transformers.onnx.path=/path/to/model.onnx
+model.embedder.sentence-transformers.vocab.path=/path/to/vocab.txt
+```
+
+Request embeddings by adding `PIPELINE_STEP_EMBED` to the analysis profile and
+setting `options.onnx_embedding_model_id` (or rely on `default_id` when only
+one model is registered). Uses ONNX Runtime via `opennlp-dl` on CPU by default.
+
+#### GPU embeddings (optional)
+
+Build with the GPU flavor, which replaces the `onnxruntime` jar with
+`onnxruntime_gpu` (exactly one of the two is ever on the classpath), and point
+the server at CUDA:
+
+```bash
+mvn -pl opennlp-grpc/opennlp-grpc-service -Dgpu package
+```
+
+```ini
+model.embedder.backend=cuda
+model.embedder.gpu_device_id=0
+model.embedder.default_id=sentence-transformers
+model.embedder.sentence-transformers.onnx.path=/path/to/model.onnx
+model.embedder.sentence-transformers.vocab.path=/path/to/vocab.txt
+```
+
+`model.embedder.backend` accepts `onnx` (default, CPU) or `cuda`; any other value
+is rejected at startup. `model.embedder.gpu_device_id` is only valid with the
+`cuda` backend. Clients should set `inference_backend` to `INFERENCE_BACKEND_CUDA`
+(or legacy `INFERENCE_BACKEND_ONNX_RUNTIME_GPU`) when requesting embeddings or
+chunk embeddings. Requires an NVIDIA CUDA runtime on the host.
+
+### Chunk + embed configs
+
+Request one or more chunking strategies with per-chunk embeddings:
+
+```json
+{
+ "chunk_embed_configs": [
+ {
+ "config_id": "sentence-chunks",
+ "chunking": { "algorithm": "sentence" },
+ "embedding_model_ids": ["sentence-transformers"]
+ },
+ {
+ "config_id": "token-chunks",
+ "chunking": { "algorithm": "token", "chunk_size": 128, "chunk_overlap":
16 },
+ "embedding_model_ids": ["sentence-transformers"]
+ }
+ ]
+}
+```
+
+The server auto-runs sentence detection (and tokenization for `token` windows) once,
+then returns each strategy as a `chunk_embedding_groups` entry with embeddings
+attached inside each chunk.
+
+#### Semantic chunking
+
+Topic-boundary chunking compares consecutive sentence embeddings and splits where
+cosine similarity falls below the threshold: the `percentile_threshold` percentile of
+the observed similarities when set, otherwise `semantic_config.similarity_threshold`
+(default `0.5`). Example:
+
+```json
+{
+ "config_id": "semantic-topics",
+ "chunking": {
+ "algorithm": "semantic",
+ "semantic_config": {
+ "similarity_threshold": 0.75,
+ "min_chunk_sentences": 1,
+ "max_chunk_sentences": 8,
+ "semantic_embedding_model_id": "sentence-transformers"
+ }
+ },
+ "embedding_model_ids": ["sentence-transformers"]
+}
+```
## v1 API
diff --git a/opennlp-grpc/opennlp-grpc-service/pom.xml b/opennlp-grpc/opennlp-grpc-service/pom.xml
index 4e2400fb..7b559fb8 100644
--- a/opennlp-grpc/opennlp-grpc-service/pom.xml
+++ b/opennlp-grpc/opennlp-grpc-service/pom.xml
@@ -30,6 +30,11 @@
<artifactId>opennlp-grpc-service</artifactId>
<name>Apache OpenNLP gRPC Server</name>
+ <properties>
+ <!-- Must match the onnxruntime version managed by opennlp-dl ${opennlp.version}. -->
+ <onnxruntime.version>1.25.0</onnxruntime.version>
+ </properties>
+
<dependencies>
<dependency>
<groupId>org.apache.opennlp</groupId>
@@ -55,6 +60,23 @@
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-model-resolver</artifactId>
</dependency>
+ <!--
+ ONNX embeddings support (Maven Central artifact, not the sandbox DL4J module).
+ The onnxruntime jar is excluded here and supplied by exactly one of the
+ cpu/gpu profiles below, so the CPU and CUDA runtimes (which ship the same
+ ai.onnxruntime classes) can never coexist on the classpath.
+ -->
+ <dependency>
+ <groupId>org.apache.opennlp</groupId>
+ <artifactId>opennlp-dl</artifactId>
+ <version>${opennlp.version}</version>
+ <exclusions>
+ <exclusion>
+ <groupId>com.microsoft.onnxruntime</groupId>
+ <artifactId>onnxruntime</artifactId>
+ </exclusion>
+ </exclusions>
+ </dependency>
<dependency>
<groupId>org.apache.opennlp</groupId>
@@ -131,6 +153,44 @@
</dependencies>
+ <!--
+ Exactly one ONNX Runtime flavor is active at a time. The default cpu profile
+ is replaced by the gpu profile when building with -Dgpu (mirrors the
+ opennlp-dl / opennlp-dl-gpu split in the main OpenNLP repository).
+ -->
+ <profiles>
+ <profile>
+ <id>cpu</id>
+ <activation>
+ <property>
+ <name>!gpu</name>
+ </property>
+ </activation>
+ <dependencies>
+ <dependency>
+ <groupId>com.microsoft.onnxruntime</groupId>
+ <artifactId>onnxruntime</artifactId>
+ <version>${onnxruntime.version}</version>
+ </dependency>
+ </dependencies>
+ </profile>
+ <profile>
+ <id>gpu</id>
+ <activation>
+ <property>
+ <name>gpu</name>
+ </property>
+ </activation>
+ <dependencies>
+ <dependency>
+ <groupId>com.microsoft.onnxruntime</groupId>
+ <artifactId>onnxruntime_gpu</artifactId>
+ <version>${onnxruntime.version}</version>
+ </dependency>
+ </dependencies>
+ </profile>
+ </profiles>
+
<build>
<plugins>
<plugin>
diff --git a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/chunk/ChunkEmbedProcessor.java b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/chunk/ChunkEmbedProcessor.java
new file mode 100644
index 00000000..74140bac
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/chunk/ChunkEmbedProcessor.java
@@ -0,0 +1,256 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.chunk;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+
+import org.apache.opennlp.grpc.embedding.EmbeddingProvider;
+import org.apache.opennlp.grpc.processor.AnalysisException;
+import org.apache.opennlp.grpc.v1.AnnotatedSentence;
+import org.apache.opennlp.grpc.v1.AnnotationSpan;
+import org.apache.opennlp.grpc.v1.Chunk;
+import org.apache.opennlp.grpc.v1.ChunkEmbedConfigEntry;
+import org.apache.opennlp.grpc.v1.ChunkEmbeddingGroup;
+import org.apache.opennlp.grpc.v1.ChunkGroupStats;
+import org.apache.opennlp.grpc.v1.ChunkingSpec;
+import org.apache.opennlp.grpc.v1.CoordinateSpace;
+import org.apache.opennlp.grpc.v1.DiagnosticSeverity;
+import org.apache.opennlp.grpc.v1.EmbeddingGranularity;
+import org.apache.opennlp.grpc.v1.EmbeddingResult;
+import org.apache.opennlp.grpc.v1.OpenNlpDocument;
+import org.apache.opennlp.grpc.v1.PipelineStep;
+import org.apache.opennlp.grpc.v1.ProcessingDiagnostic;
+
+/**
+ * Builds {@link ChunkEmbeddingGroup} results from {@link ChunkEmbedConfigEntry} requests.
+ */
+public final class ChunkEmbedProcessor {
+
+ private ChunkEmbedProcessor() {
+ }
+
+ /**
+ * Validates a chunk+embed config entry against the server's capabilities before any
+ * processing starts, so invalid requests fail without partial results.
+ *
+ * @param entry The config entry to validate.
+ * @param embeddingProvider The provider whose registered models are checked.
+ *
+ * @throws AnalysisException If the entry is incomplete, references unknown embedding
+ * models, or requires features this server does not provide.
+ */
+ public static void validateEntry(ChunkEmbedConfigEntry entry, EmbeddingProvider embeddingProvider) {
+ if (entry.getConfigId().isBlank()) {
+ throw AnalysisException.invalidArgument("chunk_embed_configs.config_id is required");
+ }
+ if (entry.hasProfile()) {
+ throw AnalysisException.unimplemented(
+ "per-entry analysis profiles in chunk_embed_configs are not implemented");
+ }
+ if (!entry.hasChunking()) {
+ throw AnalysisException.invalidArgument(
+ "chunk_embed_configs.chunking is required for config '" + entry.getConfigId() + "'");
+ }
+ final ChunkingSpec chunking = entry.getChunking();
+ if (isSemantic(chunking)) {
+ validateSemanticChunking(entry);
+ if (!embeddingProvider.isAvailable()) {
+ throw AnalysisException.notFound(
+ "semantic chunking for config '" + entry.getConfigId()
+ + "' requires configured embedding models on this server");
+ }
+ }
+ if (entry.getEmbeddingModelIdsCount() > 0 && !embeddingProvider.isAvailable()) {
+ throw AnalysisException.notFound(
+ "embedding models requested for config '" + entry.getConfigId()
+ + "' but no embedding models are configured on this server");
+ }
+ for (String modelId : entry.getEmbeddingModelIdsList()) {
+ if (!embeddingProvider.supportsModel(modelId)) {
+ throw AnalysisException.notFound("Unknown embedding model '" + modelId + "'");
+ }
+ }
+ }
+
+ /**
+ * Chunks the document according to the entry's chunking spec and embeds every chunk
+ * with each requested embedding model.
+ *
+ * @param rawText The document text the annotation offsets refer to.
+ * @param document The analyzed document backbone.
+ * @param entry A previously validated config entry.
+ * @param embeddingProvider The provider used for chunk embeddings and semantic chunking.
+ *
+ * @return The resulting chunk group including per-group statistics.
+ */
+ public static ChunkEmbeddingGroup buildGroup(
+ String rawText,
+ OpenNlpDocument document,
+ ChunkEmbedConfigEntry entry,
+ EmbeddingProvider embeddingProvider) {
+ final long started = System.currentTimeMillis();
+ final List<SegmentationChunker.ChunkSegment> segments =
+ SegmentationChunker.segment(rawText, document, entry.getChunking(), embeddingProvider);
+
+ final ChunkEmbeddingGroup.Builder group = ChunkEmbeddingGroup.newBuilder()
+ .setGroupId(entry.getConfigId())
+ .setChunkConfigId(entry.getConfigId())
+ .addAllEmbeddingModelIds(entry.getEmbeddingModelIdsList())
+ .setGranularity(EmbeddingGranularity.EMBEDDING_GRANULARITY_CHUNK_LEVEL);
+ if (entry.hasResultSetName()) {
+ group.setResultSetName(entry.getResultSetName());
+ }
+
+ int totalTokens = 0;
+ for (SegmentationChunker.ChunkSegment segment : segments) {
+ final String chunkText = rawText.substring(segment.start(), segment.end());
+ final Chunk.Builder chunk = Chunk.newBuilder()
+ .setAnnotationSpan(toSpan(segment.start(), segment.end()))
+ .setTextContent(chunkText)
+ .addAllContainedSentenceIndices(segment.sentenceIndices());
+ totalTokens += countTokens(document, segment);
+ for (String modelId : entry.getEmbeddingModelIdsList()) {
+ final float[] vector = embeddingProvider.embed(modelId, chunkText);
+ chunk.addEmbeddings(EmbeddingResult.newBuilder()
+ .setModelId(modelId)
+ .addAllVector(toFloatList(vector))
+ .setSourceSpan(toSpan(segment.start(), segment.end()))
+ .setGranularity(EmbeddingGranularity.EMBEDDING_GRANULARITY_CHUNK_LEVEL)
+ .build());
+ }
+ group.addChunks(chunk.build());
+ }
+
+ group.setStats(ChunkGroupStats.newBuilder()
+ .setChunkCount(segments.size())
+ .setTotalTokens(totalTokens)
+ .setProcessingTimeMs(System.currentTimeMillis() - started)
+ .build());
+ return group.build();
+ }
+
+ /**
+ * Builds a sentence-per-chunk group without embeddings, used when the {@code CHUNK}
+ * pipeline step runs without chunk+embed configs.
+ *
+ * @param rawText The document text the annotation offsets refer to.
+ * @param document The analyzed document backbone.
+ * @param groupId The id assigned to the resulting group.
+ *
+ * @return The resulting chunk group.
+ */
+ public static ChunkEmbeddingGroup buildSentenceGroup(
+ String rawText, OpenNlpDocument document, String groupId) {
+ final ChunkingSpec spec = ChunkingSpec.newBuilder().setAlgorithm("sentence").build();
+ final ChunkEmbedConfigEntry entry = ChunkEmbedConfigEntry.newBuilder()
+ .setConfigId(groupId)
+ .setChunking(spec)
+ .build();
+ return buildGroup(rawText, document, entry, new NoOpEmbeddingProvider());
+ }
+
+ /**
+ * @param configId The config id the diagnostic refers to.
+ * @param chunkCount The number of chunks produced for the config.
+ *
+ * @return An INFO diagnostic for a successfully processed chunk config.
+ */
+ public static ProcessingDiagnostic successDiagnostic(String configId, int chunkCount) {
+ return ProcessingDiagnostic.newBuilder()
+ .setStep(PipelineStep.PIPELINE_STEP_CHUNK)
+ .setSeverity(DiagnosticSeverity.DIAGNOSTIC_SEVERITY_INFO)
+ .setMessage("Produced " + chunkCount + " chunk(s) for config '" + configId + "'")
+ .build();
+ }
+
+ private static void validateSemanticChunking(ChunkEmbedConfigEntry entry) {
+ final var semantic = entry.getChunking().getSemanticConfig();
+ if (semantic.hasSemanticEmbeddingModelId() && !semantic.getSemanticEmbeddingModelId().isBlank()) {
+ return;
+ }
+ if (entry.getEmbeddingModelIdsCount() == 1) {
+ return;
+ }
+ throw AnalysisException.invalidArgument(
+ "semantic chunking requires semantic_embedding_model_id or exactly one embedding_model_id");
+ }
+
+ private static boolean isSemantic(ChunkingSpec chunking) {
+ return "semantic".equals(chunking.getAlgorithm()) || chunking.hasSemanticConfig();
+ }
+
+ private static int countTokens(OpenNlpDocument document, SegmentationChunker.ChunkSegment segment) {
+ int count = 0;
+ for (int sentenceIndex : segment.sentenceIndices()) {
+ final AnnotatedSentence sentence = document.getSentences(sentenceIndex);
+ for (var token : sentence.getTokensList()) {
+ final AnnotationSpan span = token.getAnnotationSpan();
+ if (span.getStart() < segment.end() && span.getEnd() > segment.start()) {
+ count++;
+ }
+ }
+ }
+ return count;
+ }
+
+ private static AnnotationSpan toSpan(int start, int end) {
+ return AnnotationSpan.newBuilder()
+ .setStart(start)
+ .setEnd(end)
+ .setSpace(CoordinateSpace.COORDINATE_SPACE_CHAR_DOCUMENT)
+ .build();
+ }
+
+ private static List<Float> toFloatList(float[] vector) {
+ final List<Float> values = new ArrayList<>(vector.length);
+ for (float value : vector) {
+ values.add(value);
+ }
+ return values;
+ }
+
+ /** Embedding provider that rejects embed calls; used for chunk-only groups. */
+ private static final class NoOpEmbeddingProvider implements EmbeddingProvider {
+ @Override
+ public boolean isAvailable() {
+ return false;
+ }
+
+ @Override
+ public Set<String> registeredModelIds() {
+ return Set.of();
+ }
+
+ @Override
+ public boolean supportsModel(String modelId) {
+ return false;
+ }
+
+ @Override
+ public int embeddingDimension(String modelId) {
+ return 0;
+ }
+
+ @Override
+ public float[] embed(String modelId, String text) {
+ throw AnalysisException.failedPrecondition("embeddings were not requested for this group");
+ }
+ }
+}
diff --git a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/chunk/SegmentationChunker.java b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/chunk/SegmentationChunker.java
new file mode 100644
index 00000000..55d9e42b
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/chunk/SegmentationChunker.java
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.chunk;
+
+import java.util.ArrayList;
+import java.util.LinkedHashSet;
+import java.util.List;
+import java.util.Set;
+
+import org.apache.opennlp.grpc.embedding.EmbeddingProvider;
+import org.apache.opennlp.grpc.processor.AnalysisException;
+import org.apache.opennlp.grpc.v1.AnnotationSpan;
+import org.apache.opennlp.grpc.v1.ChunkingSpec;
+import org.apache.opennlp.grpc.v1.OpenNlpDocument;
+import org.apache.opennlp.grpc.v1.Token;
+
+/**
+ * RAG-style segmentation chunking over an analyzed document backbone.
+ *
+ * <p>Supported algorithms are {@code sentence} (one chunk per sentence), {@code token}
+ * (overlapping token windows) and {@code semantic} (topic boundaries from sentence
+ * embedding similarity, delegated to {@link SemanticChunker}).</p>
+ */
+public final class SegmentationChunker {
+
+ /** Exclusive-end document character offsets plus the sentences touched by the chunk. */
+ public record ChunkSegment(int start, int end, List<Integer> sentenceIndices) {
+ }
+
+ private SegmentationChunker() {
+ }
+
+ /**
+ * Segments an analyzed document according to the given chunking spec.
+ *
+ * @param rawText The document text the annotation offsets refer to.
+ * @param document The analyzed document. Sentence spans are required; token
+ * spans are additionally required for the {@code token} algorithm.
+ * @param spec The chunking spec. The algorithm must be set.
+ * @param embeddingProvider The provider used for semantic chunking. Must not be
+ * {@code null}; it is not consulted for other algorithms.
+ *
+ * @return The chunk segments in document order. Never {@code null}.
+ *
+ * @throws AnalysisException If the spec is invalid or names an unknown algorithm.
+ */
+ public static List<ChunkSegment> segment(
+ String rawText,
+ OpenNlpDocument document,
+ ChunkingSpec spec,
+ EmbeddingProvider embeddingProvider) {
+ if (spec.getAlgorithm().isBlank()) {
+ throw AnalysisException.invalidArgument("chunking.algorithm is required");
+ }
+ if (isSemantic(spec)) {
+ final String modelId = requireSemanticModelId(spec, embeddingProvider);
+ return SemanticChunker.chunk(
+ rawText, document, spec.getSemanticConfig(), embeddingProvider, modelId);
+ }
+ return switch (spec.getAlgorithm()) {
+ case "sentence" -> sentenceChunks(document);
+ case "token" -> tokenWindowChunks(document, spec);
+ default -> throw AnalysisException.unimplemented(
+ "chunking algorithm '" + spec.getAlgorithm() + "' is not implemented");
+ };
+ }
+
+ private static boolean isSemantic(ChunkingSpec spec) {
+ return "semantic".equals(spec.getAlgorithm()) || spec.hasSemanticConfig();
+ }
+
+ private static String requireSemanticModelId(ChunkingSpec spec, EmbeddingProvider provider) {
+ if (!spec.hasSemanticConfig()) {
+ throw AnalysisException.invalidArgument("chunking.semantic_config is required for semantic chunking");
+ }
+ final var semantic = spec.getSemanticConfig();
+ final String requested = semantic.hasSemanticEmbeddingModelId()
+ ? semantic.getSemanticEmbeddingModelId() : null;
+ final String modelId = provider.resolveModelId(requested);
+ if (modelId == null || modelId.isBlank()) {
+ throw AnalysisException.invalidArgument(
+ "semantic chunking requires semantic_embedding_model_id or exactly one registered embedding model");
+ }
+ if (!provider.supportsModel(modelId)) {
+ throw AnalysisException.notFound("Unknown semantic embedding model '" + modelId + "'");
+ }
+ return modelId;
+ }
+
+ private static List<ChunkSegment> sentenceChunks(OpenNlpDocument document) {
+ final List<ChunkSegment> chunks = new ArrayList<>();
+ for (int i = 0; i < document.getSentencesCount(); i++) {
+ final AnnotationSpan span = document.getSentences(i).getSentenceSpan();
+ chunks.add(new ChunkSegment(span.getStart(), span.getEnd(), List.of(i)));
+ }
+ return chunks;
+ }
+
+ private static List<ChunkSegment> tokenWindowChunks(OpenNlpDocument document, ChunkingSpec spec) {
+ final int chunkSize = spec.getChunkSize();
+ final int chunkOverlap = spec.getChunkOverlap();
+ if (chunkSize <= 0) {
+ throw AnalysisException.invalidArgument("chunking.chunk_size must be positive for token windows");
+ }
+ if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
+ throw AnalysisException.invalidArgument(
+ "chunking.chunk_overlap must be >= 0 and < chunk_size");
+ }
+
+ final List<FlatToken> flatTokens = flattenTokens(document);
+ if (flatTokens.isEmpty()) {
+ return List.of();
+ }
+
+ final int step = Math.max(1, chunkSize - chunkOverlap);
+ final List<ChunkSegment> chunks = new ArrayList<>();
+ for (int startToken = 0; startToken < flatTokens.size(); startToken += step) {
+ final int endToken = Math.min(startToken + chunkSize, flatTokens.size()) - 1;
+ final FlatToken first = flatTokens.get(startToken);
+ final FlatToken last = flatTokens.get(endToken);
+ chunks.add(new ChunkSegment(
+ first.start(),
+ last.end(),
+ sentenceIndices(flatTokens, startToken, endToken)));
+ if (endToken == flatTokens.size() - 1) {
+ break;
+ }
+ }
+ return chunks;
+ }
+
+ private static List<FlatToken> flattenTokens(OpenNlpDocument document) {
+ final List<FlatToken> tokens = new ArrayList<>();
+ for (int sentenceIndex = 0; sentenceIndex < document.getSentencesCount(); sentenceIndex++) {
+ for (Token token : document.getSentences(sentenceIndex).getTokensList()) {
+ final AnnotationSpan span = token.getAnnotationSpan();
+ tokens.add(new FlatToken(span.getStart(), span.getEnd(), sentenceIndex));
+ }
+ }
+ return tokens;
+ }
+
+ private static List<Integer> sentenceIndices(
+ List<FlatToken> flatTokens, int startToken, int endToken) {
+ final Set<Integer> indices = new LinkedHashSet<>();
+ for (int i = startToken; i <= endToken; i++) {
+ indices.add(flatTokens.get(i).sentenceIndex());
+ }
+ return List.copyOf(indices);
+ }
+
+ private record FlatToken(int start, int end, int sentenceIndex) {
+ }
+}
diff --git a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/chunk/SemanticChunker.java b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/chunk/SemanticChunker.java
new file mode 100644
index 00000000..8f2fa8c9
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/chunk/SemanticChunker.java
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.chunk;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.opennlp.grpc.embedding.EmbeddingProvider;
+import org.apache.opennlp.grpc.processor.AnalysisException;
+import org.apache.opennlp.grpc.v1.AnnotatedSentence;
+import org.apache.opennlp.grpc.v1.AnnotationSpan;
+import org.apache.opennlp.grpc.v1.OpenNlpDocument;
+import org.apache.opennlp.grpc.v1.SemanticChunkingConfig;
+
+/**
+ * Topic-boundary chunking using consecutive sentence embedding similarity.
+ *
+ * <p>Every sentence is embedded individually and a chunk boundary is placed wherever the
+ * cosine similarity of two consecutive sentences falls below the threshold. The threshold
+ * is, in order of precedence, the {@code percentile_threshold} over the observed
+ * similarities, the explicit {@code similarity_threshold}, or
+ * {@value #DEFAULT_SIMILARITY_THRESHOLD}.</p>
+ *
+ * <p>Size constraints are applied after boundary detection: chunks smaller than
+ * {@code min_chunk_sentences} are merged first, then chunks larger than
+ * {@code max_chunk_sentences} are split. The maximum therefore always holds, while the
+ * minimum may be violated by a split remainder.</p>
+ */
+public final class SemanticChunker {
+
+ static final float DEFAULT_SIMILARITY_THRESHOLD = 0.5f;
+
+ private SemanticChunker() {
+ }
+
+ /**
+ * Chunks the analyzed document at semantic topic boundaries.
+ *
+ * @param rawText The document text the sentence spans refer to.
+ * @param document The analyzed document. Sentence spans are required.
+ * @param config The semantic chunking configuration.
+ * @param embeddingProvider The provider used to embed each sentence.
+ * @param modelId The id of a registered embedding model.
+ *
+ * @return The chunk segments in document order. Never {@code null}.
+ *
+ * @throws AnalysisException If the configuration is invalid or embedding fails.
+ */
+ public static List<SegmentationChunker.ChunkSegment> chunk(
+ String rawText,
+ OpenNlpDocument document,
+ SemanticChunkingConfig config,
+ EmbeddingProvider embeddingProvider,
+ String modelId) {
+ if (document.getSentencesCount() == 0) {
+ return List.of();
+ }
+ if (document.getSentencesCount() == 1) {
+ final AnnotationSpan span = document.getSentences(0).getSentenceSpan();
+ return List.of(new SegmentationChunker.ChunkSegment(span.getStart(), span.getEnd(), List.of(0)));
+ }
+
+ final int sentenceCount = document.getSentencesCount();
+ final float[][] embeddings = new float[sentenceCount][];
+ for (int i = 0; i < sentenceCount; i++) {
+ final AnnotationSpan span = document.getSentences(i).getSentenceSpan();
+ final String sentenceText = rawText.substring(span.getStart(), span.getEnd());
+ embeddings[i] = embeddingProvider.embed(modelId, sentenceText);
+ }
+
+ final float[] similarities = new float[sentenceCount - 1];
+ for (int i = 0; i < similarities.length; i++) {
+ similarities[i] = cosineSimilarity(embeddings[i], embeddings[i + 1]);
+ }
+
+ final float threshold = resolveThreshold(config, similarities);
+ final int minSentences = config.getMinChunkSentences() > 0 ? config.getMinChunkSentences() : 1;
+ final int maxSentences =
+ config.getMaxChunkSentences() > 0 ? config.getMaxChunkSentences() : Integer.MAX_VALUE;
+
+ final List<Integer> starts = new ArrayList<>();
+ starts.add(0);
+ for (int i = 0; i < similarities.length; i++) {
+ if (similarities[i] < threshold) {
+ starts.add(i + 1);
+ }
+ }
+
+ mergeSmallChunks(starts, minSentences, sentenceCount);
+ splitLargeChunks(starts, maxSentences, sentenceCount);
+
+ final List<SegmentationChunker.ChunkSegment> chunks = new ArrayList<>();
+ for (int i = 0; i < starts.size(); i++) {
+ final int startSentence = starts.get(i);
+ final int endSentence = i + 1 < starts.size() ? starts.get(i + 1) - 1 : sentenceCount - 1;
+ chunks.add(toSegment(rawText, document, startSentence, endSentence));
+ }
+ return chunks;
+ }
+
+ private static float resolveThreshold(SemanticChunkingConfig config, float[] similarities) {
+ if (config.getPercentileThreshold() > 0) {
+ if (config.getPercentileThreshold() >= 100) {
+ throw AnalysisException.invalidArgument("semantic_config.percentile_threshold must be < 100");
+ }
+ return percentile(similarities, config.getPercentileThreshold());
+ }
+ if (config.getSimilarityThreshold() > 0f) {
+ return config.getSimilarityThreshold();
+ }
+ return DEFAULT_SIMILARITY_THRESHOLD;
+ }
+
+ private static float percentile(float[] values, int percentile) {
+ final float[] sorted = values.clone();
+ Arrays.sort(sorted);
+ final int index = Math.max(0, Math.min(sorted.length - 1,
+ (int) Math.ceil(percentile / 100.0 * sorted.length) - 1));
+ return sorted[index];
+ }
+
+ /**
+ * Merges chunks smaller than {@code minSentences} into a neighbour. An undersized chunk
+ * absorbs the following chunk; an undersized final chunk is absorbed by the preceding
+ * one. A single chunk covering the whole document is never merged away, so documents
+ * with fewer than {@code minSentences} sentences yield one chunk.
+ */
+ private static void mergeSmallChunks(List<Integer> starts, int minSentences,
int sentenceCount) {
+ if (minSentences <= 1) {
+ return;
+ }
+ int index = 0;
+ while (index < starts.size()) {
+ final int chunkStart = starts.get(index);
+ final int chunkEnd = index + 1 < starts.size() ? starts.get(index + 1) -
1 : sentenceCount - 1;
+ if (chunkEnd - chunkStart + 1 >= minSentences) {
+ index++;
+ } else if (index + 1 < starts.size()) {
+ // Absorb the following chunk, then re-check the grown chunk at the same index.
+ starts.remove(index + 1);
+ } else if (index > 0) {
+ // Undersized final chunk: absorb it into the preceding chunk.
+ starts.remove(index);
+ } else {
+ break;
+ }
+ }
+ }
+
+ /**
+ * Splits chunks larger than {@code maxSentences} into consecutive windows of at most
+ * {@code maxSentences} sentences.
+ */
+ private static void splitLargeChunks(List<Integer> starts, int maxSentences,
int sentenceCount) {
+ int index = 0;
+ while (index < starts.size()) {
+ final int chunkStart = starts.get(index);
+ final int chunkEnd = index + 1 < starts.size() ? starts.get(index + 1) -
1 : sentenceCount - 1;
+ final int size = chunkEnd - chunkStart + 1;
+ if (size <= maxSentences) {
+ index++;
+ continue;
+ }
+ int splitAt = chunkStart + maxSentences;
+ starts.add(index + 1, splitAt);
+ index++;
+ }
+ }
+
+ private static SegmentationChunker.ChunkSegment toSegment(
+ String rawText,
+ OpenNlpDocument document,
+ int startSentence,
+ int endSentence) {
+ final AnnotatedSentence first = document.getSentences(startSentence);
+ final AnnotatedSentence last = document.getSentences(endSentence);
+ final int start = first.getSentenceSpan().getStart();
+ final int end = last.getSentenceSpan().getEnd();
+ final List<Integer> sentenceIndices = new ArrayList<>();
+ for (int i = startSentence; i <= endSentence; i++) {
+ sentenceIndices.add(i);
+ }
+ return new SegmentationChunker.ChunkSegment(start, end,
List.copyOf(sentenceIndices));
+ }
+
+ static float cosineSimilarity(float[] left, float[] right) {
+ if (left.length != right.length) {
+ throw AnalysisException.invalidArgument("Embedding dimension mismatch during semantic chunking");
+ }
+ double dot = 0;
+ double leftNorm = 0;
+ double rightNorm = 0;
+ for (int i = 0; i < left.length; i++) {
+ dot += (double) left[i] * right[i];
+ leftNorm += (double) left[i] * left[i];
+ rightNorm += (double) right[i] * right[i];
+ }
+ if (leftNorm == 0 || rightNorm == 0) {
+ return 0f;
+ }
+ return (float) (dot / (Math.sqrt(leftNorm) * Math.sqrt(rightNorm)));
+ }
+}
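
The boundary placement in SemanticChunker above can be illustrated in isolation. The sketch below is a hypothetical, standalone class (SemanticBoundarySketch and its method names are not part of the commit); it assumes the same nearest-rank percentile and the same strict "below threshold" comparison as the code in the diff, and it leaves out the min/max size adjustments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the commit.
final class SemanticBoundarySketch {

  private SemanticBoundarySketch() {
  }

  // Nearest-rank percentile over a copy of the similarities (input is not modified).
  static float percentile(float[] values, int percentile) {
    final float[] sorted = values.clone();
    Arrays.sort(sorted);
    final int rank = (int) Math.ceil(percentile / 100.0 * sorted.length) - 1;
    return sorted[Math.max(0, Math.min(sorted.length - 1, rank))];
  }

  // similarities[i] compares sentence i with sentence i + 1. A new chunk starts at
  // sentence i + 1 whenever that similarity is strictly below the threshold.
  static List<Integer> chunkStarts(float[] similarities, float threshold) {
    final List<Integer> starts = new ArrayList<>();
    starts.add(0);
    for (int i = 0; i < similarities.length; i++) {
      if (similarities[i] < threshold) {
        starts.add(i + 1);
      }
    }
    return starts;
  }
}
```

For the similarities {0.9, 0.2, 0.8, 0.1, 0.7} and a percentile of 50, the threshold is 0.7 and the chunk starts are sentences 0, 2 and 4. Because the comparison is strict, a gap whose similarity equals the threshold does not open a new chunk.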
diff --git
a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/AbstractOnnxEmbeddingProvider.java
b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/AbstractOnnxEmbeddingProvider.java
new file mode 100644
index 00000000..13fca998
--- /dev/null
+++
b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/AbstractOnnxEmbeddingProvider.java
@@ -0,0 +1,269 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.embedding;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Set;
+
+import ai.onnxruntime.OrtException;
+import org.apache.opennlp.grpc.processor.AnalysisException;
+import org.apache.opennlp.grpc.v1.InferenceBackend;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Base class for ONNX Runtime backed {@link EmbeddingProvider} implementations.
+ *
+ * <p>Embedding models are declared in the server configuration with one ONNX path and
+ * one vocabulary path per model id:</p>
+ *
+ * <pre>
+ * model.embedder.&lt;model-id&gt;.onnx.path=/models/minilm.onnx
+ * model.embedder.&lt;model-id&gt;.vocab.path=/models/minilm-vocab.txt
+ * model.embedder.default_id=&lt;model-id&gt; (optional; needed with multiple models)
+ * model.embedder.gpu_device_id=&lt;ordinal&gt; (CUDA backends only)
+ * </pre>
+ *
+ * <p>All configured models are loaded eagerly so that misconfiguration fails at server
+ * startup rather than on the first request. Subclasses only declare which
+ * {@link InferenceBackend} values they serve.</p>
+ */
+abstract class AbstractOnnxEmbeddingProvider implements EmbeddingProvider,
AutoCloseable {
+
+ private static final Logger logger =
LoggerFactory.getLogger(AbstractOnnxEmbeddingProvider.class);
+
+ private static final String KEY_PREFIX = "model.embedder.";
+ private static final String KEY_ONNX_SUFFIX = ".onnx.path";
+ private static final String KEY_VOCAB_SUFFIX = ".vocab.path";
+ private static final String KEY_DEFAULT_ID = "model.embedder.default_id";
+ private static final String KEY_GPU_DEVICE = "model.embedder.gpu_device_id";
+
+ private final Map<String, OnnxSentenceEmbedder> models;
+ private final String defaultModelId;
+
+ /**
+ * Loads all configured embedding models.
+ *
+ * @param configuration The server configuration. Must not be {@code null}.
+ * @param useCuda Whether models run on the CUDA execution provider.
+ *
+ * @throws AnalysisException If the configuration is inconsistent, a
referenced file is
+ * missing, or a model fails to load.
+ */
+ AbstractOnnxEmbeddingProvider(Map<String, String> configuration, boolean
useCuda) {
+ Objects.requireNonNull(configuration, "configuration must not be null");
+ final int gpuDeviceId = gpuDeviceId(configuration, useCuda);
+ this.models = loadModels(configuration, useCuda, gpuDeviceId);
+ this.defaultModelId = resolveDefaultModelId(configuration, models);
+ }
+
+ /**
+ * @return The {@link InferenceBackend} values this provider serves, in
addition to
+ * {@code UNSPECIFIED} and {@code OPENNLP_ME} which every provider
accepts.
+ */
+ abstract Set<InferenceBackend> supportedBackends();
+
+ @Override
+ public boolean isAvailable() {
+ return !models.isEmpty();
+ }
+
+ @Override
+ public Set<String> registeredModelIds() {
+ return models.keySet();
+ }
+
+ @Override
+ public boolean supportsModel(String modelId) {
+ return modelId != null && !modelId.isBlank() &&
models.containsKey(modelId);
+ }
+
+ @Override
+ public int embeddingDimension(String modelId) {
+ return requireModel(modelId).embeddingDimension();
+ }
+
+ @Override
+ public float[] embed(String modelId, String text) {
+ Objects.requireNonNull(text, "text must not be null");
+ final OnnxSentenceEmbedder embedder = requireModel(modelId);
+ try {
+ return embedder.embed(text);
+ } catch (OrtException e) {
+ throw AnalysisException.internal("Embedding inference failed for model '" + modelId + "'", e);
+ }
+ }
+
+ @Override
+ public String resolveModelId(String requestedModelId) {
+ if (requestedModelId != null && !requestedModelId.isBlank()) {
+ return requestedModelId;
+ }
+ if (defaultModelId != null) {
+ return defaultModelId;
+ }
+ return models.size() == 1 ? models.keySet().iterator().next() : null;
+ }
+
+ @Override
+ public boolean supportsInferenceBackend(InferenceBackend backend) {
+ return backend == InferenceBackend.INFERENCE_BACKEND_UNSPECIFIED
+ || backend == InferenceBackend.INFERENCE_BACKEND_OPENNLP_ME
+ || supportedBackends().contains(backend);
+ }
+
+ /**
+ * Closes all loaded ONNX sessions. Failures are logged and do not abort the
shutdown
+ * of the remaining models.
+ */
+ @Override
+ public void close() {
+ for (Map.Entry<String, OnnxSentenceEmbedder> entry : models.entrySet()) {
+ try {
+ entry.getValue().close();
+ } catch (OrtException e) {
+ logger.warn("Failed to close embedding model '{}'", entry.getKey(), e);
+ }
+ }
+ }
+
+ private OnnxSentenceEmbedder requireModel(String modelId) {
+ if (modelId == null || modelId.isBlank()) {
+ throw AnalysisException.invalidArgument("embedding model id is required");
+ }
+ final OnnxSentenceEmbedder embedder = models.get(modelId);
+ if (embedder == null) {
+ throw AnalysisException.notFound("Unknown embedding model '" + modelId +
"'");
+ }
+ return embedder;
+ }
+
+ private static int gpuDeviceId(Map<String, String> configuration, boolean
useCuda) {
+ final String configured = configuration.get(KEY_GPU_DEVICE);
+ if (configured == null || configured.isBlank()) {
+ return 0;
+ }
+ if (!useCuda) {
+ throw AnalysisException.invalidArgument(
+ KEY_GPU_DEVICE + " requires model.embedder.backend=cuda");
+ }
+ try {
+ return Integer.parseInt(configured.trim());
+ } catch (NumberFormatException e) {
+ throw AnalysisException.invalidArgument(
+ KEY_GPU_DEVICE + " must be an integer: " + configured);
+ }
+ }
+
+ private static Map<String, OnnxSentenceEmbedder> loadModels(
+ Map<String, String> configuration, boolean useCuda, int gpuDeviceId) {
+ final Map<String, String> onnxPaths = new HashMap<>();
+ final Map<String, String> vocabPaths = new HashMap<>();
+
+ for (Map.Entry<String, String> entry : configuration.entrySet()) {
+ final String key = entry.getKey();
+ if (!key.startsWith(KEY_PREFIX) || key.equals(KEY_DEFAULT_ID) ||
key.equals(KEY_GPU_DEVICE)) {
+ continue;
+ }
+ final String suffix;
+ if (key.endsWith(KEY_ONNX_SUFFIX)) {
+ suffix = KEY_ONNX_SUFFIX;
+ } else if (key.endsWith(KEY_VOCAB_SUFFIX)) {
+ suffix = KEY_VOCAB_SUFFIX;
+ } else {
+ continue;
+ }
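+ // Keys without a model id (e.g. "model.embedder.onnx.path") are too short; skip them.
+ final int idEnd = key.length() - suffix.length();
+ final String modelId = idEnd > KEY_PREFIX.length() ? key.substring(KEY_PREFIX.length(), idEnd) : "";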
+ final String path = entry.getValue();
+ if (modelId.isBlank() || path == null || path.isBlank()) {
+ continue;
+ }
+ if (suffix.equals(KEY_ONNX_SUFFIX)) {
+ onnxPaths.put(modelId, path);
+ } else {
+ vocabPaths.put(modelId, path);
+ }
+ }
+
+ final Map<String, OnnxSentenceEmbedder> loaded = new HashMap<>();
+ try {
+ for (Map.Entry<String, String> entry : onnxPaths.entrySet()) {
+ final String modelId = entry.getKey();
+ final String vocabPath = vocabPaths.get(modelId);
+ if (vocabPath == null) {
+ throw AnalysisException.invalidArgument(
+ KEY_PREFIX + modelId + KEY_VOCAB_SUFFIX
+ + " is required when an ONNX path is configured");
+ }
+ loaded.put(modelId, loadModel(modelId, entry.getValue(), vocabPath,
useCuda, gpuDeviceId));
+ }
+ } catch (RuntimeException e) {
+ for (OnnxSentenceEmbedder embedder : loaded.values()) {
+ try {
+ embedder.close();
+ } catch (OrtException closeFailure) {
+ e.addSuppressed(closeFailure);
+ }
+ }
+ throw e;
+ }
+ return Map.copyOf(loaded);
+ }
+
+ private static OnnxSentenceEmbedder loadModel(
+ String modelId, String onnxPath, String vocabPath, boolean useCuda, int
gpuDeviceId) {
+ final File onnxFile = new File(onnxPath);
+ final File vocabFile = new File(vocabPath);
+ if (!onnxFile.isFile()) {
+ throw AnalysisException.notFound(
+ "ONNX embedding model file not found for '" + modelId + "': " +
onnxFile.getAbsolutePath());
+ }
+ if (!vocabFile.isFile()) {
+ throw AnalysisException.notFound(
+ "Embedding vocabulary file not found for '" + modelId + "': " +
vocabFile.getAbsolutePath());
+ }
+ try {
+ final OnnxSentenceEmbedder embedder =
+ new OnnxSentenceEmbedder(onnxFile, vocabFile, useCuda, gpuDeviceId);
+ logger.info("Loaded embedding model '{}' (dimension={}, backend={})",
+ modelId, embedder.embeddingDimension(), useCuda ? "CUDA" : "ONNX Runtime CPU");
+ return embedder;
+ } catch (OrtException | IOException e) {
+ final String backend = useCuda ? "CUDA" : "ONNX Runtime CPU";
+ throw AnalysisException.internal(
+ "Failed to load embedding model '" + modelId + "' on " + backend, e);
+ }
+ }
+
+ private static String resolveDefaultModelId(
+ Map<String, String> configuration, Map<String, OnnxSentenceEmbedder>
models) {
+ final String configured = configuration.get(KEY_DEFAULT_ID);
+ if (configured != null && !configured.isBlank()) {
+ if (!models.containsKey(configured)) {
+ throw AnalysisException.notFound(
+ KEY_DEFAULT_ID + " '" + configured + "' is not registered");
+ }
+ return configured;
+ }
+ return models.size() == 1 ? models.keySet().iterator().next() : null;
+ }
+}
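
The key scheme documented in AbstractOnnxEmbeddingProvider (a shared prefix, a model id, then a fixed suffix) can be sketched on its own. The class below is hypothetical and not part of the commit; it handles only the ONNX path suffix and assumes that keys with no id between the prefix and the suffix should be ignored.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; names are not taken from the commit.
final class EmbedderKeySketch {

  private static final String PREFIX = "model.embedder.";
  private static final String ONNX_SUFFIX = ".onnx.path";

  private EmbedderKeySketch() {
  }

  // Returns model id -> ONNX path for keys of the form PREFIX + id + ONNX_SUFFIX.
  static Map<String, String> onnxPathsById(Map<String, String> configuration) {
    final Map<String, String> paths = new TreeMap<>();
    for (Map.Entry<String, String> entry : configuration.entrySet()) {
      final String key = entry.getKey();
      final int idEnd = key.length() - ONNX_SUFFIX.length();
      // idEnd > PREFIX.length() guarantees a non-empty id and a valid substring range,
      // even for a key such as "model.embedder.onnx.path" where prefix and suffix overlap.
      if (key.startsWith(PREFIX) && key.endsWith(ONNX_SUFFIX) && idEnd > PREFIX.length()) {
        paths.put(key.substring(PREFIX.length(), idEnd), entry.getValue());
      }
    }
    return paths;
  }
}
```

A configuration holding model.embedder.minilm.onnx.path=/models/minilm.onnx yields the single entry minilm -> /models/minilm.onnx; unrelated keys such as model.embedder.backend are skipped.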
diff --git
a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/CudaEmbeddingProvider.java
b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/CudaEmbeddingProvider.java
new file mode 100644
index 00000000..204db27c
--- /dev/null
+++
b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/CudaEmbeddingProvider.java
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.embedding;
+
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.opennlp.grpc.v1.InferenceBackend;
+
+/**
+ * ONNX Runtime embedding provider running on the CUDA execution provider.
+ *
+ * <p>Serves {@code INFERENCE_BACKEND_CUDA} and {@code INFERENCE_BACKEND_ONNX_RUNTIME_GPU}
+ * requests. Requires a server built with the {@code gpu} Maven profile, which replaces
+ * the {@code onnxruntime} jar with {@code onnxruntime_gpu}, and a CUDA capable device at
+ * runtime. The device is selected with {@code model.embedder.gpu_device_id}. See
+ * {@link AbstractOnnxEmbeddingProvider} for the model configuration keys.</p>
+ */
+public final class CudaEmbeddingProvider extends AbstractOnnxEmbeddingProvider
{
+
+ /**
+ * Loads all configured embedding models on the CUDA device.
+ *
+ * @param configuration The server configuration. Must not be {@code null}.
+ */
+ public CudaEmbeddingProvider(Map<String, String> configuration) {
+ super(configuration, true);
+ }
+
+ @Override
+ Set<InferenceBackend> supportedBackends() {
+ return Set.of(
+ InferenceBackend.INFERENCE_BACKEND_CUDA,
+ InferenceBackend.INFERENCE_BACKEND_ONNX_RUNTIME_GPU);
+ }
+}
diff --git
a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/EmbeddingProvider.java
b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/EmbeddingProvider.java
new file mode 100644
index 00000000..9e750191
--- /dev/null
+++
b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/EmbeddingProvider.java
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.embedding;
+
+import java.util.Set;
+
+import org.apache.opennlp.grpc.v1.InferenceBackend;
+
+/**
+ * Local embedding backend for the {@code PIPELINE_STEP_EMBED} pipeline step.
+ *
+ * <p>Implementations own their model lifecycle: models are registered at
construction
+ * time and identified by a stable model id. Implementations that hold native
resources
+ * should also implement {@link AutoCloseable}; the server closes such
providers on
+ * shutdown.</p>
+ */
+public interface EmbeddingProvider {
+
+ /**
+ * @return {@code true} when at least one embedding model is registered.
+ */
+ boolean isAvailable();
+
+ /**
+ * @return The ids of all registered embedding models. Never {@code null}.
+ */
+ Set<String> registeredModelIds();
+
+ /**
+ * @param modelId The model id to check. May be {@code null} or blank.
+ *
+ * @return {@code true} when the given id refers to a registered embedding
model.
+ */
+ boolean supportsModel(String modelId);
+
+ /**
+ * @param modelId The id of a registered embedding model.
+ *
+ * @return The dimension of the vectors produced by the model.
+ */
+ int embeddingDimension(String modelId);
+
+ /**
+ * Embeds the given text.
+ *
+ * @param modelId The id of a registered embedding model.
+ * @param text The text to embed. Must not be {@code null}.
+ *
+ * @return The embedding vector of length {@link
#embeddingDimension(String)}.
+ */
+ float[] embed(String modelId, String text);
+
+ /**
+ * Resolves the effective model id from an optional client override.
+ *
+ * @param requestedModelId The model id requested by the client. May be
{@code null}
+ * or blank when the client wants the server default.
+ *
+ * @return The model id to use, or {@code null} when no default can be
determined.
+ */
+ default String resolveModelId(String requestedModelId) {
+ if (requestedModelId != null && !requestedModelId.isBlank()) {
+ return requestedModelId;
+ }
+ if (registeredModelIds().size() == 1) {
+ return registeredModelIds().iterator().next();
+ }
+ return null;
+ }
+
+ /**
+ * @param backend The inference backend requested by the client.
+ *
+ * @return {@code true} when the provider can serve the requested inference
backend.
+ * {@code UNSPECIFIED} and {@code OPENNLP_ME} are always accepted
because they
+ * do not constrain the embedding backend.
+ */
+ default boolean supportsInferenceBackend(InferenceBackend backend) {
+ return backend == InferenceBackend.INFERENCE_BACKEND_UNSPECIFIED
+ || backend == InferenceBackend.INFERENCE_BACKEND_OPENNLP_ME
+ || backend == InferenceBackend.INFERENCE_BACKEND_ONNX_RUNTIME;
+ }
+}
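
The default model id resolution in EmbeddingProvider can be read as a small pure function. The sketch below is hypothetical (ModelIdResolutionSketch is not part of the commit) and mirrors only the interface default: an explicit request wins, otherwise the single registered model is used, otherwise nothing is resolved.

```java
import java.util.Set;

// Hypothetical illustration only; not part of the commit.
final class ModelIdResolutionSketch {

  private ModelIdResolutionSketch() {
  }

  // Returns the requested id when present, the only registered id when exactly one
  // model exists, and null when the choice would be ambiguous or no model exists.
  static String resolve(String requestedModelId, Set<String> registeredModelIds) {
    if (requestedModelId != null && !requestedModelId.isBlank()) {
      return requestedModelId;
    }
    return registeredModelIds.size() == 1 ? registeredModelIds.iterator().next() : null;
  }
}
```

Note that a requested id is returned as-is, without checking that it is registered; in this reading, validation is left to a later lookup such as supportsModel.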
diff --git
a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/EmbeddingProviderFactory.java
b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/EmbeddingProviderFactory.java
new file mode 100644
index 00000000..98fe9195
--- /dev/null
+++
b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/EmbeddingProviderFactory.java
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.embedding;
+
+import java.util.Locale;
+import java.util.Map;
+
+import org.apache.opennlp.grpc.processor.AnalysisException;
+
+/**
+ * Creates the configured {@link EmbeddingProvider} for the gRPC server.
+ *
+ * <p>The backend is selected with the {@code model.embedder.backend} configuration key.
+ * Supported values are {@value #BACKEND_ONNX} (the default, ONNX Runtime on CPU) and
+ * {@value #BACKEND_CUDA} (ONNX Runtime with the CUDA execution provider; requires a
+ * server built with the {@code gpu} Maven profile). Any other value is rejected.</p>
+ */
+public final class EmbeddingProviderFactory {
+
+ static final String KEY_BACKEND = "model.embedder.backend";
+ static final String BACKEND_ONNX = "onnx";
+ static final String BACKEND_CUDA = "cuda";
+
+ private EmbeddingProviderFactory() {
+ }
+
+ /**
+ * Creates the embedding provider declared by the server configuration.
+ *
+ * @param configuration The server configuration. Must not be {@code null}.
+ *
+ * @return The configured provider. Never {@code null}.
+ *
+ * @throws AnalysisException If the configured backend is unknown or the
provider's
+ * model configuration is invalid.
+ */
+ public static EmbeddingProvider create(Map<String, String> configuration) {
+ final String backend =
+ configuration.getOrDefault(KEY_BACKEND,
BACKEND_ONNX).trim().toLowerCase(Locale.ROOT);
+ return switch (backend) {
+ case BACKEND_ONNX -> new OnnxRuntimeEmbeddingProvider(configuration);
+ case BACKEND_CUDA -> new CudaEmbeddingProvider(configuration);
+ default -> throw AnalysisException.invalidArgument(
+ KEY_BACKEND + " '" + backend + "' is not supported; expected one of: "
+ + BACKEND_ONNX + ", " + BACKEND_CUDA);
+ };
+ }
+}
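
The backend selection in EmbeddingProviderFactory comes down to normalizing one configuration value and accepting exactly two results. The sketch below is hypothetical (BackendSelectionSketch is not part of the commit); it repeats the key name and the two accepted values from the diff and stops short of constructing any provider.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of the commit.
final class BackendSelectionSketch {

  private BackendSelectionSketch() {
  }

  // Missing key -> "onnx"; otherwise the value is trimmed and lower-cased.
  static String normalizedBackend(Map<String, String> configuration) {
    return configuration.getOrDefault("model.embedder.backend", "onnx")
        .trim()
        .toLowerCase(Locale.ROOT);
  }

  // Strict check: anything other than the two known backends is rejected.
  static boolean isSupported(String backend) {
    return backend.equals("onnx") || backend.equals("cuda");
  }
}
```

Under this reading, a value such as " CUDA " selects the CUDA provider, an absent key selects the CPU provider, and a value such as "tensorrt" or an empty string is rejected rather than silently falling back.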
diff --git
a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/OnnxRuntimeEmbeddingProvider.java
b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/OnnxRuntimeEmbeddingProvider.java
new file mode 100644
index 00000000..ce2f6375
--- /dev/null
+++
b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/OnnxRuntimeEmbeddingProvider.java
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.embedding;
+
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.opennlp.grpc.v1.InferenceBackend;
+
+/**
+ * ONNX Runtime embedding provider running on the CPU execution provider.
+ *
+ * <p>Serves {@code INFERENCE_BACKEND_ONNX_RUNTIME} requests. See
+ * {@link AbstractOnnxEmbeddingProvider} for the model configuration keys.</p>
+ */
+public final class OnnxRuntimeEmbeddingProvider extends
AbstractOnnxEmbeddingProvider {
+
+ /**
+ * Loads all configured embedding models on the CPU.
+ *
+ * @param configuration The server configuration. Must not be {@code null}.
+ */
+ public OnnxRuntimeEmbeddingProvider(Map<String, String> configuration) {
+ super(configuration, false);
+ }
+
+ @Override
+ Set<InferenceBackend> supportedBackends() {
+ return Set.of(InferenceBackend.INFERENCE_BACKEND_ONNX_RUNTIME);
+ }
+}
diff --git
a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/OnnxSentenceEmbedder.java
b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/OnnxSentenceEmbedder.java
new file mode 100644
index 00000000..f6710539
--- /dev/null
+++
b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/embedding/OnnxSentenceEmbedder.java
@@ -0,0 +1,211 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.embedding;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.LongBuffer;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Set;
+
+import ai.onnxruntime.NodeInfo;
+import ai.onnxruntime.OnnxTensor;
+import ai.onnxruntime.OrtEnvironment;
+import ai.onnxruntime.OrtException;
+import ai.onnxruntime.OrtSession;
+import ai.onnxruntime.TensorInfo;
+import opennlp.dl.AbstractDL;
+import opennlp.tools.tokenize.WordpieceTokenizer;
+
+/**
+ * Computes sentence embeddings with a BERT-style ONNX model and a wordpiece vocabulary.
+ *
+ * <p>This embedder is the inference core behind {@link AbstractOnnxEmbeddingProvider}. It
+ * reuses the vocabulary loading and wordpiece tokenizer selection of {@link AbstractDL}
+ * (BERT or RoBERTa special tokens, chosen from the vocabulary contents) and adds the
+ * pieces {@code opennlp-dl}'s {@code SentenceVectorsDL} does not offer: an optional CUDA
+ * execution provider, session-metadata based dimension discovery and deterministic native
+ * resource management.</p>
+ *
+ * <p>Model input conventions follow the standard single-segment BERT encoding:
+ * {@code attention_mask} is {@code 1} for every real token and {@code token_type_ids}
+ * is {@code 0} throughout. Inputs the model does not declare (many sentence-transformers
+ * exports omit {@code token_type_ids}) are not sent. The embedding is the hidden state of
+ * the leading classification token ({@code [CLS]} / {@code <s>}).</p>
+ *
+ * <p>Token sequences are truncated to {@link #MAX_SEQUENCE_TOKENS} wordpieces (the
+ * trailing separator token is preserved) so that inputs never exceed the positional range
+ * of BERT-style encoders.</p>
+ */
+final class OnnxSentenceEmbedder extends AbstractDL {
+
+ /** Maximum wordpiece sequence length accepted by BERT-style encoders. */
+ static final int MAX_SEQUENCE_TOKENS = 512;
+
+ private final Set<String> declaredInputs;
+ private final long unknownTokenId;
+ private final int embeddingDimension;
+
+ /**
+ * Loads the ONNX model and vocabulary and prepares an inference session.
+ *
+ * @param model The ONNX model file. Must exist.
+ * @param vocabulary The wordpiece vocabulary file matching the model. Must
exist.
+ * @param useCuda Whether to register the CUDA execution provider.
+ * @param gpuDeviceId The CUDA device ordinal; ignored when {@code useCuda}
is {@code false}.
+ *
+ * @throws OrtException If the ONNX session cannot be created or the model
does not
+ * declare a static embedding dimension.
+ * @throws IOException If the vocabulary cannot be read or lacks the
special tokens
+ * required by the wordpiece tokenizer.
+ */
+ OnnxSentenceEmbedder(File model, File vocabulary, boolean useCuda, int
gpuDeviceId)
+ throws OrtException, IOException {
+ env = OrtEnvironment.getEnvironment();
+ try (OrtSession.SessionOptions sessionOptions = new
OrtSession.SessionOptions()) {
+ if (useCuda) {
+ sessionOptions.addCUDA(gpuDeviceId);
+ }
+ session = env.createSession(model.getPath(), sessionOptions);
+ }
+ try {
+ vocab = loadVocab(vocabulary);
+ tokenizer = createTokenizer(vocab);
+ unknownTokenId = requireSpecialTokens(vocab);
+ declaredInputs = Set.copyOf(session.getInputNames());
+ embeddingDimension = readEmbeddingDimension(session, model);
+ } catch (OrtException | IOException | RuntimeException e) {
+ try {
+ session.close();
+ } catch (OrtException closeFailure) {
+ e.addSuppressed(closeFailure);
+ }
+ throw e;
+ }
+ }
+
+ /**
+ * @return The embedding dimension declared by the model's output metadata.
+ */
+ int embeddingDimension() {
+ return embeddingDimension;
+ }
+
+ /**
+ * Embeds the given text.
+ *
+ * @param text The text to embed. Must not be {@code null}.
+ *
+ * @return The embedding vector of length {@link #embeddingDimension()}.
+ *
+ * @throws OrtException If inference fails.
+ */
+ float[] embed(String text) throws OrtException {
+ final long[] ids = tokenIds(text);
+ final long[] mask = new long[ids.length];
+ Arrays.fill(mask, 1);
+ final long[] types = new long[ids.length];
+ final long[] shape = {1, ids.length};
+
+ final Map<String, OnnxTensor> inputs = new HashMap<>();
+ try {
+ inputs.put(INPUT_IDS, OnnxTensor.createTensor(env, LongBuffer.wrap(ids),
shape));
+ if (declaredInputs.contains(ATTENTION_MASK)) {
+ inputs.put(ATTENTION_MASK, OnnxTensor.createTensor(env,
LongBuffer.wrap(mask), shape));
+ }
+ if (declaredInputs.contains(TOKEN_TYPE_IDS)) {
+ inputs.put(TOKEN_TYPE_IDS, OnnxTensor.createTensor(env,
LongBuffer.wrap(types), shape));
+ }
+ try (OrtSession.Result result = session.run(inputs)) {
+ // getValue() copies the tensor into Java arrays, so the result can be closed safely.
+ final float[][][] hiddenStates = (float[][][])
result.get(0).getValue();
+ return hiddenStates[0][0];
+ }
+ } finally {
+ inputs.values().forEach(OnnxTensor::close);
+ }
+ }
+
+ /**
+ * Closes the inference session. The shared {@link OrtEnvironment} singleton
is left
+ * open intentionally because other models may still be using it.
+ */
+ @Override
+ public void close() throws OrtException {
+ session.close();
+ }
+
+ private long[] tokenIds(String text) {
+ String[] tokens = tokenizer.tokenize(text);
+ if (tokens.length > MAX_SEQUENCE_TOKENS) {
+ final String separator = tokens[tokens.length - 1];
+ tokens = Arrays.copyOf(tokens, MAX_SEQUENCE_TOKENS);
+ tokens[MAX_SEQUENCE_TOKENS - 1] = separator;
+ }
+ final long[] ids = new long[tokens.length];
+ for (int i = 0; i < tokens.length; i++) {
+ final Integer id = vocab.get(tokens[i]);
+ ids[i] = id != null ? id : unknownTokenId;
+ }
+ return ids;
+ }
+
+ /**
+ * Verifies that the special tokens selected by {@link
AbstractDL#createTokenizer(Map)}
+ * are present in the vocabulary, so that every tokenizer output can be
mapped to an id.
+ *
+ * @return The id of the unknown token.
+ */
+ private static long requireSpecialTokens(Map<String, Integer> vocab) throws IOException {
+ final boolean roberta = vocab.containsKey(WordpieceTokenizer.ROBERTA_CLS_TOKEN);
+ final String cls = roberta
+ ? WordpieceTokenizer.ROBERTA_CLS_TOKEN : WordpieceTokenizer.BERT_CLS_TOKEN;
+ final String sep = roberta
+ ? WordpieceTokenizer.ROBERTA_SEP_TOKEN : WordpieceTokenizer.BERT_SEP_TOKEN;
+ final String unk = roberta
+ ? WordpieceTokenizer.ROBERTA_UNK_TOKEN : WordpieceTokenizer.BERT_UNK_TOKEN;
+ for (String token : new String[] {cls, sep, unk}) {
+ if (!vocab.containsKey(token)) {
+ throw new IOException("Embedding vocabulary does not define the special token '"
+ + token + "'; the vocabulary file does not match the model");
+ }
+ }
+ return vocab.get(unk);
+ }
+
+ /**
+ * Reads the embedding dimension from the last axis of the model's first output tensor.
+ */
+ private static int readEmbeddingDimension(OrtSession session, File model) throws OrtException {
+ final NodeInfo output = session.getOutputInfo().values().iterator().next();
+ if (!(output.getInfo() instanceof TensorInfo tensorInfo)) {
+ throw new OrtException("Embedding model output '" + output.getName()
+ + "' of " + model.getName() + " is not a tensor");
+ }
+ final long[] shape = tensorInfo.getShape();
+ final long dimension = shape.length > 0 ? shape[shape.length - 1] : -1;
+ if (dimension <= 0 || dimension > Integer.MAX_VALUE) {
+ throw new OrtException("Embedding model " + model.getName()
+ + " does not declare a static embedding dimension (output shape: "
+ + Arrays.toString(shape) + ")");
+ }
+ return (int) dimension;
+ }
+}
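The truncation rule in tokenIds can be read in isolation. The sketch below is a minimal illustration, not code from this commit: SequenceTruncation and truncateKeepingSeparator are hypothetical names, and it assumes the tokenizer emits the separator as its final token, as the variable name in tokenIds suggests. It caps the sequence at a limit and overwrites the last retained slot with that separator.

```java
import java.util.Arrays;

/** Hypothetical helper that mirrors the truncation rule used by tokenIds. */
final class SequenceTruncation {

  private SequenceTruncation() {
  }

  /**
   * Caps a token sequence at maxTokens. When truncation happens, the final
   * token of the input (assumed to be the separator) is written into the
   * last retained slot so the sequence still ends with it.
   */
  static String[] truncateKeepingSeparator(String[] tokens, int maxTokens) {
    if (maxTokens < 1) {
      throw new IllegalArgumentException("maxTokens must be positive");
    }
    if (tokens.length <= maxTokens) {
      return tokens; // short enough, nothing to do
    }
    final String separator = tokens[tokens.length - 1];
    final String[] truncated = Arrays.copyOf(tokens, maxTokens);
    truncated[maxTokens - 1] = separator;
    return truncated;
  }
}
```

With a limit of 4, the input [CLS] a b c [SEP] becomes [CLS] a b [SEP]: one content token is dropped so that the separator still closes the sequence.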
diff --git a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/model/ModelBundleCache.java b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/model/ModelBundleCache.java
index 0b48e1f7..64169ab9 100644
--- a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/model/ModelBundleCache.java
+++ b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/model/ModelBundleCache.java
@@ -33,12 +33,16 @@ import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
+import org.apache.opennlp.grpc.embedding.EmbeddingProvider;
+import org.apache.opennlp.grpc.embedding.EmbeddingProviderFactory;
import org.apache.opennlp.grpc.profile.ProfileRegistry;
import org.apache.opennlp.grpc.processor.AnalysisException;
import org.apache.opennlp.grpc.v1.ComponentType;
import org.apache.opennlp.grpc.v1.ModelBundleInfo;
import org.apache.opennlp.grpc.v1.ModelDescriptor;
import org.apache.opennlp.grpc.v1.PipelineStep;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
/**
* Loads shared thread-safe {@code *ME} singletons once at startup.
@@ -48,6 +52,8 @@ import org.apache.opennlp.grpc.v1.PipelineStep;
*/
public final class ModelBundleCache {
+ private static final Logger logger = LoggerFactory.getLogger(ModelBundleCache.class);
+
private static final String DEFAULT_LANGUAGE = "en";
private static final String KEY_SENTDETECT_PATH = "model.sentence_detector.path";
private static final String KEY_TOKENIZER_PATH = "model.tokenizer.path";
@@ -56,12 +62,14 @@ public final class ModelBundleCache {
private final Map<String, ModelBundleInfo> bundles;
private final SentenceDetectorME sentenceDetector;
private final TokenizerME tokenizer;
+ private final EmbeddingProvider embeddingProvider;
public ModelBundleCache(Map<String, String> configuration) {
Objects.requireNonNull(configuration, "configuration");
this.modelProvider = new DefaultClassPathModelProvider();
this.sentenceDetector = loadSentenceDetector(configuration);
this.tokenizer = loadTokenizer(configuration);
+ this.embeddingProvider = EmbeddingProviderFactory.create(configuration);
this.bundles = buildBundleCatalog();
}
@@ -77,6 +85,24 @@ public final class ModelBundleCache {
return new ArrayList<>(bundles.values());
}
+ public EmbeddingProvider getEmbeddingProvider() {
+ return embeddingProvider;
+ }
+
+ /**
+ * Releases resources held by the embedding provider. Failures are logged so that the
+ * remaining server shutdown is not interrupted.
+ */
+ public void close() {
+ if (embeddingProvider instanceof AutoCloseable closeable) {
+ try {
+ closeable.close();
+ } catch (Exception e) {
+ logger.warn("Failed to close embedding provider", e);
+ }
+ }
+ }
+
private SentenceDetectorME loadSentenceDetector(Map<String, String> configuration) {
try {
final String configuredPath = configuration.get(KEY_SENTDETECT_PATH);
@@ -120,8 +146,7 @@ public final class ModelBundleCache {
}
private Map<String, ModelBundleInfo> buildBundleCatalog() {
- final Map<String, ModelBundleInfo> catalog = new HashMap<>();
- catalog.put(ProfileRegistry.DEFAULT_BUNDLE_ID, ModelBundleInfo.newBuilder()
+ final ModelBundleInfo.Builder bundle = ModelBundleInfo.newBuilder()
.setBundleId(ProfileRegistry.DEFAULT_BUNDLE_ID)
.addSupportedLanguages(DEFAULT_LANGUAGE)
.addSupportedSteps(PipelineStep.PIPELINE_STEP_SENTENCE_DETECT)
@@ -137,8 +162,21 @@ public final class ModelBundleCache {
.setLocale(DEFAULT_LANGUAGE)
.setComponentType(ComponentType.COMPONENT_TYPE_TOKENIZER)
.addLanguages(DEFAULT_LANGUAGE)
- .build())
- .build());
+ .build());
+ if (embeddingProvider.isAvailable()) {
+ bundle.addSupportedSteps(PipelineStep.PIPELINE_STEP_EMBED);
+ for (String modelId : embeddingProvider.registeredModelIds()) {
+ bundle.addModels(ModelDescriptor.newBuilder()
+ .setName(modelId)
+ .setLocale(DEFAULT_LANGUAGE)
+ .setComponentType(ComponentType.COMPONENT_TYPE_EMBEDDER)
+ .addLanguages(DEFAULT_LANGUAGE)
+ .setEmbeddingDimension(embeddingProvider.embeddingDimension(modelId))
+ .build());
+ }
+ }
+ final Map<String, ModelBundleInfo> catalog = new HashMap<>();
+ catalog.put(ProfileRegistry.DEFAULT_BUNDLE_ID, bundle.build());
return catalog;
}
}
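The close() method added to ModelBundleCache follows a "close if closeable, log on failure" pattern so that shutdown continues even when a native resource fails to release. A minimal sketch of that pattern follows. QuietClose and closeQuietly are hypothetical names, and System.err stands in for the SLF4J logger used in the commit.

```java
/** Hypothetical helper illustrating the best-effort close used during shutdown. */
final class QuietClose {

  private QuietClose() {
  }

  /**
   * Closes the resource if it implements AutoCloseable.
   *
   * @return true only if close() was invoked and completed without an exception.
   */
  static boolean closeQuietly(Object resource) {
    if (resource instanceof AutoCloseable closeable) {
      try {
        closeable.close();
        return true;
      } catch (Exception e) {
        // Report and continue so the remaining shutdown work is not interrupted.
        System.err.println("Failed to close resource: " + e);
        return false;
      }
    }
    return false; // nothing to close
  }
}
```

The boolean return value is an addition for illustration only; the method in the commit returns void.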
diff --git a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzer.java b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzer.java
index 2df3f089..a957dcdb 100644
--- a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzer.java
+++ b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzer.java
@@ -18,13 +18,17 @@
package org.apache.opennlp.grpc.processor;
import java.util.ArrayList;
+import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
+import java.util.Set;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.util.Span;
+import org.apache.opennlp.grpc.chunk.ChunkEmbedProcessor;
+import org.apache.opennlp.grpc.embedding.EmbeddingProvider;
import org.apache.opennlp.grpc.model.ModelBundleCache;
import org.apache.opennlp.grpc.profile.ProfileRegistry;
import org.apache.opennlp.grpc.profile.ProfileResolver;
@@ -34,16 +38,19 @@ import org.apache.opennlp.grpc.v1.AnalyzeDocumentRequest;
import org.apache.opennlp.grpc.v1.AnalyzeDocumentResponse;
import org.apache.opennlp.grpc.v1.AnnotatedSentence;
import org.apache.opennlp.grpc.v1.AnnotationSpan;
+import org.apache.opennlp.grpc.v1.Chunk;
import org.apache.opennlp.grpc.v1.ChunkEmbedConfigEntry;
+import org.apache.opennlp.grpc.v1.ChunkEmbeddingGroup;
import org.apache.opennlp.grpc.v1.CoordinateSpace;
import org.apache.opennlp.grpc.v1.DiagnosticSeverity;
+import org.apache.opennlp.grpc.v1.EmbeddingGranularity;
+import org.apache.opennlp.grpc.v1.EmbeddingResult;
import org.apache.opennlp.grpc.v1.InferenceBackend;
import org.apache.opennlp.grpc.v1.ModelBundleRef;
import org.apache.opennlp.grpc.v1.OffsetEncoding;
import org.apache.opennlp.grpc.v1.OpenNlpDocument;
import org.apache.opennlp.grpc.v1.PipelineStep;
import org.apache.opennlp.grpc.v1.ProcessingDiagnostic;
-import org.apache.opennlp.grpc.v1.SemanticChunkingConfig;
import org.apache.opennlp.grpc.v1.Token;
/**
@@ -56,16 +63,26 @@ public class BasicDocumentAnalyzer implements DocumentAnalyzer {
private final ProfileResolver profileResolver;
private final ModelBundleCache modelBundleCache;
+ private final EmbeddingProvider embeddingProvider;
public BasicDocumentAnalyzer(Map<String, String> configuration) {
this(ProfileRegistry.createDefault(), new ModelBundleCache(configuration));
}
public BasicDocumentAnalyzer(ProfileRegistry profileRegistry, ModelBundleCache modelBundleCache) {
+ this(profileRegistry, modelBundleCache, modelBundleCache.getEmbeddingProvider());
+ }
+
+ public BasicDocumentAnalyzer(
+ ProfileRegistry profileRegistry,
+ ModelBundleCache modelBundleCache,
+ EmbeddingProvider embeddingProvider) {
Objects.requireNonNull(profileRegistry, "profileRegistry");
Objects.requireNonNull(modelBundleCache, "modelBundleCache");
+ Objects.requireNonNull(embeddingProvider, "embeddingProvider");
this.profileResolver = new ProfileResolver(profileRegistry);
this.modelBundleCache = modelBundleCache;
+ this.embeddingProvider = embeddingProvider;
}
@Override
@@ -95,7 +112,7 @@ public class BasicDocumentAnalyzer implements DocumentAnalyzer {
document.setMetadata(input.getMetadata());
}
- if (PipelineStepPolicy.shouldRun(profile, PipelineStep.PIPELINE_STEP_SENTENCE_DETECT)) {
+ if (shouldRunStep(request, profile, PipelineStep.PIPELINE_STEP_SENTENCE_DETECT)) {
runStep(
PipelineStep.PIPELINE_STEP_SENTENCE_DETECT,
diagnostics,
@@ -104,7 +121,7 @@ public class BasicDocumentAnalyzer implements DocumentAnalyzer {
addSkippedDiagnostic(diagnostics, PipelineStep.PIPELINE_STEP_SENTENCE_DETECT);
}
- if (PipelineStepPolicy.shouldRun(profile, PipelineStep.PIPELINE_STEP_TOKENIZE)) {
+ if (shouldRunStep(request, profile, PipelineStep.PIPELINE_STEP_TOKENIZE)) {
if (document.getSentencesCount() == 0) {
throw AnalysisException.failedPrecondition(
PipelineStep.PIPELINE_STEP_TOKENIZE.name()
@@ -119,6 +136,36 @@ public class BasicDocumentAnalyzer implements DocumentAnalyzer {
addSkippedDiagnostic(diagnostics, PipelineStep.PIPELINE_STEP_TOKENIZE);
}
+ final String embeddingModelId = resolveEmbeddingModelId(request, profile);
+ if (PipelineStepPolicy.shouldRun(profile, PipelineStep.PIPELINE_STEP_EMBED)) {
+ if (document.getSentencesCount() == 0) {
+ throw AnalysisException.failedPrecondition(
+ PipelineStep.PIPELINE_STEP_EMBED.name()
+ + " requires "
+ + PipelineStep.PIPELINE_STEP_SENTENCE_DETECT.name());
+ }
+ runStep(
+ PipelineStep.PIPELINE_STEP_EMBED,
+ diagnostics,
+ () -> runEmbedding(rawText, document, embeddingModelId, diagnostics));
+ } else {
+ addSkippedDiagnostic(diagnostics, PipelineStep.PIPELINE_STEP_EMBED);
+ }
+
+ if (request.getChunkEmbedConfigsCount() > 0) {
+ runStep(
+ PipelineStep.PIPELINE_STEP_CHUNK,
+ diagnostics,
+ () -> runChunkEmbedConfigs(rawText, document, request, diagnostics));
+ } else if (shouldRunStep(request, profile, PipelineStep.PIPELINE_STEP_CHUNK)) {
+ runStep(
+ PipelineStep.PIPELINE_STEP_CHUNK,
+ diagnostics,
+ () -> runProfileChunking(rawText, document, diagnostics));
+ } else {
+ addSkippedDiagnostic(diagnostics, PipelineStep.PIPELINE_STEP_CHUNK);
+ }
+
final OffsetEncoding requestedEncoding = request.hasOptions()
? request.getOptions().getOffsetEncoding()
: OffsetEncoding.OFFSET_ENCODING_UNSPECIFIED;
@@ -130,7 +177,7 @@ public class BasicDocumentAnalyzer implements DocumentAnalyzer {
.build();
}
- private static void validateSupportedRequest(
+ private void validateSupportedRequest(
AnalyzeDocumentRequest request, AnalysisProfile profile, String rawText) {
for (PipelineStep step : profile.getStepsList()) {
if (step == PipelineStep.PIPELINE_STEP_UNSPECIFIED) {
@@ -141,32 +188,71 @@ public class BasicDocumentAnalyzer implements DocumentAnalyzer {
}
}
- validateOptions(request, rawText);
+ validateOptions(request, profile, rawText);
validateModelBundle(profile);
+ validateEmbeddingRequest(request, profile);
+ validateChunkEmbedConfigs(request);
+ }
+ private void validateChunkEmbedConfigs(AnalyzeDocumentRequest request) {
if (request.getChunkEmbedConfigsCount() == 0) {
return;
}
for (ChunkEmbedConfigEntry entry : request.getChunkEmbedConfigsList()) {
- validateSemanticChunking(entry);
+ ChunkEmbedProcessor.validateEntry(entry, embeddingProvider);
+ }
+ }
+
+ private Set<PipelineStep> resolveEffectiveSteps(
+ AnalyzeDocumentRequest request, AnalysisProfile profile) {
+ final LinkedHashSet<PipelineStep> steps = new LinkedHashSet<>(profile.getStepsList());
+ if (PipelineStepPolicy.shouldRun(profile, PipelineStep.PIPELINE_STEP_EMBED)) {
+ steps.add(PipelineStep.PIPELINE_STEP_SENTENCE_DETECT);
}
- throw AnalysisException.unimplemented("chunk_embed_configs are not implemented on this server");
+ if (request.getChunkEmbedConfigsCount() > 0) {
+ steps.add(PipelineStep.PIPELINE_STEP_SENTENCE_DETECT);
+ for (ChunkEmbedConfigEntry entry : request.getChunkEmbedConfigsList()) {
+ if (entry.hasChunking() && "token".equals(entry.getChunking().getAlgorithm())) {
+ steps.add(PipelineStep.PIPELINE_STEP_TOKENIZE);
+ }
+ }
+ }
+ if (PipelineStepPolicy.shouldRun(profile, PipelineStep.PIPELINE_STEP_CHUNK)
+ && request.getChunkEmbedConfigsCount() == 0) {
+ steps.add(PipelineStep.PIPELINE_STEP_SENTENCE_DETECT);
+ }
+ return steps;
+ }
+
+ private boolean shouldRunStep(
+ AnalyzeDocumentRequest request, AnalysisProfile profile, PipelineStep step) {
+ return resolveEffectiveSteps(request, profile).contains(step);
}
- private static void validateOptions(AnalyzeDocumentRequest request, String rawText) {
+ private void validateOptions(
+ AnalyzeDocumentRequest request, AnalysisProfile profile, String rawText) {
if (!request.hasOptions()) {
return;
}
final AnalysisOptions options = request.getOptions();
final InferenceBackend backend = options.getInferenceBackend();
+ final boolean embedRequested =
+ PipelineStepPolicy.shouldRun(profile, PipelineStep.PIPELINE_STEP_EMBED);
+ final boolean chunkEmbedsRequested = request.getChunkEmbedConfigsList().stream()
+ .anyMatch(entry -> entry.getEmbeddingModelIdsCount() > 0);
+ final boolean dlRequested = embedRequested || chunkEmbedsRequested;
if (backend != InferenceBackend.INFERENCE_BACKEND_UNSPECIFIED
- && backend != InferenceBackend.INFERENCE_BACKEND_OPENNLP_ME) {
+ && backend != InferenceBackend.INFERENCE_BACKEND_OPENNLP_ME
+ && !(dlRequested && embeddingProvider.supportsInferenceBackend(backend))) {
throw AnalysisException.unimplemented(
- "inference_backend " + backend.name() + " is not implemented; only OPENNLP_ME is supported");
+ "inference_backend " + backend.name()
+ + " is not implemented for the configured embedding provider");
}
if (options.hasOnnxEmbeddingModelId() && !options.getOnnxEmbeddingModelId().isBlank()) {
- throw AnalysisException.unimplemented(
- "onnx_embedding_model_id is not implemented (no EMBED step on this server)");
+ if (!embedRequested) {
+ throw AnalysisException.invalidArgument(
+ "onnx_embedding_model_id requires PIPELINE_STEP_EMBED in the analysis profile");
+ }
}
if (options.hasMaxTextLength()
&& options.getMaxTextLength() > 0
@@ -176,6 +262,81 @@ public class BasicDocumentAnalyzer implements DocumentAnalyzer {
}
}
+ private void validateEmbeddingRequest(AnalyzeDocumentRequest request, AnalysisProfile profile) {
+ if (!PipelineStepPolicy.shouldRun(profile, PipelineStep.PIPELINE_STEP_EMBED)) {
+ return;
+ }
+ if (!embeddingProvider.isAvailable()) {
+ throw AnalysisException.notFound(
+ "PIPELINE_STEP_EMBED requested but no embedding models are configured on this server");
+ }
+ final String modelId = resolveEmbeddingModelId(request, profile);
+ if (modelId == null || modelId.isBlank()) {
+ throw AnalysisException.invalidArgument(
+ "onnx_embedding_model_id is required when multiple embedding models are configured");
+ }
+ if (!embeddingProvider.supportsModel(modelId)) {
+ throw AnalysisException.notFound("Unknown embedding model '" + modelId + "'");
+ }
+ }
+
+ private String resolveEmbeddingModelId(AnalyzeDocumentRequest request, AnalysisProfile profile) {
+ if (!PipelineStepPolicy.shouldRun(profile, PipelineStep.PIPELINE_STEP_EMBED)) {
+ return null;
+ }
+ String requested = null;
+ if (request.hasOptions() && request.getOptions().hasOnnxEmbeddingModelId()) {
+ requested = request.getOptions().getOnnxEmbeddingModelId();
+ }
+ return embeddingProvider.resolveModelId(requested);
+ }
+
+ private void runChunkEmbedConfigs(
+ String rawText,
+ OpenNlpDocument.Builder document,
+ AnalyzeDocumentRequest request,
+ List<ProcessingDiagnostic> diagnostics) {
+ if (document.getSentencesCount() == 0) {
+ throw AnalysisException.failedPrecondition(
+ "chunk_embed_configs requires sentence detection backbone");
+ }
+ for (ChunkEmbedConfigEntry entry : request.getChunkEmbedConfigsList()) {
+ if ("token".equals(entry.getChunking().getAlgorithm())) {
+ ensureTokenized(document);
+ }
+ final ChunkEmbeddingGroup group =
+ ChunkEmbedProcessor.buildGroup(rawText, document.build(), entry, embeddingProvider);
+ document.addChunkEmbeddingGroups(group);
+ diagnostics.add(ChunkEmbedProcessor.successDiagnostic(
+ entry.getConfigId(), group.getChunksCount()));
+ }
+ }
+
+ private void runProfileChunking(
+ String rawText,
+ OpenNlpDocument.Builder document,
+ List<ProcessingDiagnostic> diagnostics) {
+ if (document.getSentencesCount() == 0) {
+ throw AnalysisException.failedPrecondition(
+ PipelineStep.PIPELINE_STEP_CHUNK.name()
+ + " requires "
+ + PipelineStep.PIPELINE_STEP_SENTENCE_DETECT.name());
+ }
+ final ChunkEmbeddingGroup group =
+ ChunkEmbedProcessor.buildSentenceGroup(rawText, document.build(), "profile-chunk");
+ document.addChunkEmbeddingGroups(group);
+ diagnostics.add(ChunkEmbedProcessor.successDiagnostic("profile-chunk", group.getChunksCount()));
+ }
+
+ private static void ensureTokenized(OpenNlpDocument.Builder document) {
+ for (AnnotatedSentence sentence : document.getSentencesList()) {
+ if (sentence.getTokensCount() == 0) {
+ throw AnalysisException.failedPrecondition(
+ "token chunking requires " + PipelineStep.PIPELINE_STEP_TOKENIZE.name());
+ }
+ }
+ }
+
private static void validateModelBundle(AnalysisProfile profile) {
if (!profile.hasModelBundle()) {
return;
@@ -193,21 +354,6 @@ public class BasicDocumentAnalyzer implements DocumentAnalyzer {
}
}
- private static void validateSemanticChunking(ChunkEmbedConfigEntry entry) {
- if (!entry.hasChunking() || !entry.getChunking().hasSemanticConfig()) {
- return;
- }
- final SemanticChunkingConfig semantic = entry.getChunking().getSemanticConfig();
- if (semantic.hasSemanticEmbeddingModelId() && !semantic.getSemanticEmbeddingModelId().isBlank()) {
- return;
- }
- if (entry.getEmbeddingModelIdsCount() == 1) {
- return;
- }
- throw AnalysisException.invalidArgument(
- "semantic chunking requires semantic_embedding_model_id or exactly one embedding_model_id");
- }
-
private void runStep(
PipelineStep step,
List<ProcessingDiagnostic> diagnostics,
@@ -281,6 +427,40 @@ public class BasicDocumentAnalyzer implements DocumentAnalyzer {
.build());
}
+ private void runEmbedding(
+ String rawText,
+ OpenNlpDocument.Builder document,
+ String modelId,
+ List<ProcessingDiagnostic> diagnostics) {
+ int embeddingCount = 0;
+ for (AnnotatedSentence sentence : document.getSentencesList()) {
+ final AnnotationSpan sentenceSpan = sentence.getSentenceSpan();
+ final String sentenceText = rawText.substring(sentenceSpan.getStart(), sentenceSpan.getEnd());
+ final float[] vector = embeddingProvider.embed(modelId, sentenceText);
+ document.addEmbeddings(EmbeddingResult.newBuilder()
+ .setModelId(modelId)
+ .addAllVector(toFloatList(vector))
+ .setSourceSpan(sentenceSpan)
+ .setGranularity(EmbeddingGranularity.EMBEDDING_GRANULARITY_SENTENCE)
+ .build());
+ embeddingCount++;
+ }
+ diagnostics.add(ProcessingDiagnostic.newBuilder()
+ .setStep(PipelineStep.PIPELINE_STEP_EMBED)
+ .setSeverity(DiagnosticSeverity.DIAGNOSTIC_SEVERITY_INFO)
+ .setMessage("Generated " + embeddingCount + " sentence embedding(s) with model '"
+ + modelId + "'")
+ .build());
+ }
+
+ private static List<Float> toFloatList(float[] vector) {
+ final List<Float> values = new ArrayList<>(vector.length);
+ for (float value : vector) {
+ values.add(value);
+ }
+ return values;
+ }
+
/**
* Converts every span in the document from Java UTF-16 indices to the requested
* {@link OffsetEncoding} and records the chosen encoding on the document.
@@ -298,6 +478,27 @@ public class BasicDocumentAnalyzer implements DocumentAnalyzer {
}
document.setSentences(i, sentence.build());
}
+ for (int e = 0; e < document.getEmbeddingsCount(); e++) {
+ final EmbeddingResult embedding = document.getEmbeddings(e);
+ document.setEmbeddings(e, embedding.toBuilder()
+ .setSourceSpan(remap(embedding.getSourceSpan(), mapper))
+ .build());
+ }
+ for (int g = 0; g < document.getChunkEmbeddingGroupsCount(); g++) {
+ final ChunkEmbeddingGroup.Builder group = document.getChunkEmbeddingGroups(g).toBuilder();
+ for (int c = 0; c < group.getChunksCount(); c++) {
+ final Chunk.Builder chunk = group.getChunks(c).toBuilder();
+ chunk.setAnnotationSpan(remap(chunk.getAnnotationSpan(), mapper));
+ for (int e = 0; e < chunk.getEmbeddingsCount(); e++) {
+ final EmbeddingResult embedding = chunk.getEmbeddings(e);
+ chunk.setEmbeddings(e, embedding.toBuilder()
+ .setSourceSpan(remap(embedding.getSourceSpan(), mapper))
+ .build());
+ }
+ group.setChunks(c, chunk.build());
+ }
+ document.setChunkEmbeddingGroups(g, group.build());
+ }
document.setOffsetEncoding(mapper.encoding());
}
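The dependency rule in resolveEffectiveSteps can be summarised as: embedding and chunking both imply sentence detection, and a "token" chunking algorithm additionally implies tokenization. The sketch below restates that rule with plain Java types. It is an illustration under assumptions, not the commit's code: EffectiveSteps, Step and resolve are hypothetical names, and PipelineStepPolicy.shouldRun is treated as simple membership in the profile's step list, which may not match its real behaviour.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical restatement of the implied-step rule in resolveEffectiveSteps. */
final class EffectiveSteps {

  /** Stand-in for the PipelineStep values that matter here. */
  enum Step { SENTENCE_DETECT, TOKENIZE, CHUNK, EMBED }

  private EffectiveSteps() {
  }

  /**
   * @param profileSteps    steps listed in the analysis profile
   * @param chunkAlgorithms the algorithm name of each chunk_embed_configs entry
   * @return the profile steps plus every step they depend on, in insertion order
   */
  static Set<Step> resolve(List<Step> profileSteps, List<String> chunkAlgorithms) {
    final Set<Step> steps = new LinkedHashSet<>(profileSteps);
    // Sentence spans are the backbone for embeddings and for every chunker.
    if (steps.contains(Step.EMBED) || steps.contains(Step.CHUNK) || !chunkAlgorithms.isEmpty()) {
      steps.add(Step.SENTENCE_DETECT);
    }
    // Token-window chunking also needs token spans.
    if (chunkAlgorithms.contains("token")) {
      steps.add(Step.TOKENIZE);
    }
    return steps;
  }
}
```

Because the set keeps insertion order, implied steps appear after the steps the profile listed explicitly; execution order is still decided by the fixed sequence of checks in the analyzer.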
diff --git a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/processor/PipelineStepPolicy.java b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/processor/PipelineStepPolicy.java
index bbf5954c..0cc8836a 100644
--- a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/processor/PipelineStepPolicy.java
+++ b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/processor/PipelineStepPolicy.java
@@ -31,7 +31,9 @@ public final class PipelineStepPolicy {
/** Steps implemented by the current processor, in execution order. */
private static final List<PipelineStep> IMPLEMENTED_STEPS = List.of(
PipelineStep.PIPELINE_STEP_SENTENCE_DETECT,
- PipelineStep.PIPELINE_STEP_TOKENIZE);
+ PipelineStep.PIPELINE_STEP_TOKENIZE,
+ PipelineStep.PIPELINE_STEP_CHUNK,
+ PipelineStep.PIPELINE_STEP_EMBED);
private static final Set<PipelineStep> IMPLEMENTED_STEP_SET = Set.copyOf(IMPLEMENTED_STEPS);
diff --git a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/server/OpenNlpGrpcServer.java b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/server/OpenNlpGrpcServer.java
index 111a66bb..dd83f85b 100644
--- a/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/server/OpenNlpGrpcServer.java
+++ b/opennlp-grpc/opennlp-grpc-service/src/main/java/org/apache/opennlp/grpc/server/OpenNlpGrpcServer.java
@@ -114,7 +114,7 @@ public class OpenNlpGrpcServer implements Callable<Integer> {
this.server.start();
logger.info("Started OpenNlpGrpcServer on port {}", server.getPort());
- registerShutdownHook();
+ registerShutdownHook(modelBundleCache);
}
public void awaitTermination() throws InterruptedException {
@@ -149,13 +149,14 @@ public class OpenNlpGrpcServer implements Callable<Integer> {
return configuration;
}
- private void registerShutdownHook() {
+ private void registerShutdownHook(ModelBundleCache modelBundleCache) {
Runtime.getRuntime()
.addShutdownHook(
new Thread(
() -> {
try {
stop();
+ modelBundleCache.close();
} catch (Exception e) {
logger.error(
"Error when trying to shutdown a lifecycle component: {}",
diff --git a/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/chunk/ChunkEmbedProcessorSemanticTest.java b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/chunk/ChunkEmbedProcessorSemanticTest.java
new file mode 100644
index 00000000..18b783e4
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/chunk/ChunkEmbedProcessorSemanticTest.java
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.chunk;
+
+import java.util.Map;
+
+import org.apache.opennlp.grpc.embedding.StubEmbeddingProvider;
+import org.apache.opennlp.grpc.v1.AnnotatedSentence;
+import org.apache.opennlp.grpc.v1.AnnotationSpan;
+import org.apache.opennlp.grpc.v1.ChunkEmbedConfigEntry;
+import org.apache.opennlp.grpc.v1.ChunkingSpec;
+import org.apache.opennlp.grpc.v1.CoordinateSpace;
+import org.apache.opennlp.grpc.v1.OpenNlpDocument;
+import org.apache.opennlp.grpc.v1.SemanticChunkingConfig;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+class ChunkEmbedProcessorSemanticTest {
+
+ private static final float[] TOPIC_A = {1f, 0f, 0f};
+ private static final float[] TOPIC_B = {0f, 1f, 0f};
+
+ private final StubEmbeddingProvider provider = new StubEmbeddingProvider(
+ Map.of("minilm", 3),
+ (modelId, text) -> text.startsWith("A") ? TOPIC_A : TOPIC_B,
+ java.util.Set.of());
+
+ @Test
+ void buildsSemanticGroupWithEmbeddings() {
+ final String rawText = "Aa.Ab.Bc.";
+ final OpenNlpDocument document = OpenNlpDocument.newBuilder()
+ .setRawText(rawText)
+ .addSentences(sentence(0, 3))
+ .addSentences(sentence(3, 6))
+ .addSentences(sentence(6, 9))
+ .build();
+ final ChunkEmbedConfigEntry entry = ChunkEmbedConfigEntry.newBuilder()
+ .setConfigId("semantic-topics")
+ .setChunking(ChunkingSpec.newBuilder()
+ .setAlgorithm("semantic")
+ .setSemanticConfig(SemanticChunkingConfig.newBuilder()
+ .setSimilarityThreshold(0.9f)
+ .setSemanticEmbeddingModelId("minilm")
+ .build())
+ .build())
+ .addEmbeddingModelIds("minilm")
+ .build();
+
+ final var group = ChunkEmbedProcessor.buildGroup(rawText, document, entry, provider);
+
+ assertEquals(2, group.getChunksCount());
+ assertEquals(1, group.getChunks(0).getEmbeddingsCount());
+ }
+
+ private static AnnotatedSentence sentence(int start, int end) {
+ return AnnotatedSentence.newBuilder()
+ .setSentenceSpan(AnnotationSpan.newBuilder()
+ .setStart(start)
+ .setEnd(end)
+ .setSpace(CoordinateSpace.COORDINATE_SPACE_CHAR_DOCUMENT)
+ .build())
+ .build();
+ }
+}
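The test above relies on a fixed similarity threshold: two consecutive sentences with identical vectors stay together, while the orthogonal third sentence starts a new chunk. The following sketch shows one way such boundaries could be computed. It is an assumption-laden illustration, not SemanticChunker itself: CosineBoundaries and group are hypothetical names, only the fixed-threshold mode is covered (no percentile mode, no min/max size constraints), and splitting on "similarity strictly below threshold" is a guess at the comparison.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of fixed-threshold semantic boundary placement. */
final class CosineBoundaries {

  private CosineBoundaries() {
  }

  /** Cosine similarity of two equal-length vectors; 0 if either has zero norm. */
  static double cosine(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vectors must have the same dimension");
    }
    double dot = 0;
    double normA = 0;
    double normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) {
      return 0;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Groups sentence indices, starting a new group when adjacent similarity drops below threshold. */
  static List<List<Integer>> group(List<float[]> sentenceVectors, double threshold) {
    final List<List<Integer>> chunks = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    for (int i = 0; i < sentenceVectors.size(); i++) {
      if (i > 0 && cosine(sentenceVectors.get(i - 1), sentenceVectors.get(i)) < threshold) {
        chunks.add(current);
        current = new ArrayList<>();
      }
      current.add(i);
    }
    if (!current.isEmpty()) {
      chunks.add(current);
    }
    return chunks;
  }
}
```

For the vectors used in the test (TOPIC_A, TOPIC_A, TOPIC_B) and a threshold of 0.9, the adjacent similarities are 1.0 and 0.0, so the sketch yields the groups [0, 1] and [2].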
diff --git a/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/chunk/SegmentationChunkerTest.java b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/chunk/SegmentationChunkerTest.java
new file mode 100644
index 00000000..8ef4a67e
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/chunk/SegmentationChunkerTest.java
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.chunk;
+
+import java.util.Map;
+
+import org.apache.opennlp.grpc.embedding.EmbeddingProvider;
+import org.apache.opennlp.grpc.embedding.StubEmbeddingProvider;
+import org.apache.opennlp.grpc.v1.AnnotatedSentence;
+import org.apache.opennlp.grpc.v1.AnnotationSpan;
+import org.apache.opennlp.grpc.v1.ChunkingSpec;
+import org.apache.opennlp.grpc.v1.CoordinateSpace;
+import org.apache.opennlp.grpc.v1.OpenNlpDocument;
+import org.apache.opennlp.grpc.v1.Token;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+class SegmentationChunkerTest {
+
+ private static final EmbeddingProvider NO_MODELS = new StubEmbeddingProvider(Map.of());
+
+ @Test
+ void sentenceAlgorithmCreatesOneChunkPerSentence() {
+ final OpenNlpDocument document = OpenNlpDocument.newBuilder()
+ .setRawText("One. Two!")
+ .addSentences(sentence(0, 4))
+ .addSentences(sentence(5, 9))
+ .build();
+
+ final var chunks = SegmentationChunker.segment(document.getRawText(), document,
+ ChunkingSpec.newBuilder().setAlgorithm("sentence").build(), NO_MODELS);
+
+ assertEquals(2, chunks.size());
+ assertEquals(0, chunks.get(0).start());
+ assertEquals(4, chunks.get(0).end());
+ assertEquals(1, chunks.get(1).sentenceIndices().size());
+ }
+
+ @Test
+ void tokenAlgorithmCreatesOverlappingWindows() {
+ final OpenNlpDocument document = OpenNlpDocument.newBuilder()
+ .setRawText("a b c d e")
+ .addSentences(AnnotatedSentence.newBuilder()
+ .setSentenceSpan(span(0, 9))
+ .addTokens(token("a", 0, 1))
+ .addTokens(token("b", 2, 3))
+ .addTokens(token("c", 4, 5))
+ .addTokens(token("d", 6, 7))
+ .addTokens(token("e", 8, 9))
+ .build())
+ .build();
+
+ final var chunks = SegmentationChunker.segment(document.getRawText(), document,
+ ChunkingSpec.newBuilder()
+ .setAlgorithm("token")
+ .setChunkSize(3)
+ .setChunkOverlap(1)
+ .build(),
+ NO_MODELS);
+
+ assertEquals(2, chunks.size());
+ assertEquals(0, chunks.get(0).start());
+ assertEquals(5, chunks.get(0).end());
+ assertEquals(4, chunks.get(1).start());
+ assertEquals(9, chunks.get(1).end());
+ }
+
+ private static AnnotatedSentence sentence(int start, int end) {
+ return AnnotatedSentence.newBuilder()
+ .setSentenceSpan(span(start, end))
+ .build();
+ }
+
+ private static Token token(String text, int start, int end) {
+ return Token.newBuilder()
+ .setText(text)
+ .setAnnotationSpan(span(start, end))
+ .build();
+ }
+
+ private static AnnotationSpan span(int start, int end) {
+ return AnnotationSpan.newBuilder()
+ .setStart(start)
+ .setEnd(end)
+ .setSpace(CoordinateSpace.COORDINATE_SPACE_CHAR_DOCUMENT)
+ .build();
+ }
+}
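The token-window test expects five tokens with chunk size 3 and overlap 1 to produce two windows. A sliding window with stride size minus overlap reproduces that result. The sketch below is a hypothetical illustration (TokenWindows and windows are invented names, and the real SegmentationChunker is not shown in this part of the diff, so its edge-case handling may differ). It returns half-open token index ranges, which a caller would map to character spans using the first and last token of each window.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of overlapping token windows. */
final class TokenWindows {

  private TokenWindows() {
  }

  /**
   * @return half-open [start, end) token index ranges; consecutive windows
   *         share {@code overlap} tokens, and the last window ends at tokenCount.
   */
  static List<int[]> windows(int tokenCount, int size, int overlap) {
    if (size <= 0 || overlap < 0 || overlap >= size) {
      throw new IllegalArgumentException("require size > 0 and 0 <= overlap < size");
    }
    final List<int[]> result = new ArrayList<>();
    final int stride = size - overlap;
    for (int start = 0; start < tokenCount; start += stride) {
      final int end = Math.min(start + size, tokenCount);
      result.add(new int[] {start, end});
      if (end == tokenCount) {
        break; // the final token is covered; avoid a trailing window of pure overlap
      }
    }
    return result;
  }
}
```

For windows(5, 3, 1) this gives [0, 3) and [2, 5). With the tokens from the test ("a" at 0-1 through "e" at 8-9), those windows correspond to the character spans 0-5 and 4-9 that the assertions check.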
diff --git a/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/chunk/SemanticChunkerTest.java b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/chunk/SemanticChunkerTest.java
new file mode 100644
index 00000000..bb0a7147
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/chunk/SemanticChunkerTest.java
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.chunk;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.opennlp.grpc.embedding.StubEmbeddingProvider;
+import org.apache.opennlp.grpc.v1.AnnotatedSentence;
+import org.apache.opennlp.grpc.v1.AnnotationSpan;
+import org.apache.opennlp.grpc.v1.ChunkingSpec;
+import org.apache.opennlp.grpc.v1.CoordinateSpace;
+import org.apache.opennlp.grpc.v1.OpenNlpDocument;
+import org.apache.opennlp.grpc.v1.SemanticChunkingConfig;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+class SemanticChunkerTest {
+
+ private static final float[] TOPIC_A = {1f, 0f, 0f};
+ private static final float[] TOPIC_B = {0f, 1f, 0f};
+
+ private final StubEmbeddingProvider provider = new StubEmbeddingProvider(
+ Map.of("minilm", 3),
+ (modelId, text) -> text.startsWith("A") ? TOPIC_A : TOPIC_B,
+ Set.of());
+
+ @Test
+ void splitsWhenAdjacentSentenceSimilarityIsLow() {
+ final OpenNlpDocument document = OpenNlpDocument.newBuilder()
+ .setRawText("Aa.Ab.Bc.")
+ .addSentences(sentence(0, 3))
+ .addSentences(sentence(3, 6))
+ .addSentences(sentence(6, 9))
+ .build();
+
+ final var chunks = SemanticChunker.chunk(
+ document.getRawText(),
+ document,
+ SemanticChunkingConfig.newBuilder().setSimilarityThreshold(0.9f).build(),
+ provider,
+ "minilm");
+
+ assertEquals(2, chunks.size());
+ assertEquals(List.of(0, 1), chunks.get(0).sentenceIndices());
+ assertEquals(List.of(2), chunks.get(1).sentenceIndices());
+ }
+
+ @Test
+ void mergesUndersizedTrailingChunkIntoPrecedingChunk() {
+ final OpenNlpDocument document = OpenNlpDocument.newBuilder()
+ .setRawText("Aa.Ab.Bc.")
+ .addSentences(sentence(0, 3))
+ .addSentences(sentence(3, 6))
+ .addSentences(sentence(6, 9))
+ .build();
+
+ final var chunks = SemanticChunker.chunk(
+ document.getRawText(),
+ document,
+ SemanticChunkingConfig.newBuilder()
+ .setSimilarityThreshold(0.9f)
+ .setMinChunkSentences(2)
+ .build(),
+ provider,
+ "minilm");
+
+ assertEquals(1, chunks.size());
+ assertEquals(List.of(0, 1, 2), chunks.get(0).sentenceIndices());
+ }
+
+ @Test
+ void mergesUndersizedLeadingChunkWithFollowingChunk() {
+ final OpenNlpDocument document = OpenNlpDocument.newBuilder()
+ .setRawText("Ba.Ab.Ac.")
+ .addSentences(sentence(0, 3))
+ .addSentences(sentence(3, 6))
+ .addSentences(sentence(6, 9))
+ .build();
+
+ final var chunks = SemanticChunker.chunk(
+ document.getRawText(),
+ document,
+ SemanticChunkingConfig.newBuilder()
+ .setSimilarityThreshold(0.9f)
+ .setMinChunkSentences(2)
+ .build(),
+ provider,
+ "minilm");
+
+ assertEquals(1, chunks.size());
+ assertEquals(List.of(0, 1, 2), chunks.get(0).sentenceIndices());
+ }
+
+ @Test
+ void splitsChunksLargerThanMaxChunkSentences() {
+ final OpenNlpDocument document = OpenNlpDocument.newBuilder()
+ .setRawText("Aa.Ab.Ac.Ad.")
+ .addSentences(sentence(0, 3))
+ .addSentences(sentence(3, 6))
+ .addSentences(sentence(6, 9))
+ .addSentences(sentence(9, 12))
+ .build();
+
+ final var chunks = SemanticChunker.chunk(
+ document.getRawText(),
+ document,
+ SemanticChunkingConfig.newBuilder()
+ .setSimilarityThreshold(0.9f)
+ .setMaxChunkSentences(2)
+ .build(),
+ provider,
+ "minilm");
+
+ assertEquals(2, chunks.size());
+ assertEquals(List.of(0, 1), chunks.get(0).sentenceIndices());
+ assertEquals(List.of(2, 3), chunks.get(1).sentenceIndices());
+ }
+
+ @Test
+ void cosineSimilarityIsOneForIdenticalVectors() {
+ assertEquals(1f, SemanticChunker.cosineSimilarity(TOPIC_A, TOPIC_A), 0.0001f);
+ }
+
+ private static AnnotatedSentence sentence(int start, int end) {
+ return AnnotatedSentence.newBuilder()
+ .setSentenceSpan(AnnotationSpan.newBuilder()
+ .setStart(start)
+ .setEnd(end)
+ .setSpace(CoordinateSpace.COORDINATE_SPACE_CHAR_DOCUMENT)
+ .build())
+ .build();
+ }
+}
diff --git a/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/embedding/EmbeddingProviderFactoryTest.java b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/embedding/EmbeddingProviderFactoryTest.java
new file mode 100644
index 00000000..167e02dd
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/embedding/EmbeddingProviderFactoryTest.java
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.embedding;
+
+import java.util.Map;
+
+import org.apache.opennlp.grpc.processor.AnalysisException;
+import org.apache.opennlp.grpc.v1.InferenceBackend;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertInstanceOf;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+class EmbeddingProviderFactoryTest {
+
+ @Test
+ void defaultsToCpuProvider() {
+ final EmbeddingProvider provider = EmbeddingProviderFactory.create(Map.of());
+ assertInstanceOf(OnnxRuntimeEmbeddingProvider.class, provider);
+ assertTrue(provider.supportsInferenceBackend(InferenceBackend.INFERENCE_BACKEND_ONNX_RUNTIME));
+ assertFalse(provider.supportsInferenceBackend(InferenceBackend.INFERENCE_BACKEND_CUDA));
+ }
+
+ @Test
+ void selectsCudaProviderFromConfig() {
+ final EmbeddingProvider provider =
+ EmbeddingProviderFactory.create(Map.of("model.embedder.backend", "cuda"));
+ assertInstanceOf(CudaEmbeddingProvider.class, provider);
+ assertTrue(provider.supportsInferenceBackend(InferenceBackend.INFERENCE_BACKEND_CUDA));
+ assertTrue(provider.supportsInferenceBackend(InferenceBackend.INFERENCE_BACKEND_ONNX_RUNTIME_GPU));
+ }
+
+ @Test
+ void rejectsUnknownBackend() {
+ final AnalysisException e = assertThrows(AnalysisException.class,
+ () -> EmbeddingProviderFactory.create(Map.of("model.embedder.backend", "openvino")));
+ assertEquals(AnalysisException.FailureType.INVALID_ARGUMENT, e.getFailureType());
+ }
+
+ @Test
+ void rejectsGpuDeviceIdWithoutCudaBackend() {
+ final AnalysisException e = assertThrows(AnalysisException.class,
+ () -> EmbeddingProviderFactory.create(Map.of("model.embedder.gpu_device_id", "1")));
+ assertEquals(AnalysisException.FailureType.INVALID_ARGUMENT, e.getFailureType());
+ }
+}
diff --git a/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/embedding/StubEmbeddingProvider.java b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/embedding/StubEmbeddingProvider.java
new file mode 100644
index 00000000..8e3ed858
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/embedding/StubEmbeddingProvider.java
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.embedding;
+
+import java.util.Map;
+import java.util.Set;
+import java.util.function.BiFunction;
+
+import org.apache.opennlp.grpc.v1.InferenceBackend;
+
+/**
+ * Test double returning deterministic or caller-supplied vectors.
+ */
+public final class StubEmbeddingProvider implements EmbeddingProvider {
+
+ private final Map<String, Integer> dimensions;
+ private final BiFunction<String, String, float[]> embedFn;
+ private final Set<InferenceBackend> backends;
+
+ public StubEmbeddingProvider(Map<String, Integer> dimensions) {
+ this(dimensions, null, Set.of(InferenceBackend.INFERENCE_BACKEND_ONNX_RUNTIME));
+ }
+
+ public StubEmbeddingProvider(
+ Map<String, Integer> dimensions,
+ BiFunction<String, String, float[]> embedFn,
+ Set<InferenceBackend> backends) {
+ this.dimensions = Map.copyOf(dimensions);
+ this.embedFn = embedFn;
+ this.backends = Set.copyOf(backends);
+ }
+
+ @Override
+ public boolean isAvailable() {
+ return !dimensions.isEmpty();
+ }
+
+ @Override
+ public Set<String> registeredModelIds() {
+ return dimensions.keySet();
+ }
+
+ @Override
+ public boolean supportsModel(String modelId) {
+ return dimensions.containsKey(modelId);
+ }
+
+ @Override
+ public int embeddingDimension(String modelId) {
+ return dimensions.getOrDefault(modelId, 0);
+ }
+
+ @Override
+ public float[] embed(String modelId, String text) {
+ if (embedFn != null) {
+ return embedFn.apply(modelId, text);
+ }
+ final int dimension = embeddingDimension(modelId);
+ final float[] vector = new float[dimension];
+ final int seed = (modelId + ":" + text).hashCode();
+ for (int i = 0; i < dimension; i++) {
+ vector[i] = (seed + i) * 0.001f;
+ }
+ return vector;
+ }
+
+ @Override
+ public boolean supportsInferenceBackend(InferenceBackend backend) {
+ return backend == InferenceBackend.INFERENCE_BACKEND_UNSPECIFIED
+ || backend == InferenceBackend.INFERENCE_BACKEND_OPENNLP_ME
+ || backend == InferenceBackend.INFERENCE_BACKEND_ONNX_RUNTIME
+ || backends.contains(backend);
+ }
+}
diff --git a/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzerChunkEmbedTest.java b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzerChunkEmbedTest.java
new file mode 100644
index 00000000..45ed0391
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzerChunkEmbedTest.java
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.processor;
+
+import java.util.Map;
+
+import org.apache.opennlp.grpc.embedding.StubEmbeddingProvider;
+import org.apache.opennlp.grpc.model.ModelBundleCache;
+import org.apache.opennlp.grpc.profile.ProfileRegistry;
+import org.apache.opennlp.grpc.v1.AnalysisProfile;
+import org.apache.opennlp.grpc.v1.AnalyzeDocumentRequest;
+import org.apache.opennlp.grpc.v1.ChunkEmbedConfigEntry;
+import org.apache.opennlp.grpc.v1.ChunkingSpec;
+import org.apache.opennlp.grpc.v1.EmbeddingGranularity;
+import org.apache.opennlp.grpc.v1.OpenNlpDocument;
+import org.apache.opennlp.grpc.v1.PipelineStep;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+class BasicDocumentAnalyzerChunkEmbedTest {
+
+ private static final String TEXT = "First sentence. Second sentence!";
+
+ private final ModelBundleCache modelBundleCache = new ModelBundleCache(Map.of());
+ private final StubEmbeddingProvider embeddingProvider =
+ new StubEmbeddingProvider(Map.of("minilm", 3, "e5", 3));
+ private final BasicDocumentAnalyzer analyzer = new BasicDocumentAnalyzer(
+ ProfileRegistry.createDefault(), modelBundleCache, embeddingProvider);
+
+ @Test
+ void chunkEmbedConfigsProduceGroupsWithEmbeddings() {
+ final var response = analyzer.analyze(AnalyzeDocumentRequest.newBuilder()
+ .setDocument(OpenNlpDocument.newBuilder().setRawText(TEXT).build())
+ .addChunkEmbedConfigs(ChunkEmbedConfigEntry.newBuilder()
+ .setConfigId("sentence-chunks")
+ .setChunking(ChunkingSpec.newBuilder().setAlgorithm("sentence").build())
+ .addEmbeddingModelIds("minilm")
+ .addEmbeddingModelIds("e5")
+ .build())
+ .build());
+
+ assertEquals(2, response.getDocument().getSentencesCount());
+ assertEquals(1, response.getDocument().getChunkEmbeddingGroupsCount());
+ final var group = response.getDocument().getChunkEmbeddingGroups(0);
+ assertEquals("sentence-chunks", group.getGroupId());
+ assertEquals(2, group.getChunksCount());
+ assertEquals(2, group.getChunks(0).getEmbeddingsCount());
+ assertEquals("minilm", group.getChunks(0).getEmbeddings(0).getModelId());
+ assertEquals(
+ EmbeddingGranularity.EMBEDDING_GRANULARITY_CHUNK_LEVEL,
+ group.getChunks(0).getEmbeddings(0).getGranularity());
+ assertTrue(group.getStats().getChunkCount() > 0);
+ }
+
+ @Test
+ void profileChunkStepProducesSentenceGroupsWithoutEmbeddings() {
+ final var response = analyzer.analyze(AnalyzeDocumentRequest.newBuilder()
+ .setDocument(OpenNlpDocument.newBuilder().setRawText(TEXT).build())
+ .setProfile(AnalysisProfile.newBuilder()
+ .setProfileId("chunk-only")
+ .addSteps(PipelineStep.PIPELINE_STEP_SENTENCE_DETECT)
+ .addSteps(PipelineStep.PIPELINE_STEP_CHUNK)
+ .build())
+ .build());
+
+ assertEquals(1, response.getDocument().getChunkEmbeddingGroupsCount());
+ assertEquals(2, response.getDocument().getChunkEmbeddingGroups(0).getChunksCount());
+ assertEquals(0, response.getDocument().getChunkEmbeddingGroups(0).getChunks(0).getEmbeddingsCount());
+ }
+
+ @Test
+ void tokenChunkingAutoRunsTokenizationBackbone() {
+ final var response = analyzer.analyze(AnalyzeDocumentRequest.newBuilder()
+ .setDocument(OpenNlpDocument.newBuilder().setRawText("one two three four five").build())
+ .addChunkEmbedConfigs(ChunkEmbedConfigEntry.newBuilder()
+ .setConfigId("token-chunks")
+ .setChunking(ChunkingSpec.newBuilder()
+ .setAlgorithm("token")
+ .setChunkSize(2)
+ .setChunkOverlap(0)
+ .build())
+ .addEmbeddingModelIds("minilm")
+ .build())
+ .build());
+
+ assertTrue(response.getDocument().getSentences(0).getTokensCount() > 0);
+ assertEquals(3, response.getDocument().getChunkEmbeddingGroups(0).getChunksCount());
+ }
+}
diff --git a/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzerEmbeddingTest.java b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzerEmbeddingTest.java
new file mode 100644
index 00000000..69f25d27
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzerEmbeddingTest.java
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.processor;
+
+import java.util.Map;
+
+import org.apache.opennlp.grpc.embedding.StubEmbeddingProvider;
+import org.apache.opennlp.grpc.model.ModelBundleCache;
+import org.apache.opennlp.grpc.profile.ProfileRegistry;
+import org.apache.opennlp.grpc.v1.AnalysisOptions;
+import org.apache.opennlp.grpc.v1.AnalysisProfile;
+import org.apache.opennlp.grpc.v1.AnalyzeDocumentRequest;
+import org.apache.opennlp.grpc.v1.EmbeddingGranularity;
+import org.apache.opennlp.grpc.v1.InferenceBackend;
+import org.apache.opennlp.grpc.v1.OpenNlpDocument;
+import org.apache.opennlp.grpc.v1.PipelineStep;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+class BasicDocumentAnalyzerEmbeddingTest {
+
+ private static final String TEXT = "One sentence. Two sentences!";
+
+ private final ModelBundleCache modelBundleCache = new ModelBundleCache(Map.of());
+ private final StubEmbeddingProvider embeddingProvider =
+ new StubEmbeddingProvider(Map.of("minilm", 4));
+ private final BasicDocumentAnalyzer analyzer = new BasicDocumentAnalyzer(
+ ProfileRegistry.createDefault(), modelBundleCache, embeddingProvider);
+
+ @Test
+ void generatesSentenceEmbeddingsWhenEmbedStepRequested() {
+ final var response = analyzer.analyze(AnalyzeDocumentRequest.newBuilder()
+ .setDocument(OpenNlpDocument.newBuilder().setRawText(TEXT).build())
+ .setProfile(AnalysisProfile.newBuilder()
+ .setProfileId("with-embed")
+ .addSteps(PipelineStep.PIPELINE_STEP_SENTENCE_DETECT)
+ .addSteps(PipelineStep.PIPELINE_STEP_TOKENIZE)
+ .addSteps(PipelineStep.PIPELINE_STEP_EMBED)
+ .build())
+ .setOptions(AnalysisOptions.newBuilder()
+ .setOnnxEmbeddingModelId("minilm")
+ .setInferenceBackend(InferenceBackend.INFERENCE_BACKEND_ONNX_RUNTIME)
+ .build())
+ .build());
+
+ assertEquals(2, response.getDocument().getSentencesCount());
+ assertEquals(2, response.getDocument().getEmbeddingsCount());
+ assertEquals("minilm", response.getDocument().getEmbeddings(0).getModelId());
+ assertEquals(4, response.getDocument().getEmbeddings(0).getVectorCount());
+ assertEquals(
+ EmbeddingGranularity.EMBEDDING_GRANULARITY_SENTENCE,
+ response.getDocument().getEmbeddings(0).getGranularity());
+ assertTrue(response.getDiagnosticsList().stream()
+ .anyMatch(d -> d.getStep() == PipelineStep.PIPELINE_STEP_EMBED));
+ }
+
+ @Test
+ void rejectsUnknownEmbeddingModel() {
+ final AnalysisException error = assertThrows(AnalysisException.class, () -> analyzer.analyze(
+ AnalyzeDocumentRequest.newBuilder()
+ .setDocument(OpenNlpDocument.newBuilder().setRawText(TEXT).build())
+ .setProfile(AnalysisProfile.newBuilder()
+ .setProfileId("with-embed")
+ .addSteps(PipelineStep.PIPELINE_STEP_SENTENCE_DETECT)
+ .addSteps(PipelineStep.PIPELINE_STEP_TOKENIZE)
+ .addSteps(PipelineStep.PIPELINE_STEP_EMBED)
+ .build())
+ .setOptions(AnalysisOptions.newBuilder().setOnnxEmbeddingModelId("missing").build())
+ .build()));
+
+ assertEquals(AnalysisException.FailureType.NOT_FOUND, error.getFailureType());
+ }
+}
diff --git a/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzerPolicyTest.java b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzerPolicyTest.java
index da40e5ec..d5096aca 100644
--- a/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzerPolicyTest.java
+++ b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzerPolicyTest.java
@@ -80,18 +80,26 @@ class BasicDocumentAnalyzerPolicyTest {
}
@Test
- void rejectsChunkEmbedConfigs() {
+ void rejectsSemanticChunkEmbedConfigsWithoutEmbeddingModel() {
final BasicDocumentAnalyzer analyzer = new BasicDocumentAnalyzer(Map.of());
final AnalysisException error = assertThrows(AnalysisException.class, () -> analyzer.analyze(
AnalyzeDocumentRequest.newBuilder()
.setDocument(OpenNlpDocument.newBuilder().setRawText("Hello world.").build())
.addChunkEmbedConfigs(ChunkEmbedConfigEntry.newBuilder()
- .setConfigId("token-chunks")
+ .setConfigId("semantic")
+ .setChunking(ChunkingSpec.newBuilder()
+ .setAlgorithm("semantic")
+ .setSemanticConfig(SemanticChunkingConfig.newBuilder()
+ .setSimilarityThreshold(0.5f)
+ .build())
+ .build())
+ .addEmbeddingModelIds("minilm")
+ .addEmbeddingModelIds("e5")
.build())
.build()));
- assertEquals(AnalysisException.FailureType.UNIMPLEMENTED, error.getFailureType());
+ assertEquals(AnalysisException.FailureType.INVALID_ARGUMENT, error.getFailureType());
}
@Test
@@ -143,7 +151,7 @@ class BasicDocumentAnalyzerPolicyTest {
}
@Test
- void rejectsOnnxEmbeddingModelId() {
+ void rejectsOnnxEmbeddingModelIdWithoutEmbedStep() {
final BasicDocumentAnalyzer analyzer = new BasicDocumentAnalyzer(Map.of());
final AnalysisException error = assertThrows(AnalysisException.class, () -> analyzer.analyze(
@@ -152,7 +160,25 @@ class BasicDocumentAnalyzerPolicyTest {
.setOptions(AnalysisOptions.newBuilder().setOnnxEmbeddingModelId("minilm").build())
.build()));
- assertEquals(AnalysisException.FailureType.UNIMPLEMENTED, error.getFailureType());
+ assertEquals(AnalysisException.FailureType.INVALID_ARGUMENT, error.getFailureType());
+ }
+
+ @Test
+ void rejectsEmbedStepWhenNoModelsConfigured() {
+ final BasicDocumentAnalyzer analyzer = new BasicDocumentAnalyzer(Map.of());
+
+ final AnalysisException error = assertThrows(AnalysisException.class, () -> analyzer.analyze(
+ AnalyzeDocumentRequest.newBuilder()
+ .setDocument(OpenNlpDocument.newBuilder().setRawText("Hello world.").build())
+ .setProfile(AnalysisProfile.newBuilder()
+ .setProfileId("with-embed")
+ .addSteps(PipelineStep.PIPELINE_STEP_SENTENCE_DETECT)
+ .addSteps(PipelineStep.PIPELINE_STEP_TOKENIZE)
+ .addSteps(PipelineStep.PIPELINE_STEP_EMBED)
+ .build())
+ .build()));
+
+ assertEquals(AnalysisException.FailureType.NOT_FOUND, error.getFailureType());
}
@Test
diff --git a/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzerSemanticChunkTest.java b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzerSemanticChunkTest.java
new file mode 100644
index 00000000..adb8be38
--- /dev/null
+++ b/opennlp-grpc/opennlp-grpc-service/src/test/java/org/apache/opennlp/grpc/processor/BasicDocumentAnalyzerSemanticChunkTest.java
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the specific
+ * language governing permissions and limitations under the License.
+ */
+package org.apache.opennlp.grpc.processor;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.opennlp.grpc.embedding.StubEmbeddingProvider;
+import org.apache.opennlp.grpc.model.ModelBundleCache;
+import org.apache.opennlp.grpc.profile.ProfileRegistry;
+import org.apache.opennlp.grpc.v1.AnalyzeDocumentRequest;
+import org.apache.opennlp.grpc.v1.ChunkEmbedConfigEntry;
+import org.apache.opennlp.grpc.v1.ChunkEmbeddingGroup;
+import org.apache.opennlp.grpc.v1.ChunkingSpec;
+import org.apache.opennlp.grpc.v1.OpenNlpDocument;
+import org.apache.opennlp.grpc.v1.SemanticChunkingConfig;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+class BasicDocumentAnalyzerSemanticChunkTest {
+
+ private static final List<Float> TOPIC_BUSINESS = List.of(1f, 0f, 0f);
+ private static final List<Float> TOPIC_WEATHER = List.of(0f, 1f, 0f);
+
+ /** Embeds any text mentioning rain as the weather topic and everything else as business. */
+ private final StubEmbeddingProvider embeddingProvider = new StubEmbeddingProvider(
+ Map.of("minilm", 3),
+ (modelId, text) -> text.contains("rain")
+ ? new float[] {0f, 1f, 0f} : new float[] {1f, 0f, 0f},
+ Set.of());
+
+ private final BasicDocumentAnalyzer analyzer = new BasicDocumentAnalyzer(
+ ProfileRegistry.createDefault(),
+ new ModelBundleCache(Map.of()),
+ embeddingProvider);
+
+ @Test
+ void semanticChunkEmbedConfigSplitsAtTopicBoundary() {
+ final var response = analyzer.analyze(AnalyzeDocumentRequest.newBuilder()
+ .setDocument(OpenNlpDocument.newBuilder()
+ .setRawText("The merger closed on Monday. The shareholders approved the deal. "
+ + "Heavy rain flooded the valley.")
+ .build())
+ .addChunkEmbedConfigs(ChunkEmbedConfigEntry.newBuilder()
+ .setConfigId("semantic-topics")
+ .setChunking(ChunkingSpec.newBuilder()
+ .setAlgorithm("semantic")
+ .setSemanticConfig(SemanticChunkingConfig.newBuilder()
+ .setSimilarityThreshold(0.5f)
+ .setSemanticEmbeddingModelId("minilm")
+ .build())
+ .build())
+ .addEmbeddingModelIds("minilm")
+ .build())
+ .build());
+
+ assertEquals(3, response.getDocument().getSentencesCount());
+ assertEquals(1, response.getDocument().getChunkEmbeddingGroupsCount());
+
+ final ChunkEmbeddingGroup group = response.getDocument().getChunkEmbeddingGroups(0);
+ assertEquals(2, group.getChunksCount());
+ assertEquals(List.of(0, 1), group.getChunks(0).getContainedSentenceIndicesList());
+ assertEquals(List.of(2), group.getChunks(1).getContainedSentenceIndicesList());
+
+ assertEquals(1, group.getChunks(0).getEmbeddingsCount());
+ assertEquals(1, group.getChunks(1).getEmbeddingsCount());
+ assertEquals("minilm", group.getChunks(0).getEmbeddings(0).getModelId());
+ assertEquals(TOPIC_BUSINESS, group.getChunks(0).getEmbeddings(0).getVectorList());
+ assertEquals(TOPIC_WEATHER, group.getChunks(1).getEmbeddings(0).getVectorList());
+ }
+}