This is an automated email from the ASF dual-hosted git repository.

rzo1 pushed a commit to branch docs
in repository https://gitbox.apache.org/repos/asf/opennlp.git
commit afdbcc39d6639c8d0a6e26a38617ccb2c51f5f8a
Author: Richard Zowalla <[email protected]>
AuthorDate: Tue Mar 17 08:48:54 2026 +0100

    OPENNLP-1714 - Adjust Dev Manual to modularized structure
---
 opennlp-docs/src/docbkx/opennlp.xml           |   1 +
 opennlp-docs/src/docbkx/project-structure.xml | 313 ++++++++++++++++++++++++++
 2 files changed, 314 insertions(+)

diff --git a/opennlp-docs/src/docbkx/opennlp.xml b/opennlp-docs/src/docbkx/opennlp.xml
index badff447..fea7437d 100644
--- a/opennlp-docs/src/docbkx/opennlp.xml
+++ b/opennlp-docs/src/docbkx/opennlp.xml
@@ -97,6 +97,7 @@ under the License.
     <title>Apache OpenNLP Developer Documentation</title>
     <toc/>
     <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./introduction.xml"/>
+    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./project-structure.xml"/>
     <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./langdetect.xml" />
     <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./sentdetect.xml"/>
     <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./tokenizer.xml" />
diff --git a/opennlp-docs/src/docbkx/project-structure.xml b/opennlp-docs/src/docbkx/project-structure.xml
new file mode 100644
index 00000000..40394382
--- /dev/null
+++ b/opennlp-docs/src/docbkx/project-structure.xml
@@ -0,0 +1,313 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V5.0//EN"
+"https://cdn.docbook.org/schema/5.0/dtd/docbook.dtd"[
+]>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+<chapter xml:id="tools.project.structure" xmlns:xlink="http://www.w3.org/1999/xlink">
+<title>Project Structure</title>
+
+    <section xml:id="tools.project.structure.overview">
+        <title>Overview</title>
+        <para>
+            Starting with version 3.0, Apache OpenNLP has been reorganized from a single monolithic
+            <code>opennlp-tools</code> artifact into a set of fine-grained modules. This modularization
+            allows users to depend only on the components they actually need, resulting in a smaller
+            dependency footprint. At the same time, the public API remains stable and fully compatible
+            with previous 2.x releases.
+        </para>
+        <para>
+            The following sections describe each module, its purpose, and when to include it as a dependency.
+        </para>
+    </section>
+
+    <section xml:id="tools.project.structure.api">
+        <title>API Module</title>
+        <para>
+            The <code>opennlp-api</code> module defines the public interfaces and abstractions
+            that form the contract between OpenNLP and its users. It contains the core interfaces
+            such as <code>Tokenizer</code>, <code>SentenceDetector</code>, <code>POSTagger</code>,
+            <code>TokenNameFinder</code>, <code>Chunker</code>, <code>Parser</code>,
+            <code>LanguageDetector</code>, <code>Lemmatizer</code>, and <code>DocumentCategorizer</code>.
+        </para>
+        <para>
+            This module also provides shared base classes such as <code>BaseModel</code>,
+            the <code>ObjectStream</code> abstraction for data processing, the command-line
+            argument parsing framework, and common utility types. It is a transitive dependency
+            of <code>opennlp-runtime</code> and typically does not need to be declared explicitly.
+        </para>
+
+        <programlisting language="xml">
+<![CDATA[<dependency>
+    <groupId>org.apache.opennlp</groupId>
+    <artifactId>opennlp-api</artifactId>
+    <version>CURRENT_OPENNLP_VERSION</version>
+</dependency>]]>
+        </programlisting>
+    </section>
+
+    <section xml:id="tools.project.structure.runtime">
+        <title>Runtime Module</title>
+        <para>
+            The <code>opennlp-runtime</code> module is the primary dependency for most users. It
+            contains the core NLP tool implementations including sentence detection, tokenization,
+            part-of-speech tagging, named entity recognition, chunking, parsing, language detection,
+            lemmatization, and document categorization.
+        </para>
+        <para>
+            By default, <code>opennlp-runtime</code> ships with the Maximum Entropy machine
+            learning implementation. If you need other ML algorithms, add the corresponding
+            ML module as described below.
+        </para>
+
+        <programlisting language="xml">
+<![CDATA[<dependency>
+    <groupId>org.apache.opennlp</groupId>
+    <artifactId>opennlp-runtime</artifactId>
+    <version>CURRENT_OPENNLP_VERSION</version>
+</dependency>]]>
+        </programlisting>
+    </section>
+
+    <section xml:id="tools.project.structure.ml">
+        <title>Machine Learning Modules</title>
+        <para>
+            The machine learning implementations have been separated into individual modules so that
+            applications can include only the algorithms they use. Each module provides a specific
+            ML algorithm and is loaded at runtime via the <code>ExtensionLoader</code> service
+            discovery mechanism.
+        </para>
+
+        <itemizedlist>
+            <listitem>
+                <para>
+                    <code>opennlp-ml-commons</code> — Shared ML utilities and base classes used
+                    by all ML algorithm modules. This is a transitive dependency of each ML module
+                    and does not need to be declared explicitly.
+                </para>
+            </listitem>
+            <listitem>
+                <para>
+                    <code>opennlp-ml-maxent</code> — Maximum Entropy classifier. This is the default
+                    algorithm and is included transitively via <code>opennlp-runtime</code>.
+                </para>
+            </listitem>
+            <listitem>
+                <para>
+                    <code>opennlp-ml-perceptron</code> — Perceptron-based learning algorithm.
+                    Add this dependency if your models use the Perceptron or Perceptron Sequence trainer.
+                </para>
+            </listitem>
+            <listitem>
+                <para>
+                    <code>opennlp-ml-bayes</code> — Naive Bayes classifier.
+                    Add this dependency if your models use the Naive Bayes trainer.
+                </para>
+            </listitem>
+        </itemizedlist>
+
+        <para>
+            For example, to use the Perceptron trainer alongside the default Maximum Entropy, add:
+        </para>
+
+        <programlisting language="xml">
+<![CDATA[<dependency>
+    <groupId>org.apache.opennlp</groupId>
+    <artifactId>opennlp-ml-perceptron</artifactId>
+    <version>CURRENT_OPENNLP_VERSION</version>
+</dependency>]]>
+        </programlisting>
+    </section>
+
+    <section xml:id="tools.project.structure.models">
+        <title>Models Module</title>
+        <para>
+            The <code>opennlp-models</code> module provides classpath-based model discovery and
+            loading. It enables applications to bundle pre-trained OpenNLP models as JAR files and
+            load them at runtime without explicit file path references.
+            See <xref linkend="tools.model"/> for details on classpath model loading.
+        </para>
+
+        <programlisting language="xml">
+<![CDATA[<dependency>
+    <groupId>org.apache.opennlp</groupId>
+    <artifactId>opennlp-models</artifactId>
+    <version>CURRENT_OPENNLP_VERSION</version>
+</dependency>]]>
+        </programlisting>
+    </section>
+
+    <section xml:id="tools.project.structure.formats">
+        <title>Formats Module</title>
+        <para>
+            The <code>opennlp-formats</code> module supports reading and writing various NLP
+            training and evaluation data formats, including CoNLL, BioNLP, BRAT, AD (Floresta),
+            Leipzig, and others. Include this module if you need to train models from data in
+            non-native OpenNLP formats.
+        </para>
+
+        <programlisting language="xml">
+<![CDATA[<dependency>
+    <groupId>org.apache.opennlp</groupId>
+    <artifactId>opennlp-formats</artifactId>
+    <version>CURRENT_OPENNLP_VERSION</version>
+</dependency>]]>
+        </programlisting>
+    </section>
+
+    <section xml:id="tools.project.structure.dl">
+        <title>Deep Learning Modules</title>
+        <para>
+            OpenNLP provides optional support for ONNX-based neural models via two modules:
+        </para>
+
+        <itemizedlist>
+            <listitem>
+                <para>
+                    <code>opennlp-dl</code> — Integrates the ONNX Runtime for CPU-based inference.
+                    This module enables the use of models trained by external frameworks such as
+                    PyTorch or TensorFlow, exported in the ONNX format.
+                </para>
+            </listitem>
+            <listitem>
+                <para>
+                    <code>opennlp-dl-gpu</code> — Replaces the CPU ONNX Runtime with the
+                    GPU-accelerated variant for systems with supported GPU hardware.
+                    Use this module instead of <code>opennlp-dl</code> when GPU acceleration
+                    is available and desired.
+                </para>
+            </listitem>
+        </itemizedlist>
+
+        <programlisting language="xml">
+<![CDATA[<!-- CPU variant -->
+<dependency>
+    <groupId>org.apache.opennlp</groupId>
+    <artifactId>opennlp-dl</artifactId>
+    <version>CURRENT_OPENNLP_VERSION</version>
+</dependency>
+
+<!-- OR GPU variant (do not include both) -->
+<dependency>
+    <groupId>org.apache.opennlp</groupId>
+    <artifactId>opennlp-dl-gpu</artifactId>
+    <version>CURRENT_OPENNLP_VERSION</version>
+</dependency>]]>
+        </programlisting>
+    </section>
+
+    <section xml:id="tools.project.structure.cli">
+        <title>CLI Module</title>
+        <para>
+            The <code>opennlp-cli</code> module provides the command-line tools for training,
+            evaluating, and running OpenNLP models from a terminal. It is included in the binary
+            distribution and not typically needed as a library dependency.
+            See <xref linkend="tools.cli"/> for details on available CLI commands.
+        </para>
+    </section>
+
+    <section xml:id="tools.project.structure.tools">
+        <title>Tools Module (Aggregated Jar)</title>
+        <para>
+            The <code>opennlp-tools</code> module is an aggregated artifact that bundles
+            all core modules (<code>opennlp-api</code>, <code>opennlp-runtime</code>,
+            all ML modules, <code>opennlp-models</code>, <code>opennlp-formats</code>,
+            and <code>opennlp-cli</code>) into a single JAR. It is provided for backwards
+            compatibility with 2.x and for the binary distribution.
+        </para>
+        <para>
+            For new projects, we recommend depending on <code>opennlp-runtime</code>
+            plus only the specific additional modules you need, rather than pulling in
+            the full <code>opennlp-tools</code> artifact.
+        </para>
+    </section>
+
+    <section xml:id="tools.project.structure.extensions">
+        <title>Extension Modules</title>
+        <para>
+            OpenNLP provides optional extension modules for integration with external frameworks:
+        </para>
+
+        <itemizedlist>
+            <listitem>
+                <para>
+                    <code>opennlp-morfologik</code> — Integrates the
+                    <link xlink:href="https://github.com/morfologik">Morfologik</link>
+                    library for dictionary-based stemming and lemmatization.
+                    See <xref linkend="tools.morfologik"/> for usage details.
+                </para>
+            </listitem>
+            <listitem>
+                <para>
+                    <code>opennlp-uima</code> — Provides a set of
+                    <link xlink:href="https://uima.apache.org">Apache UIMA</link>
+                    annotators that wrap OpenNLP components for use in UIMA pipelines.
+                    See <xref linkend="tools.uima"/> for integration details.
+                </para>
+            </listitem>
+        </itemizedlist>
+    </section>
+
+    <section xml:id="tools.project.structure.migration">
+        <title>Migrating from 2.x to 3.x</title>
+        <para>
+            The 3.x release introduces no known breaking API changes. Existing code using the
+            <code>opennlp-tools</code> artifact will continue to work without modification.
+            However, we strongly recommend migrating to the modular dependency structure for a
+            smaller footprint.
+        </para>
+        <para>
+            A minimal migration replaces:
+        </para>
+
+        <programlisting language="xml">
+<![CDATA[<!-- 2.x: single monolithic dependency -->
+<dependency>
+    <groupId>org.apache.opennlp</groupId>
+    <artifactId>opennlp-tools</artifactId>
+    <version>2.x.y</version>
+</dependency>]]>
+        </programlisting>
+
+        <para>
+            with:
+        </para>
+
+        <programlisting language="xml">
+<![CDATA[<!-- 3.x: modular dependencies — add only what you need -->
+<dependency>
+    <groupId>org.apache.opennlp</groupId>
+    <artifactId>opennlp-runtime</artifactId>
+    <version>CURRENT_OPENNLP_VERSION</version>
+</dependency>
+<!-- Add opennlp-models, opennlp-ml-perceptron, opennlp-dl, etc. as needed -->]]>
+        </programlisting>
+
+        <note>
+            <para>
+                The <code>opennlp-runtime</code> module includes the Maximum Entropy ML
+                implementation by default. If your models were trained with the Perceptron
+                or Naive Bayes algorithm, add the corresponding <code>opennlp-ml-perceptron</code>
+                or <code>opennlp-ml-bayes</code> dependency.
+            </para>
+        </note>
+    </section>
+
+</chapter>
