This is an automated email from the ASF dual-hosted git repository.

tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika-docker.git


The following commit(s) were added to refs/heads/main by this push:
     new d5e9c4e  update configs for 4.x
d5e9c4e is described below

commit d5e9c4ec3a6c4c6ee839770bf5b36f83a4ffa574
Author: tallison <[email protected]>
AuthorDate: Mon May 11 16:37:44 2026 -0400

    update configs for 4.x
---
 README.md                                          | 30 +++++-----
 docker-compose-tika-customocr.yml                  | 15 +++--
 docker-compose-tika-grobid.yml                     | 11 +++-
 docker-compose-tika-ner.yml                        | 30 ----------
 docker-compose-tika-vision.yml                     | 64 ++++++++++++----------
 sample-configs/customocr/tika-config-inline.json   | 11 ++++
 sample-configs/customocr/tika-config-inline.xml    | 31 -----------
 sample-configs/customocr/tika-config-rendered.json | 16 ++++++
 sample-configs/customocr/tika-config-rendered.xml  | 38 -------------
 sample-configs/grobid/tika-config.json             | 10 ++++
 sample-configs/grobid/tika-config.xml              | 24 --------
 sample-configs/ner/run_tika_server.sh              | 62 ---------------------
 sample-configs/ner/tika-config.xml                 | 28 ----------
 sample-configs/vision/inception-rest-caption.xml   | 32 -----------
 sample-configs/vision/inception-rest-video.xml     | 32 -----------
 sample-configs/vision/inception-rest.xml           | 32 -----------
 sample-configs/vision/vlm-claude.json              | 18 ++++++
 sample-configs/vision/vlm-gemini.json              | 17 ++++++
 sample-configs/vision/vlm-openai.json              | 19 +++++++
 19 files changed, 162 insertions(+), 358 deletions(-)

diff --git a/README.md b/README.md
index 59b6c6f..05b874a 100644
--- a/README.md
+++ b/README.md
@@ -152,22 +152,27 @@ From version 1.25 and 1.25-full of the image it is now easier to override the de
 So for example if you wish to disable the OCR parser in the full image you could write a custom configuration:
 
 ```
-cat <<EOT >> tika-config.xml
-<?xml version="1.0" encoding="UTF-8"?>
-<properties>
-  <parsers>
-      <parser class="org.apache.tika.parser.DefaultParser">
-          <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
-      </parser>
-  </parsers>
-</properties>
+cat <<EOT >> tika-config.json
+{
+  "parsers": [
+    { "default-parser": {} },
+    { "tesseract-ocr-parser": { "skipOcr": true } }
+  ]
+}
 EOT
 ```
 Then by mounting this custom configuration as a volume, you could pass the command line parameter to load it
 
-    docker run -d -p 127.0.0.1:9998:9998 -v `pwd`/tika-config.xml:/tika-config.xml apache/tika:2.5.0-full --config /tika-config.xml
+    docker run -d -p 127.0.0.1:9998:9998 -v `pwd`/tika-config.json:/tika-config.json apache/tika:<tag>-full -c /tika-config.json
 
-You can see more configuration examples [here](https://tika.apache.org/2.5.0/configuring.html).
+NOTE: Tika 4.x replaced the XML `tika-config.xml` format with JSON
+`tika-config.json` (see TIKA-4544). The XML form above is what 2.x / 3.x
+images expect; if you're pinned to those tags, keep using the XML.
+
+You can see more configuration examples on the
+[Tika website](https://tika.apache.org/) and in the canonical samples under
+`tika-server/tika-server-core/src/test/resources/config-examples/` in the
+source tree.
 
 As of 2.5.0.2, if you'd like to add extra jars from your local `my-jars` directory to Tika's classpath, mount to `/tika-extras` like so:
 
@@ -182,10 +187,9 @@ There are a number of sample Docker Compose files included in the repos to allow
 
 These files use docker-compose 3.x series and include:
 
-* docker-compose-tika-vision.yml - TensorFlow Inception REST API Vision examples
+* docker-compose-tika-vision.yml - Vision-Language Model parsing example (OpenAI-compatible / Claude / Gemini)
 * docker-compose-tika-grobid.yml - Grobid REST parsing example
 * docker-compose-tika-customocr.yml - Tesseract OCR example with custom configuration
-* docker-compose-tika-ner.yml - Named Entity Recognition example
 
 The Docker Compose files and configurations (sourced from _sample-configs_ directory) all have comments in them so you can try different options, or use them as a base to create your own custom configuration.
 
diff --git a/docker-compose-tika-customocr.yml b/docker-compose-tika-customocr.yml
index 7428c2d..29cf667 100644
--- a/docker-compose-tika-customocr.yml
+++ b/docker-compose-tika-customocr.yml
@@ -19,16 +19,21 @@ services:
   ## Apache Tika Server 
   tika:
     image: apache/tika:${TAG}-full
-    # Override default so we can add configuration on classpath
-    entrypoint: [ "/bin/sh", "-c", "exec java -cp \"/customocr:/tika-server-standard-$${TIKA_VERSION}.jar:/tika-extras/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $$0 $$@"]
+    # Override default so we can add the /customocr dir on the classpath
+    # (for the bundled TesseractOCRConfig.properties). The 4.x image layout
+    # places the thin server jar at /opt/tika-server/tika-server.jar and its
+    # deps at /opt/tika-server/lib/*. working_dir=/opt/tika-server matters for
+    # tika-server's plugin-roots fallback (see TikaServerProcess#resolveDefaultPluginsDir).
+    entrypoint: [ "/bin/sh", "-c", "exec java -cp \"/customocr:/opt/tika-server/tika-server.jar:/opt/tika-server/lib/*:/tika-extras/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $$0 $$@"]
+    working_dir: /opt/tika-server
     # Kept command as example but could be added to entrypoint too
-    command: -c /tika-config.xml
+    command: -c /tika-config.json
     restart: on-failure
     ports:
       - "9998:9998"
     volumes:
       # Choose the configuration you want, or add your own custom one
-      # -  ./sample-configs/customocr/tika-config-inline.xml:/tika-config.xml
-      -  ./sample-configs/customocr/tika-config-rendered.xml:/tika-config.xml
+      # -  ./sample-configs/customocr/tika-config-inline.json:/tika-config.json
+      -  ./sample-configs/customocr/tika-config-rendered.json:/tika-config.json
 
    
diff --git a/docker-compose-tika-grobid.yml b/docker-compose-tika-grobid.yml
index 4c056ae..add5d27 100644
--- a/docker-compose-tika-grobid.yml
+++ b/docker-compose-tika-grobid.yml
@@ -19,10 +19,15 @@ services:
   ## Apache Tika Server 
   tika:
     image: apache/tika:${TAG}-full
-    # Override default so we can add configuration on classpath
-    entrypoint: [ "/bin/sh", "-c", "exec java -cp \"/grobid:/tika-server-standard-$${TIKA_VERSION}.jar:/tika-extras/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $$0 $$@"]
+    # Override default so we can add the /grobid dir on the classpath
+    # (for the bundled GrobidExtractor.properties). The 4.x image layout
+    # places the thin server jar at /opt/tika-server/tika-server.jar and its
+    # deps at /opt/tika-server/lib/*. working_dir=/opt/tika-server matters for
+    # tika-server's plugin-roots fallback.
+    entrypoint: [ "/bin/sh", "-c", "exec java -cp \"/grobid:/opt/tika-server/tika-server.jar:/opt/tika-server/lib/*:/tika-extras/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $$0 $$@"]
+    working_dir: /opt/tika-server
     # Kept command as example but could be added to entrypoint too
-    command: -c /grobid/tika-config.xml
+    command: -c /grobid/tika-config.json
     restart: on-failure
     ports:
       - "9998:9998"
diff --git a/docker-compose-tika-ner.yml b/docker-compose-tika-ner.yml
deleted file mode 100644
index 50e896a..0000000
--- a/docker-compose-tika-ner.yml
+++ /dev/null
@@ -1,30 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#    http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-version: "3.8"
-services:
- 
-  ## Apache Tika Server 
-  tika:
-    image: apache/tika:${TAG}-full
-    # Use custom script as entrypoint to go fetch models and setup recognisers
-    entrypoint: [ "/ner/run_tika_server.sh"]
-    restart: on-failure
-    ports:
-      - "9998:9998"
-    volumes:
-      -  ./sample-configs/ner/:/ner/
-    environment:
-      - TAG
\ No newline at end of file
diff --git a/docker-compose-tika-vision.yml b/docker-compose-tika-vision.yml
index 9e054ec..da01d03 100644
--- a/docker-compose-tika-vision.yml
+++ b/docker-compose-tika-vision.yml
@@ -13,42 +13,50 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-version: "3.8"
+# Vision-Language Model parsing for tika-server (Tika 4.x).
+#
+# The pre-4.x inception-rest / Im2txt / inception-video services and the
+# org.apache.tika.parser.recognition.ObjectRecognitionParser they served
+# have been removed (TIKA-4499 / TIKA-4500). The 4.x replacement is a
+# family of VLM parsers (OpenAI-compatible, Anthropic Claude, Google
+# Gemini). See:
+#
+#   docs/modules/ROOT/pages/configuration/parsers/vlm-parsers.adoc
+#
+# This compose demonstrates the OpenAI-compatible variant pointing at a
+# locally-hosted Ollama instance. To use a different VLM:
+#   - Swap the mounted tika-config.* for vlm-claude.json or vlm-gemini.json
+#     and pass the relevant API key via env (ANTHROPIC_API_KEY /
+#     GEMINI_API_KEY).
+#   - Drop the vlm-server service block below.
+
 services:
- 
-  ## Apache Tika Server 
+
+  ## Apache Tika Server
   tika:
     image: apache/tika:latest-full
-    command: -c /tika-config.xml
+    command: -c /tika-config.json
     restart: on-failure
     ports:
       - "9998:9998"
-
     volumes:
-      # Replace the below with the configuration you want to use, or with your own custom one
-      # -  ./sample-configs/vision/inception-rest.xml:/tika-config.xml
-      # -  ./sample-configs/vision/inception-rest-video.xml:/tika-config.xml
-      -  ./sample-configs/vision/inception-rest-caption.xml:/tika-config.xml
-   
+      - ./sample-configs/vision/vlm-openai.json:/tika-config.json
+      # - ./sample-configs/vision/vlm-claude.json:/tika-config.json
+      # - ./sample-configs/vision/vlm-gemini.json:/tika-config.json
     depends_on:
-      # You can comment out any you don't need here and in the Vision Service section below
-      - inception-rest
-      - inception-caption
-      - inception-video
-
-  ## Vision Services 
-  inception-rest:
-    build: https://raw.githubusercontent.com/dameikle/tika-dockers/patch-1/InceptionRestDockerfile
-    ports:
-      - "8764:8764"
-
-  inception-caption:
-    build: https://raw.githubusercontent.com/dameikle/tika-dockers/patch-1/Im2txtRestDockerfile
-    ports:
-      - "8765:8764"
+      - vlm-server
 
-  inception-video:
-    build: https://raw.githubusercontent.com/dameikle/tika-dockers/patch-1/InceptionVideoRestDockerfile
+  ## Local OpenAI-compatible VLM endpoint.
+  ## Replace with vLLM, your own FastAPI wrapper, or remove and point
+  ## baseUrl in vlm-openai.json at OpenAI's real API.
+  vlm-server:
+    image: ollama/ollama:latest
     ports:
-      - "8766:8764"
+      - "8000:11434"
+    # Volumes for pulled models. Uncomment and pull a vision-capable model
+    # (e.g. `docker exec <container> ollama pull llava`) before first use.
+    # volumes:
+    #   - ollama-models:/root/.ollama
 
+# volumes:
+#   ollama-models:
diff --git a/sample-configs/customocr/tika-config-inline.json b/sample-configs/customocr/tika-config-inline.json
new file mode 100644
index 0000000..055e72c
--- /dev/null
+++ b/sample-configs/customocr/tika-config-inline.json
@@ -0,0 +1,11 @@
+{
+  "_comment": "Extract inline images from PDF and OCR them with Tesseract.",
+  "parsers": [
+    { "tesseract-ocr-parser": {} },
+    {
+      "pdf-parser": {
+        "extractInlineImages": true
+      }
+    }
+  ]
+}
diff --git a/sample-configs/customocr/tika-config-inline.xml b/sample-configs/customocr/tika-config-inline.xml
deleted file mode 100644
index 1c9b613..0000000
--- a/sample-configs/customocr/tika-config-inline.xml
+++ /dev/null
@@ -1,31 +0,0 @@
-<?xml version="1.0" encoding="UTF-8" standalone="no"?>
-<!--
-  ~ Licensed to the Apache Software Foundation (ASF) under one or more
-  ~ contributor license agreements.  See the NOTICE file distributed with
-  ~ this work for additional information regarding copyright ownership.
-  ~ The ASF licenses this file to You under the Apache License, Version 2.0
-  ~ (the "License"); you may not use this file except in compliance with
-  ~ the License.  You may obtain a copy of the License at
-  ~
-  ~    http://www.apache.org/licenses/LICENSE-2.0
-  ~
-  ~ Unless required by applicable law or agreed to in writing, software
-  ~ distributed under the License is distributed on an "AS IS" BASIS,
-  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  ~ See the License for the specific language governing permissions and
-  ~ limitations under the License.
-  -->
-<properties>
-  <parsers>     
-        <!-- Load TesseractOCRParser (could use DefaultParser if you want others too) -->
-        <parser class="org.apache.tika.parser.ocr.TesseractOCRParser"/>   
-
-        <!-- Extract and OCR Inline Images in PDF -->
-        <parser class="org.apache.tika.parser.pdf.PDFParser">
-            <params>
-                <param name="extractInlineImages" type="bool">true</param>
-            </params>
-        </parser>
-        
-  </parsers>
-</properties>
diff --git a/sample-configs/customocr/tika-config-rendered.json b/sample-configs/customocr/tika-config-rendered.json
new file mode 100644
index 0000000..45f3d3b
--- /dev/null
+++ b/sample-configs/customocr/tika-config-rendered.json
@@ -0,0 +1,16 @@
+{
+  "_comment": [
+    "Render each PDF page as an image and run Tesseract on it.",
+    "ocrStrategy options: no_ocr, ocr_only, ocr_and_text, auto."
+  ],
+  "parsers": [
+    { "tesseract-ocr-parser": {} },
+    {
+      "pdf-parser": {
+        "ocrStrategy": "ocr_only",
+        "ocrImageType": "rgb",
+        "ocrDPI": 100
+      }
+    }
+  ]
+}
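Reviewer sketch (not part of the commit): every JSON sample in this push follows the same outline, a top-level `parsers` array whose entries are single-key objects mapping a parser name to its settings. The Python below checks only that outline. It is inferred from the samples here, not from an official Tika schema, and `check_config_shape` is a hypothetical helper name.

```python
import json


def check_config_shape(text):
    """Return a list of problems found in a tika-config.json-style document.

    The expected outline is inferred from the sample files in this commit,
    not from an official Tika schema.
    """
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    if not isinstance(doc, dict):
        return ["top level is not an object"]
    parsers = doc.get("parsers")
    if not isinstance(parsers, list):
        return ["'parsers' is missing or not an array"]

    problems = []
    for i, entry in enumerate(parsers):
        # Each entry in the samples is {"<parser-name>": {<settings>}}.
        if not isinstance(entry, dict) or len(entry) != 1:
            problems.append(f"parsers[{i}] is not a single-key object")
            continue
        name, settings = next(iter(entry.items()))
        if not isinstance(settings, dict):
            problems.append(f"parsers[{i}] ({name}) settings are not an object")
    return problems


# Mirrors the parsers section of tika-config-rendered.json above.
SAMPLE = """
{
  "parsers": [
    { "tesseract-ocr-parser": {} },
    { "pdf-parser": { "ocrStrategy": "ocr_only", "ocrImageType": "rgb", "ocrDPI": 100 } }
  ]
}
"""

if __name__ == "__main__":
    print(check_config_shape(SAMPLE) or "shape looks consistent with the samples")
```

A check like this catches JSON syntax slips before the file is mounted into the container. It does not confirm that a parser name or setting is one Tika recognises.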
diff --git a/sample-configs/customocr/tika-config-rendered.xml b/sample-configs/customocr/tika-config-rendered.xml
deleted file mode 100644
index bcd8666..0000000
--- a/sample-configs/customocr/tika-config-rendered.xml
+++ /dev/null
@@ -1,38 +0,0 @@
-<?xml version="1.0" encoding="UTF-8" standalone="no"?>
-<!--
-  ~ Licensed to the Apache Software Foundation (ASF) under one or more
-  ~ contributor license agreements.  See the NOTICE file distributed with
-  ~ this work for additional information regarding copyright ownership.
-  ~ The ASF licenses this file to You under the Apache License, Version 2.0
-  ~ (the "License"); you may not use this file except in compliance with
-  ~ the License.  You may obtain a copy of the License at
-  ~
-  ~    http://www.apache.org/licenses/LICENSE-2.0
-  ~
-  ~ Unless required by applicable law or agreed to in writing, software
-  ~ distributed under the License is distributed on an "AS IS" BASIS,
-  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  ~ See the License for the specific language governing permissions and
-  ~ limitations under the License.
-  -->
-<properties>
-  <parsers>     
-        <!-- Load TesseractOCRParser (could use DefaultParser if you want others too) -->
-        <parser class="org.apache.tika.parser.ocr.TesseractOCRParser"/>   
-
-        <!-- OCR on Rendered Pages -->
-        <parser class="org.apache.tika.parser.pdf.PDFParser">
-            <params>
-                <!-- no_ocr - extract text only
-                     ocr_only - don't extract text and just attempt OCR
-                     ocr_and_text - extract text and attempt OCR (from Tika 1.24)
-                     auto - extract text but if < 10 characters try OCR
-                -->
-                <param name="ocrStrategy" type="string">ocr_only</param>
-                <param name="ocrImageType" type="string">rgb</param>
-                <param name="ocrDPI" type="int">100</param>
-            </params>
-        </parser>
-
-  </parsers>
-</properties>
diff --git a/sample-configs/grobid/tika-config.json b/sample-configs/grobid/tika-config.json
new file mode 100644
index 0000000..943ec19
--- /dev/null
+++ b/sample-configs/grobid/tika-config.json
@@ -0,0 +1,10 @@
+{
+  "_comment": "Route PDFs through GROBID (via JournalParser) for journal-article extraction.",
+  "parsers": [
+    {
+      "journal-parser": {
+        "_mime-include": ["application/pdf"]
+      }
+    }
+  ]
+}
diff --git a/sample-configs/grobid/tika-config.xml b/sample-configs/grobid/tika-config.xml
deleted file mode 100644
index 5b4aad9..0000000
--- a/sample-configs/grobid/tika-config.xml
+++ /dev/null
@@ -1,24 +0,0 @@
-<?xml version="1.0" encoding="UTF-8" standalone="no"?>
-<!--
-  ~ Licensed to the Apache Software Foundation (ASF) under one or more
-  ~ contributor license agreements.  See the NOTICE file distributed with
-  ~ this work for additional information regarding copyright ownership.
-  ~ The ASF licenses this file to You under the Apache License, Version 2.0
-  ~ (the "License"); you may not use this file except in compliance with
-  ~ the License.  You may obtain a copy of the License at
-  ~
-  ~    http://www.apache.org/licenses/LICENSE-2.0
-  ~
-  ~ Unless required by applicable law or agreed to in writing, software
-  ~ distributed under the License is distributed on an "AS IS" BASIS,
-  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  ~ See the License for the specific language governing permissions and
-  ~ limitations under the License.
-  -->
-<properties>
-  <parsers>
-    <parser class="org.apache.tika.parser.journal.JournalParser">
-      <mime>application/pdf</mime>
-    </parser>
-  </parsers>
-</properties>
diff --git a/sample-configs/ner/run_tika_server.sh b/sample-configs/ner/run_tika_server.sh
deleted file mode 100755
index fb447be..0000000
--- a/sample-configs/ner/run_tika_server.sh
+++ /dev/null
@@ -1,62 +0,0 @@
-#!/bin/bash
-
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-#############################################################################
-# See https://cwiki.apache.org/confluence/display/TIKA/TikaAndNER for details
-# on how to configure additional NER libraries
-#############################################################################
-
-# ------------------------------------
-# Download OpenNLP Models to classpath
-# ------------------------------------
-
-OPENNLP_LOCATION="/ner/org/apache/tika/parser/ner/opennlp"
-URL="http://opennlp.sourceforge.net/models-1.5"
-
-mkdir -p $OPENNLP_LOCATION
-if [ "$(ls -A $OPENNLP_LOCATION/*.bin)" ]; then
-    echo "OpenNLP models directory has files, so skipping fetch";
-else
-       echo "No OpenNLP models found, so fetching them"
-       wget "$URL/en-ner-person.bin" -O $OPENNLP_LOCATION/ner-person.bin
-       wget "$URL/en-ner-location.bin" -O $OPENNLP_LOCATION/ner-location.bin
-       wget "$URL/en-ner-organization.bin" -O $OPENNLP_LOCATION/ner-organization.bin;
-       wget "$URL/en-ner-date.bin" -O $OPENNLP_LOCATION/ner-date.bin
-       wget "$URL/en-ner-time.bin" -O $OPENNLP_LOCATION/ner-time.bin
-       wget "$URL/en-ner-percentage.bin" -O $OPENNLP_LOCATION/ner-percentage.bin
-       wget "$URL/en-ner-money.bin" -O $OPENNLP_LOCATION/ner-money.bin
-fi
-
-# --------------------------------------------
-# Create RexExp Example for Email on classpath
-# --------------------------------------------
-REGEXP_LOCATION="/ner/org/apache/tika/parser/ner/regex"
-mkdir -p $REGEXP_LOCATION
-echo "EMAIL=(?:[a-z0-9!#$%&'*+/=?^_\`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_\`{|}~-]+)*|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])" > $REGEXP_LOCATION/ner-regex.txt
-
-
-# -------------------
-# Now run Tika Server
-# -------------------
-
-# Can be a single implementation or comma seperated list for multiple for "ner.impl.class" property
-RECOGNISERS=org.apache.tika.parser.ner.opennlp.OpenNLPNERecogniser,org.apache.tika.parser.ner.regex.RegexNERecogniser
-# Set classpath to the Tika Server JAR and the /ner folder so it has the configuration and models from above
-CLASSPATH="/ner:/tika-server-standard-${TIKA_VERSION}.jar:/tika-extras/*"
-# Run the server with the custom configuration ner.impl.class property and custom /ner/tika-config.xml
-exec java -Dner.impl.class=$RECOGNISERS -cp $CLASSPATH org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 -c /ner/tika-config.xml
\ No newline at end of file
diff --git a/sample-configs/ner/tika-config.xml b/sample-configs/ner/tika-config.xml
deleted file mode 100644
index 65d5774..0000000
--- a/sample-configs/ner/tika-config.xml
+++ /dev/null
@@ -1,28 +0,0 @@
-<?xml version="1.0" encoding="UTF-8" standalone="no"?>
-<!--
-  ~ Licensed to the Apache Software Foundation (ASF) under one or more
-  ~ contributor license agreements.  See the NOTICE file distributed with
-  ~ this work for additional information regarding copyright ownership.
-  ~ The ASF licenses this file to You under the Apache License, Version 2.0
-  ~ (the "License"); you may not use this file except in compliance with
-  ~ the License.  You may obtain a copy of the License at
-  ~
-  ~    http://www.apache.org/licenses/LICENSE-2.0
-  ~
-  ~ Unless required by applicable law or agreed to in writing, software
-  ~ distributed under the License is distributed on an "AS IS" BASIS,
-  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  ~ See the License for the specific language governing permissions and
-  ~ limitations under the License.
-  -->
-<properties>
-    <parsers>
-        <parser class="org.apache.tika.parser.ner.NamedEntityParser">
-            <mime>application/pdf</mime>
-            <mime>text/plain</mime>
-            <mime>text/html</mime>
-            <mime>application/xhtml+xml</mime>
-        </parser>
-    </parsers>
-</properties>
-
diff --git a/sample-configs/vision/inception-rest-caption.xml b/sample-configs/vision/inception-rest-caption.xml
deleted file mode 100644
index c70c207..0000000
--- a/sample-configs/vision/inception-rest-caption.xml
+++ /dev/null
@@ -1,32 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-  ~ Licensed to the Apache Software Foundation (ASF) under one or more
-  ~ contributor license agreements.  See the NOTICE file distributed with
-  ~ this work for additional information regarding copyright ownership.
-  ~ The ASF licenses this file to You under the Apache License, Version 2.0
-  ~ (the "License"); you may not use this file except in compliance with
-  ~ the License.  You may obtain a copy of the License at
-  ~
-  ~    http://www.apache.org/licenses/LICENSE-2.0
-  ~
-  ~ Unless required by applicable law or agreed to in writing, software
-  ~ distributed under the License is distributed on an "AS IS" BASIS,
-  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  ~ See the License for the specific language governing permissions and
-  ~ limitations under the License.
-  -->
-<properties>
-    <parsers>
-        <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
-            <mime>image/jpeg</mime>
-            <mime>image/png</mime>
-            <mime>image/gif</mime>
-            <params>
-                <param name="apiBaseUri" type="uri">http://inception-caption:8764/inception/v3</param>
-                <param name="captions" type="int">5</param>
-                <param name="maxCaptionLength" type="int">15</param>
-                <param name="class" type="string">org.apache.tika.parser.captioning.tf.TensorflowRESTCaptioner</param>
-            </params>
-        </parser>
-    </parsers>
-</properties>
\ No newline at end of file
diff --git a/sample-configs/vision/inception-rest-video.xml b/sample-configs/vision/inception-rest-video.xml
deleted file mode 100644
index f6a4e6a..0000000
--- a/sample-configs/vision/inception-rest-video.xml
+++ /dev/null
@@ -1,32 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-  ~ Licensed to the Apache Software Foundation (ASF) under one or more
-  ~ contributor license agreements.  See the NOTICE file distributed with
-  ~ this work for additional information regarding copyright ownership.
-  ~ The ASF licenses this file to You under the Apache License, Version 2.0
-  ~ (the "License"); you may not use this file except in compliance with
-  ~ the License.  You may obtain a copy of the License at
-  ~
-  ~    http://www.apache.org/licenses/LICENSE-2.0
-  ~
-  ~ Unless required by applicable law or agreed to in writing, software
-  ~ distributed under the License is distributed on an "AS IS" BASIS,
-  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  ~ See the License for the specific language governing permissions and
-  ~ limitations under the License.
-  -->
-<properties>
-    <parsers>
-        <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
-            <mime>video/mp4</mime>
-            <mime>video/quicktime</mime>
-            <params>
-                <param name="apiBaseUri" type="uri">http://inception-video:8764/inception/v4</param>
-                <param name="topN" type="int">4</param>
-                <param name="minConfidence" type="double">0.015</param>
-                <param name="mode" type="string">fixed</param>
-                <param name="class" type="string">org.apache.tika.parser.recognition.tf.TensorflowRESTVideoRecogniser</param>
-            </params>
-        </parser>
-    </parsers>
-</properties>
\ No newline at end of file
diff --git a/sample-configs/vision/inception-rest.xml b/sample-configs/vision/inception-rest.xml
deleted file mode 100644
index caa6468..0000000
--- a/sample-configs/vision/inception-rest.xml
+++ /dev/null
@@ -1,32 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-  ~ Licensed to the Apache Software Foundation (ASF) under one or more
-  ~ contributor license agreements.  See the NOTICE file distributed with
-  ~ this work for additional information regarding copyright ownership.
-  ~ The ASF licenses this file to You under the Apache License, Version 2.0
-  ~ (the "License"); you may not use this file except in compliance with
-  ~ the License.  You may obtain a copy of the License at
-  ~
-  ~    http://www.apache.org/licenses/LICENSE-2.0
-  ~
-  ~ Unless required by applicable law or agreed to in writing, software
-  ~ distributed under the License is distributed on an "AS IS" BASIS,
-  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  ~ See the License for the specific language governing permissions and
-  ~ limitations under the License.
-  -->
-<properties>
-    <parsers>
-        <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
-            <mime>image/jpeg</mime>
-            <mime>image/png</mime>
-            <mime>image/gif</mime>
-            <params>
-                <param name="apiBaseUri" type="uri">http://inception-rest:8764/inception/v4</param>
-                <param name="topN" type="int">2</param>
-                <param name="minConfidence" type="double">0.015</param>
-                <param name="class" type="string">org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser</param>
-            </params>
-        </parser>
-    </parsers>
-</properties>
diff --git a/sample-configs/vision/vlm-claude.json b/sample-configs/vision/vlm-claude.json
new file mode 100644
index 0000000..e233516
--- /dev/null
+++ b/sample-configs/vision/vlm-claude.json
@@ -0,0 +1,18 @@
+{
+  "_comment": [
+    "Vision-Language Model parsing via Anthropic's Claude API.",
+    "Claude can handle OCR images and PDFs natively (no rasterization needed).",
+    "Set apiKey to your Anthropic API key — DO NOT commit a real key.",
+    "Prefer passing it via the ANTHROPIC_API_KEY env var and substituting it",
+    "at container start, e.g. via an entrypoint shim or sidecar that templates",
+    "this file. See docs: configuration/parsers/vlm-parsers."
+  ],
+  "parsers": [
+    {
+      "claude-vlm-parser": {
+        "apiKey": "${ANTHROPIC_API_KEY}",
+        "model": "claude-sonnet-4-20250514"
+      }
+    }
+  ]
+}
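Reviewer sketch (not part of the commit): the `_comment` above suggests substituting `${ANTHROPIC_API_KEY}` at container start. The Python below shows one way an entrypoint shim could do that using only the standard library. `render_config` is a hypothetical helper, and this commit does not say whether Tika itself expands `${...}` placeholders.

```python
import json
import os
from string import Template


def render_config(template_text, env=None):
    """Fill ${NAME} placeholders in a JSON config template from the environment.

    Values are JSON-escaped before substitution so a quote or backslash in a
    secret cannot break the document. Raises KeyError if a placeholder has no
    value, and json.JSONDecodeError if the result is not valid JSON.
    """
    env = os.environ if env is None else env
    # json.dumps adds surrounding quotes; strip them, keep the escaping.
    escaped = {name: json.dumps(value)[1:-1] for name, value in env.items()}
    rendered = Template(template_text).substitute(escaped)
    json.loads(rendered)  # validate before handing the file to the server
    return rendered


# Mirrors the parsers section of vlm-claude.json above.
TEMPLATE = """
{
  "parsers": [
    {
      "claude-vlm-parser": {
        "apiKey": "${ANTHROPIC_API_KEY}",
        "model": "claude-sonnet-4-20250514"
      }
    }
  ]
}
"""

if __name__ == "__main__":
    # Demo value only; a real shim would read the container's environment.
    print(render_config(TEMPLATE, {"ANTHROPIC_API_KEY": "example-key"}))
```

`Template.substitute` is used rather than `safe_substitute` so that a missing variable fails loudly instead of shipping a literal `${ANTHROPIC_API_KEY}` to the API. A template containing any other `$` characters would need them doubled as `$$`.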
diff --git a/sample-configs/vision/vlm-gemini.json b/sample-configs/vision/vlm-gemini.json
new file mode 100644
index 0000000..4c33e69
--- /dev/null
+++ b/sample-configs/vision/vlm-gemini.json
@@ -0,0 +1,17 @@
+{
+  "_comment": [
+    "Vision-Language Model parsing via Google's Gemini generateContent API.",
+    "Gemini can handle OCR images and PDFs natively (no rasterization needed).",
+    "Set apiKey to your Google AI Studio API key — DO NOT commit a real key.",
+    "Prefer GEMINI_API_KEY env var + a templating entrypoint, similar to the",
+    "Claude config. See docs: configuration/parsers/vlm-parsers."
+  ],
+  "parsers": [
+    {
+      "gemini-vlm-parser": {
+        "apiKey": "${GEMINI_API_KEY}",
+        "model": "gemini-2.5-flash"
+      }
+    }
+  ]
+}
diff --git a/sample-configs/vision/vlm-openai.json b/sample-configs/vision/vlm-openai.json
new file mode 100644
index 0000000..2a4b675
--- /dev/null
+++ b/sample-configs/vision/vlm-openai.json
@@ -0,0 +1,19 @@
+{
+  "_comment": [
+    "Vision-Language Model parsing via an OpenAI-compatible endpoint.",
+    "Works with self-hosted backends (vLLM, Ollama, a local FastAPI wrapper)",
+    "or against OpenAI's own chat-completions API. Set baseUrl to wherever",
+    "the OpenAI-compatible endpoint is reachable from the tika container.",
+    "If the endpoint requires authentication, also set apiKey.",
+    "See docs: configuration/parsers/vlm-parsers."
+  ],
+  "parsers": [
+    {
+      "openai-vlm-parser": {
+        "baseUrl": "http://vlm-server:8000",
+        "model": "jinaai/jina-vlm",
+        "timeoutSeconds": 300
+      }
+    }
+  ]
+}
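Reviewer sketch (not part of the commit): the compose file publishes `vlm-server` as `"8000:11434"`, while `baseUrl` here uses port 8000. Compose short-syntax port mappings read `HOST:CONTAINER`, and services on the same compose network reach each other on the container-side port, so it may be worth confirming which port the `tika` service should target. The Python below is a small consistency check covering only the short syntax without port ranges; `container_port` and `port_matches` are hypothetical helpers.

```python
from urllib.parse import urlsplit


def container_port(mapping):
    """Return the container-side port of a compose short-syntax port mapping.

    Handles "CONTAINER", "HOST:CONTAINER" and "IP:HOST:CONTAINER", with an
    optional "/tcp" or "/udp" suffix. Port ranges are not handled.
    """
    without_protocol = mapping.split("/", 1)[0]
    return int(without_protocol.rsplit(":", 1)[-1])


def port_matches(mapping, base_url):
    """True if base_url targets the container-side port of the mapping."""
    return urlsplit(base_url).port == container_port(mapping)


if __name__ == "__main__":
    mapping = "8000:11434"               # from docker-compose-tika-vision.yml
    base_url = "http://vlm-server:8000"  # from vlm-openai.json
    print(container_port(mapping), port_matches(mapping, base_url))
```

With the values from this commit the check reports a mismatch (container port 11434, URL port 8000). If the Ollama container listens on its default port, `http://vlm-server:11434` may be the address reachable from the `tika` service. This is an observation for the author to confirm, not a tested result.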
