This is an automated email from the ASF dual-hosted git repository.
tballison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika-docker.git
The following commit(s) were added to refs/heads/main by this push:
new d5e9c4e update configs for 4.x
d5e9c4e is described below
commit d5e9c4ec3a6c4c6ee839770bf5b36f83a4ffa574
Author: tallison <[email protected]>
AuthorDate: Mon May 11 16:37:44 2026 -0400
update configs for 4.x
---
README.md | 30 +++++-----
docker-compose-tika-customocr.yml | 15 +++--
docker-compose-tika-grobid.yml | 11 +++-
docker-compose-tika-ner.yml | 30 ----------
docker-compose-tika-vision.yml | 64 ++++++++++++----------
sample-configs/customocr/tika-config-inline.json | 11 ++++
sample-configs/customocr/tika-config-inline.xml | 31 -----------
sample-configs/customocr/tika-config-rendered.json | 16 ++++++
sample-configs/customocr/tika-config-rendered.xml | 38 -------------
sample-configs/grobid/tika-config.json | 10 ++++
sample-configs/grobid/tika-config.xml | 24 --------
sample-configs/ner/run_tika_server.sh | 62 ---------------------
sample-configs/ner/tika-config.xml | 28 ----------
sample-configs/vision/inception-rest-caption.xml | 32 -----------
sample-configs/vision/inception-rest-video.xml | 32 -----------
sample-configs/vision/inception-rest.xml | 32 -----------
sample-configs/vision/vlm-claude.json | 18 ++++++
sample-configs/vision/vlm-gemini.json | 17 ++++++
sample-configs/vision/vlm-openai.json | 19 +++++++
19 files changed, 162 insertions(+), 358 deletions(-)
diff --git a/README.md b/README.md
index 59b6c6f..05b874a 100644
--- a/README.md
+++ b/README.md
@@ -152,22 +152,27 @@ From version 1.25 and 1.25-full of the image it is now easier to override the de
So for example if you wish to disable the OCR parser in the full image you could write a custom configuration:
```
-cat <<EOT >> tika-config.xml
-<?xml version="1.0" encoding="UTF-8"?>
-<properties>
- <parsers>
- <parser class="org.apache.tika.parser.DefaultParser">
- <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
- </parser>
- </parsers>
-</properties>
+cat <<EOT >> tika-config.json
+{
+ "parsers": [
+ { "default-parser": {} },
+ { "tesseract-ocr-parser": { "skipOcr": true } }
+ ]
+}
EOT
```
Then by mounting this custom configuration as a volume, you could pass the command line parameter to load it
- docker run -d -p 127.0.0.1:9998:9998 -v `pwd`/tika-config.xml:/tika-config.xml apache/tika:2.5.0-full --config /tika-config.xml
+ docker run -d -p 127.0.0.1:9998:9998 -v `pwd`/tika-config.json:/tika-config.json apache/tika:<tag>-full -c /tika-config.json
-You can see more configuration examples [here](https://tika.apache.org/2.5.0/configuring.html).
+NOTE: Tika 4.x replaced the XML `tika-config.xml` format with JSON
+`tika-config.json` (see TIKA-4544). The older XML form is what 2.x / 3.x
+images expect; if you're pinned to those tags, keep using the XML.
+
+You can see more configuration examples on the
+[Tika website](https://tika.apache.org/) and in the canonical samples under
+`tika-server/tika-server-core/src/test/resources/config-examples/` in the
+source tree.
As of 2.5.0.2, if you'd like to add extra jars from your local `my-jars` directory to Tika's classpath, mount to `/tika-extras` like so:
@@ -182,10 +187,9 @@ There are a number of sample Docker Compose files included in the repos to allow
These files use docker-compose 3.x series and include:
-* docker-compose-tika-vision.yml - TensorFlow Inception REST API Vision examples
+* docker-compose-tika-vision.yml - Vision-Language Model parsing example (OpenAI-compatible / Claude / Gemini)
* docker-compose-tika-grobid.yml - Grobid REST parsing example
* docker-compose-tika-customocr.yml - Tesseract OCR example with custom configuration
-* docker-compose-tika-ner.yml - Named Entity Recognition example
The Docker Compose files and configurations (sourced from _sample-configs_ directory) all have comments in them so you can try different options, or use them as a base to create your own custom configuration.
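Review note on the README hunk above: the heredoc uses `>>`, which appends, so running the snippet a second time would leave two JSON documents in one file. A minimal sketch of the same step written with `>` follows. The JSON body is copied from the hunk; the optional `python3 -m json.tool` check is an assumption about what is installed locally and is not part of the commit.

```shell
#!/bin/sh
# Sketch only: same JSON as the README hunk, but written with '>' so that
# re-running the snippet replaces the file instead of appending to it.
# The quoted 'EOT' delimiter stops the shell from expanding anything inside.
cat > tika-config.json <<'EOT'
{
  "parsers": [
    { "default-parser": {} },
    { "tesseract-ocr-parser": { "skipOcr": true } }
  ]
}
EOT

# Optional sanity check (assumes python3 is on the PATH; skipped otherwise).
if command -v python3 >/dev/null 2>&1; then
  python3 -m json.tool tika-config.json >/dev/null && echo "tika-config.json parses as JSON"
fi
```

The resulting file can then be mounted exactly as in the `docker run` line of the hunk.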
diff --git a/docker-compose-tika-customocr.yml b/docker-compose-tika-customocr.yml
index 7428c2d..29cf667 100644
--- a/docker-compose-tika-customocr.yml
+++ b/docker-compose-tika-customocr.yml
@@ -19,16 +19,21 @@ services:
## Apache Tika Server
tika:
image: apache/tika:${TAG}-full
- # Override default so we can add configuration on classpath
- entrypoint: [ "/bin/sh", "-c", "exec java -cp \"/customocr:/tika-server-standard-$${TIKA_VERSION}.jar:/tika-extras/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $$0 $$@"]
+ # Override default so we can add the /customocr dir on the classpath
+ # (for the bundled TesseractOCRConfig.properties). The 4.x image layout
+ # places the thin server jar at /opt/tika-server/tika-server.jar and its
+ # deps at /opt/tika-server/lib/*. working_dir=/opt/tika-server matters for
+ # tika-server's plugin-roots fallback (see TikaServerProcess#resolveDefaultPluginsDir).
+ entrypoint: [ "/bin/sh", "-c", "exec java -cp \"/customocr:/opt/tika-server/tika-server.jar:/opt/tika-server/lib/*:/tika-extras/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $$0 $$@"]
+ working_dir: /opt/tika-server
# Kept command as example but could be added to entrypoint too
- command: -c /tika-config.xml
+ command: -c /tika-config.json
restart: on-failure
ports:
- "9998:9998"
volumes:
# Choose the configuration you want, or add your own custom one
- # - ./sample-configs/customocr/tika-config-inline.xml:/tika-config.xml
- - ./sample-configs/customocr/tika-config-rendered.xml:/tika-config.xml
+ # - ./sample-configs/customocr/tika-config-inline.json:/tika-config.json
+ - ./sample-configs/customocr/tika-config-rendered.json:/tika-config.json
diff --git a/docker-compose-tika-grobid.yml b/docker-compose-tika-grobid.yml
index 4c056ae..add5d27 100644
--- a/docker-compose-tika-grobid.yml
+++ b/docker-compose-tika-grobid.yml
@@ -19,10 +19,15 @@ services:
## Apache Tika Server
tika:
image: apache/tika:${TAG}-full
- # Override default so we can add configuration on classpath
- entrypoint: [ "/bin/sh", "-c", "exec java -cp \"/grobid:/tika-server-standard-$${TIKA_VERSION}.jar:/tika-extras/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $$0 $$@"]
+ # Override default so we can add the /grobid dir on the classpath
+ # (for the bundled GrobidExtractor.properties). The 4.x image layout
+ # places the thin server jar at /opt/tika-server/tika-server.jar and its
+ # deps at /opt/tika-server/lib/*. working_dir=/opt/tika-server matters for
+ # tika-server's plugin-roots fallback.
+ entrypoint: [ "/bin/sh", "-c", "exec java -cp \"/grobid:/opt/tika-server/tika-server.jar:/opt/tika-server/lib/*:/tika-extras/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $$0 $$@"]
+ working_dir: /opt/tika-server
# Kept command as example but could be added to entrypoint too
- command: -c /grobid/tika-config.xml
+ command: -c /grobid/tika-config.json
restart: on-failure
ports:
- "9998:9998"
diff --git a/docker-compose-tika-ner.yml b/docker-compose-tika-ner.yml
deleted file mode 100644
index 50e896a..0000000
--- a/docker-compose-tika-ner.yml
+++ /dev/null
@@ -1,30 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-version: "3.8"
-services:
-
- ## Apache Tika Server
- tika:
- image: apache/tika:${TAG}-full
- # Use custom script as entrypoint to go fetch models and setup recognisers
- entrypoint: [ "/ner/run_tika_server.sh"]
- restart: on-failure
- ports:
- - "9998:9998"
- volumes:
- - ./sample-configs/ner/:/ner/
- environment:
- - TAG
\ No newline at end of file
diff --git a/docker-compose-tika-vision.yml b/docker-compose-tika-vision.yml
index 9e054ec..da01d03 100644
--- a/docker-compose-tika-vision.yml
+++ b/docker-compose-tika-vision.yml
@@ -13,42 +13,50 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-version: "3.8"
+# Vision-Language Model parsing for tika-server (Tika 4.x).
+#
+# The pre-4.x inception-rest / Im2txt / inception-video services and the
+# org.apache.tika.parser.recognition.ObjectRecognitionParser they served
+# have been removed (TIKA-4499 / TIKA-4500). The 4.x replacement is a
+# family of VLM parsers (OpenAI-compatible, Anthropic Claude, Google
+# Gemini). See:
+#
+# docs/modules/ROOT/pages/configuration/parsers/vlm-parsers.adoc
+#
+# This compose demonstrates the OpenAI-compatible variant pointing at a
+# locally-hosted Ollama instance. To use a different VLM:
+# - Swap the mounted tika-config.* for vlm-claude.json or vlm-gemini.json
+# and pass the relevant API key via env (ANTHROPIC_API_KEY /
+# GEMINI_API_KEY).
+# - Drop the vlm-server service block below.
+
services:
-
- ## Apache Tika Server
+
+ ## Apache Tika Server
tika:
image: apache/tika:latest-full
- command: -c /tika-config.xml
+ command: -c /tika-config.json
restart: on-failure
ports:
- "9998:9998"
-
volumes:
- # Replace the below with the configuration you want to use, or with your own custom one
- # - ./sample-configs/vision/inception-rest.xml:/tika-config.xml
- # - ./sample-configs/vision/inception-rest-video.xml:/tika-config.xml
- - ./sample-configs/vision/inception-rest-caption.xml:/tika-config.xml
-
+ - ./sample-configs/vision/vlm-openai.json:/tika-config.json
+ # - ./sample-configs/vision/vlm-claude.json:/tika-config.json
+ # - ./sample-configs/vision/vlm-gemini.json:/tika-config.json
depends_on:
- # You can comment out any you don't need here and in the Vision Service section below
- - inception-rest
- - inception-caption
- - inception-video
-
- ## Vision Services
- inception-rest:
- build: https://raw.githubusercontent.com/dameikle/tika-dockers/patch-1/InceptionRestDockerfile
- ports:
- - "8764:8764"
-
- inception-caption:
- build: https://raw.githubusercontent.com/dameikle/tika-dockers/patch-1/Im2txtRestDockerfile
- ports:
- - "8765:8764"
+ - vlm-server
- inception-video:
- build: https://raw.githubusercontent.com/dameikle/tika-dockers/patch-1/InceptionVideoRestDockerfile
+ ## Local OpenAI-compatible VLM endpoint.
+ ## Replace with vLLM, your own FastAPI wrapper, or remove and point
+ ## baseUrl in vlm-openai.json at OpenAI's real API.
+ vlm-server:
+ image: ollama/ollama:latest
ports:
- - "8766:8764"
+ - "8000:11434"
+ # Volumes for pulled models. Uncomment and pull a vision-capable model
+ # (e.g. `docker exec <container> ollama pull llava`) before first use.
+ # volumes:
+ # - ollama-models:/root/.ollama
+# volumes:
+# ollama-models:
diff --git a/sample-configs/customocr/tika-config-inline.json b/sample-configs/customocr/tika-config-inline.json
new file mode 100644
index 0000000..055e72c
--- /dev/null
+++ b/sample-configs/customocr/tika-config-inline.json
@@ -0,0 +1,11 @@
+{
+ "_comment": "Extract inline images from PDF and OCR them with Tesseract.",
+ "parsers": [
+ { "tesseract-ocr-parser": {} },
+ {
+ "pdf-parser": {
+ "extractInlineImages": true
+ }
+ }
+ ]
+}
diff --git a/sample-configs/customocr/tika-config-inline.xml b/sample-configs/customocr/tika-config-inline.xml
deleted file mode 100644
index 1c9b613..0000000
--- a/sample-configs/customocr/tika-config-inline.xml
+++ /dev/null
@@ -1,31 +0,0 @@
-<?xml version="1.0" encoding="UTF-8" standalone="no"?>
-<!--
- ~ Licensed to the Apache Software Foundation (ASF) under one or more
- ~ contributor license agreements. See the NOTICE file distributed with
- ~ this work for additional information regarding copyright ownership.
- ~ The ASF licenses this file to You under the Apache License, Version 2.0
- ~ (the "License"); you may not use this file except in compliance with
- ~ the License. You may obtain a copy of the License at
- ~
- ~ http://www.apache.org/licenses/LICENSE-2.0
- ~
- ~ Unless required by applicable law or agreed to in writing, software
- ~ distributed under the License is distributed on an "AS IS" BASIS,
- ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- ~ See the License for the specific language governing permissions and
- ~ limitations under the License.
- -->
-<properties>
- <parsers>
- <!-- Load TesseractOCRParser (could use DefaultParser if you want others too) -->
- <parser class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
-
- <!-- Extract and OCR Inline Images in PDF -->
- <parser class="org.apache.tika.parser.pdf.PDFParser">
- <params>
- <param name="extractInlineImages" type="bool">true</param>
- </params>
- </parser>
-
- </parsers>
-</properties>
diff --git a/sample-configs/customocr/tika-config-rendered.json b/sample-configs/customocr/tika-config-rendered.json
new file mode 100644
index 0000000..45f3d3b
--- /dev/null
+++ b/sample-configs/customocr/tika-config-rendered.json
@@ -0,0 +1,16 @@
+{
+ "_comment": [
+ "Render each PDF page as an image and run Tesseract on it.",
+ "ocrStrategy options: no_ocr, ocr_only, ocr_and_text, auto."
+ ],
+ "parsers": [
+ { "tesseract-ocr-parser": {} },
+ {
+ "pdf-parser": {
+ "ocrStrategy": "ocr_only",
+ "ocrImageType": "rgb",
+ "ocrDPI": 100
+ }
+ }
+ ]
+}
diff --git a/sample-configs/customocr/tika-config-rendered.xml b/sample-configs/customocr/tika-config-rendered.xml
deleted file mode 100644
index bcd8666..0000000
--- a/sample-configs/customocr/tika-config-rendered.xml
+++ /dev/null
@@ -1,38 +0,0 @@
-<?xml version="1.0" encoding="UTF-8" standalone="no"?>
-<!--
- ~ Licensed to the Apache Software Foundation (ASF) under one or more
- ~ contributor license agreements. See the NOTICE file distributed with
- ~ this work for additional information regarding copyright ownership.
- ~ The ASF licenses this file to You under the Apache License, Version 2.0
- ~ (the "License"); you may not use this file except in compliance with
- ~ the License. You may obtain a copy of the License at
- ~
- ~ http://www.apache.org/licenses/LICENSE-2.0
- ~
- ~ Unless required by applicable law or agreed to in writing, software
- ~ distributed under the License is distributed on an "AS IS" BASIS,
- ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- ~ See the License for the specific language governing permissions and
- ~ limitations under the License.
- -->
-<properties>
- <parsers>
- <!-- Load TesseractOCRParser (could use DefaultParser if you want others too) -->
- <parser class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
-
- <!-- OCR on Rendered Pages -->
- <parser class="org.apache.tika.parser.pdf.PDFParser">
- <params>
- <!-- no_ocr - extract text only
- ocr_only - don't extract text and just attempt OCR
- ocr_and_text - extract text and attempt OCR (from Tika 1.24)
- auto - extract text but if < 10 characters try OCR
- -->
- <param name="ocrStrategy" type="string">ocr_only</param>
- <param name="ocrImageType" type="string">rgb</param>
- <param name="ocrDPI" type="int">100</param>
- </params>
- </parser>
-
- </parsers>
-</properties>
diff --git a/sample-configs/grobid/tika-config.json b/sample-configs/grobid/tika-config.json
new file mode 100644
index 0000000..943ec19
--- /dev/null
+++ b/sample-configs/grobid/tika-config.json
@@ -0,0 +1,10 @@
+{
+ "_comment": "Route PDFs through GROBID (via JournalParser) for journal-article extraction.",
+ "parsers": [
+ {
+ "journal-parser": {
+ "_mime-include": ["application/pdf"]
+ }
+ }
+ ]
+}
diff --git a/sample-configs/grobid/tika-config.xml b/sample-configs/grobid/tika-config.xml
deleted file mode 100644
index 5b4aad9..0000000
--- a/sample-configs/grobid/tika-config.xml
+++ /dev/null
@@ -1,24 +0,0 @@
-<?xml version="1.0" encoding="UTF-8" standalone="no"?>
-<!--
- ~ Licensed to the Apache Software Foundation (ASF) under one or more
- ~ contributor license agreements. See the NOTICE file distributed with
- ~ this work for additional information regarding copyright ownership.
- ~ The ASF licenses this file to You under the Apache License, Version 2.0
- ~ (the "License"); you may not use this file except in compliance with
- ~ the License. You may obtain a copy of the License at
- ~
- ~ http://www.apache.org/licenses/LICENSE-2.0
- ~
- ~ Unless required by applicable law or agreed to in writing, software
- ~ distributed under the License is distributed on an "AS IS" BASIS,
- ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- ~ See the License for the specific language governing permissions and
- ~ limitations under the License.
- -->
-<properties>
- <parsers>
- <parser class="org.apache.tika.parser.journal.JournalParser">
- <mime>application/pdf</mime>
- </parser>
- </parsers>
-</properties>
diff --git a/sample-configs/ner/run_tika_server.sh b/sample-configs/ner/run_tika_server.sh
deleted file mode 100755
index fb447be..0000000
--- a/sample-configs/ner/run_tika_server.sh
+++ /dev/null
@@ -1,62 +0,0 @@
-#!/bin/bash
-
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-#############################################################################
-# See https://cwiki.apache.org/confluence/display/TIKA/TikaAndNER for details
-# on how to configure additional NER libraries
-#############################################################################
-
-# ------------------------------------
-# Download OpenNLP Models to classpath
-# ------------------------------------
-
-OPENNLP_LOCATION="/ner/org/apache/tika/parser/ner/opennlp"
-URL="http://opennlp.sourceforge.net/models-1.5"
-
-mkdir -p $OPENNLP_LOCATION
-if [ "$(ls -A $OPENNLP_LOCATION/*.bin)" ]; then
- echo "OpenNLP models directory has files, so skipping fetch";
-else
- echo "No OpenNLP models found, so fetching them"
- wget "$URL/en-ner-person.bin" -O $OPENNLP_LOCATION/ner-person.bin
- wget "$URL/en-ner-location.bin" -O $OPENNLP_LOCATION/ner-location.bin
- wget "$URL/en-ner-organization.bin" -O $OPENNLP_LOCATION/ner-organization.bin;
- wget "$URL/en-ner-date.bin" -O $OPENNLP_LOCATION/ner-date.bin
- wget "$URL/en-ner-time.bin" -O $OPENNLP_LOCATION/ner-time.bin
- wget "$URL/en-ner-percentage.bin" -O $OPENNLP_LOCATION/ner-percentage.bin
- wget "$URL/en-ner-money.bin" -O $OPENNLP_LOCATION/ner-money.bin
-fi
-
-# --------------------------------------------
-# Create RexExp Example for Email on classpath
-# --------------------------------------------
-REGEXP_LOCATION="/ner/org/apache/tika/parser/ner/regex"
-mkdir -p $REGEXP_LOCATION
-echo "EMAIL=(?:[a-z0-9!#$%&'*+/=?^_\`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_\`{|}~-]+)*|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])" > $REGEXP_LOCATION/ner-regex.txt
-
-
-# -------------------
-# Now run Tika Server
-# -------------------
-
-# Can be a single implementation or comma seperated list for multiple for "ner.impl.class" property
-RECOGNISERS=org.apache.tika.parser.ner.opennlp.OpenNLPNERecogniser,org.apache.tika.parser.ner.regex.RegexNERecogniser
-# Set classpath to the Tika Server JAR and the /ner folder so it has the configuration and models from above
-CLASSPATH="/ner:/tika-server-standard-${TIKA_VERSION}.jar:/tika-extras/*"
-# Run the server with the custom configuration ner.impl.class property and custom /ner/tika-config.xml
-exec java -Dner.impl.class=$RECOGNISERS -cp $CLASSPATH org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 -c /ner/tika-config.xml
\ No newline at end of file
diff --git a/sample-configs/ner/tika-config.xml b/sample-configs/ner/tika-config.xml
deleted file mode 100644
index 65d5774..0000000
--- a/sample-configs/ner/tika-config.xml
+++ /dev/null
@@ -1,28 +0,0 @@
-<?xml version="1.0" encoding="UTF-8" standalone="no"?>
-<!--
- ~ Licensed to the Apache Software Foundation (ASF) under one or more
- ~ contributor license agreements. See the NOTICE file distributed with
- ~ this work for additional information regarding copyright ownership.
- ~ The ASF licenses this file to You under the Apache License, Version 2.0
- ~ (the "License"); you may not use this file except in compliance with
- ~ the License. You may obtain a copy of the License at
- ~
- ~ http://www.apache.org/licenses/LICENSE-2.0
- ~
- ~ Unless required by applicable law or agreed to in writing, software
- ~ distributed under the License is distributed on an "AS IS" BASIS,
- ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- ~ See the License for the specific language governing permissions and
- ~ limitations under the License.
- -->
-<properties>
- <parsers>
- <parser class="org.apache.tika.parser.ner.NamedEntityParser">
- <mime>application/pdf</mime>
- <mime>text/plain</mime>
- <mime>text/html</mime>
- <mime>application/xhtml+xml</mime>
- </parser>
- </parsers>
-</properties>
-
diff --git a/sample-configs/vision/inception-rest-caption.xml b/sample-configs/vision/inception-rest-caption.xml
deleted file mode 100644
index c70c207..0000000
--- a/sample-configs/vision/inception-rest-caption.xml
+++ /dev/null
@@ -1,32 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
- ~ Licensed to the Apache Software Foundation (ASF) under one or more
- ~ contributor license agreements. See the NOTICE file distributed with
- ~ this work for additional information regarding copyright ownership.
- ~ The ASF licenses this file to You under the Apache License, Version 2.0
- ~ (the "License"); you may not use this file except in compliance with
- ~ the License. You may obtain a copy of the License at
- ~
- ~ http://www.apache.org/licenses/LICENSE-2.0
- ~
- ~ Unless required by applicable law or agreed to in writing, software
- ~ distributed under the License is distributed on an "AS IS" BASIS,
- ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- ~ See the License for the specific language governing permissions and
- ~ limitations under the License.
- -->
-<properties>
- <parsers>
- <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
- <mime>image/jpeg</mime>
- <mime>image/png</mime>
- <mime>image/gif</mime>
- <params>
- <param name="apiBaseUri" type="uri">http://inception-caption:8764/inception/v3</param>
- <param name="captions" type="int">5</param>
- <param name="maxCaptionLength" type="int">15</param>
- <param name="class" type="string">org.apache.tika.parser.captioning.tf.TensorflowRESTCaptioner</param>
- </params>
- </parser>
- </parsers>
-</properties>
\ No newline at end of file
diff --git a/sample-configs/vision/inception-rest-video.xml b/sample-configs/vision/inception-rest-video.xml
deleted file mode 100644
index f6a4e6a..0000000
--- a/sample-configs/vision/inception-rest-video.xml
+++ /dev/null
@@ -1,32 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
- ~ Licensed to the Apache Software Foundation (ASF) under one or more
- ~ contributor license agreements. See the NOTICE file distributed with
- ~ this work for additional information regarding copyright ownership.
- ~ The ASF licenses this file to You under the Apache License, Version 2.0
- ~ (the "License"); you may not use this file except in compliance with
- ~ the License. You may obtain a copy of the License at
- ~
- ~ http://www.apache.org/licenses/LICENSE-2.0
- ~
- ~ Unless required by applicable law or agreed to in writing, software
- ~ distributed under the License is distributed on an "AS IS" BASIS,
- ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- ~ See the License for the specific language governing permissions and
- ~ limitations under the License.
- -->
-<properties>
- <parsers>
- <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
- <mime>video/mp4</mime>
- <mime>video/quicktime</mime>
- <params>
- <param name="apiBaseUri" type="uri">http://inception-video:8764/inception/v4</param>
- <param name="topN" type="int">4</param>
- <param name="minConfidence" type="double">0.015</param>
- <param name="mode" type="string">fixed</param>
- <param name="class" type="string">org.apache.tika.parser.recognition.tf.TensorflowRESTVideoRecogniser</param>
- </params>
- </parser>
- </parsers>
-</properties>
\ No newline at end of file
diff --git a/sample-configs/vision/inception-rest.xml b/sample-configs/vision/inception-rest.xml
deleted file mode 100644
index caa6468..0000000
--- a/sample-configs/vision/inception-rest.xml
+++ /dev/null
@@ -1,32 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
- ~ Licensed to the Apache Software Foundation (ASF) under one or more
- ~ contributor license agreements. See the NOTICE file distributed with
- ~ this work for additional information regarding copyright ownership.
- ~ The ASF licenses this file to You under the Apache License, Version 2.0
- ~ (the "License"); you may not use this file except in compliance with
- ~ the License. You may obtain a copy of the License at
- ~
- ~ http://www.apache.org/licenses/LICENSE-2.0
- ~
- ~ Unless required by applicable law or agreed to in writing, software
- ~ distributed under the License is distributed on an "AS IS" BASIS,
- ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- ~ See the License for the specific language governing permissions and
- ~ limitations under the License.
- -->
-<properties>
- <parsers>
- <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
- <mime>image/jpeg</mime>
- <mime>image/png</mime>
- <mime>image/gif</mime>
- <params>
- <param name="apiBaseUri" type="uri">http://inception-rest:8764/inception/v4</param>
- <param name="topN" type="int">2</param>
- <param name="minConfidence" type="double">0.015</param>
- <param name="class" type="string">org.apache.tika.parser.recognition.tf.TensorflowRESTRecogniser</param>
- </params>
- </parser>
- </parsers>
-</properties>
diff --git a/sample-configs/vision/vlm-claude.json b/sample-configs/vision/vlm-claude.json
new file mode 100644
index 0000000..e233516
--- /dev/null
+++ b/sample-configs/vision/vlm-claude.json
@@ -0,0 +1,18 @@
+{
+ "_comment": [
+ "Vision-Language Model parsing via Anthropic's Claude API.",
+ "Claude can handle OCR images and PDFs natively (no rasterization needed).",
+ "Set apiKey to your Anthropic API key — DO NOT commit a real key.",
+ "Prefer passing it via the ANTHROPIC_API_KEY env var and substituting it",
+ "at container start, e.g. via an entrypoint shim or sidecar that templates",
+ "this file. See docs: configuration/parsers/vlm-parsers."
+ ],
+ "parsers": [
+ {
+ "claude-vlm-parser": {
+ "apiKey": "${ANTHROPIC_API_KEY}",
+ "model": "claude-sonnet-4-20250514"
+ }
+ }
+ ]
+}
diff --git a/sample-configs/vision/vlm-gemini.json b/sample-configs/vision/vlm-gemini.json
new file mode 100644
index 0000000..4c33e69
--- /dev/null
+++ b/sample-configs/vision/vlm-gemini.json
@@ -0,0 +1,17 @@
+{
+ "_comment": [
+ "Vision-Language Model parsing via Google's Gemini generateContent API.",
+ "Gemini can handle OCR images and PDFs natively (no rasterization needed).",
+ "Set apiKey to your Google AI Studio API key — DO NOT commit a real key.",
+ "Prefer GEMINI_API_KEY env var + a templating entrypoint, similar to the",
+ "Claude config. See docs: configuration/parsers/vlm-parsers."
+ ],
+ "parsers": [
+ {
+ "gemini-vlm-parser": {
+ "apiKey": "${GEMINI_API_KEY}",
+ "model": "gemini-2.5-flash"
+ }
+ }
+ ]
+}
diff --git a/sample-configs/vision/vlm-openai.json b/sample-configs/vision/vlm-openai.json
new file mode 100644
index 0000000..2a4b675
--- /dev/null
+++ b/sample-configs/vision/vlm-openai.json
@@ -0,0 +1,19 @@
+{
+ "_comment": [
+ "Vision-Language Model parsing via an OpenAI-compatible endpoint.",
+ "Works with self-hosted backends (vLLM, Ollama, a local FastAPI wrapper)",
+ "or against OpenAI's own chat-completions API. Set baseUrl to wherever",
+ "the OpenAI-compatible endpoint is reachable from the tika container.",
+ "If the endpoint requires authentication, also set apiKey.",
+ "See docs: configuration/parsers/vlm-parsers."
+ ],
+ "parsers": [
+ {
+ "openai-vlm-parser": {
+ "baseUrl": "http://vlm-server:11434",
+ "model": "jinaai/jina-vlm",
+ "timeoutSeconds": 300
+ }
+ }
+ ]
+}
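
Review note on the `_comment` in vlm-claude.json: it suggests substituting `${ANTHROPIC_API_KEY}` at container start via an entrypoint shim, but the commit does not include one. The following is a minimal sketch of such a shim, not the project's implementation. The function name `render_config` and the template and output paths in the usage comment are hypothetical; the `java -cp` command line is the one from the customocr and grobid compose files, minus their extra classpath directory. The sketch assumes the key contains no characters that need JSON escaping.

```shell
#!/bin/sh
# Hypothetical entrypoint shim: copies a config template to a new file,
# replacing the literal text ${ANTHROPIC_API_KEY} with the value of the
# ANTHROPIC_API_KEY environment variable.
render_config() {
  template="$1"
  output="$2"
  # Refuse to continue rather than start the server with an empty key.
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set" >&2
    return 1
  fi
  # Escape the characters that are special in the replacement part of
  # sed's s command when '|' is used as the delimiter: \, & and |.
  escaped_key=$(printf '%s' "$ANTHROPIC_API_KEY" | sed 's/[\\&|]/\\&/g')
  sed "s|\${ANTHROPIC_API_KEY}|${escaped_key}|g" "$template" > "$output"
}

# Illustrative usage inside a container entrypoint (paths are assumptions):
#   render_config /templates/vlm-claude.json /tmp/tika-config.json &&
#   exec java -cp "/opt/tika-server/tika-server.jar:/opt/tika-server/lib/*:/tika-extras/*" \
#     org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 -c /tmp/tika-config.json
```

Writing the rendered file to a path outside the mounted sample directory keeps the real key out of the working tree, which matches the "DO NOT commit a real key" warning in the config comments. The same approach would apply to `GEMINI_API_KEY` in vlm-gemini.json.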