Copilot commented on code in PR #2852:
URL: https://github.com/apache/tika/pull/2852#discussion_r3333606622
##########
tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java:
##########
@@ -815,7 +839,8 @@ private void usage() {
out.println(" -l or --language Output only language");
out.println(" -d or --detect Detect document type");
out.println(" --digest=X Include digest X (md2, md5, sha1,");
- out.println(" sha256, sha384, sha512");
+ out.println(" sha256, sha384, sha512,");
+ out.println(" sha3_256, sha3_384, sha3_512)");
Review Comment:
The help text advertises SHA3 digests, but `--digest` wires up
`CommonsDigesterFactory`, and `CommonsDigester` throws
`UnsupportedOperationException` for `SHA3_*` algorithms. This makes
`--digest=sha3_256` fail at runtime, so the usage text is currently misleading.
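
One possible way to keep the CLI and its help text in sync is to validate the requested algorithm up front and fail with a clear message. The following is only a minimal sketch: the class `DigestOptionCheck` and its `validate` method are hypothetical and not part of Tika, and the supported set is an assumption taken from the pre-change usage text.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical guard for the --digest option; not part of Tika. */
final class DigestOptionCheck {

    // Assumption: these are the names the commons-codec-backed digester
    // handles, taken from the usage text before this change.
    private static final Set<String> COMMONS_SUPPORTED =
            Set.of("md2", "md5", "sha1", "sha256", "sha384", "sha512");

    /** Returns the normalized algorithm name or rejects it with a clear message. */
    static String validate(String requested) {
        String name = requested.trim().toLowerCase(Locale.ROOT);
        if (!COMMONS_SUPPORTED.contains(name)) {
            throw new IllegalArgumentException(
                    "--digest=" + requested + " is not supported by the default digester; "
                            + "supported: md2, md5, sha1, sha256, sha384, sha512");
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(validate("SHA256"));
        try {
            validate("sha3_256");
        } catch (IllegalArgumentException e) {
            // Fails fast with a readable message instead of a runtime exception
            // from deep inside the digester.
            System.out.println(e.getMessage());
        }
    }
}
```

If SHA3 support is wanted in the CLI, the alternative is to wire `--digest` to a digester that actually implements those algorithms, in which case the help text could stay as proposed.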
##########
docs/modules/ROOT/pages/using-tika/cli/index.adoc:
##########
@@ -95,22 +118,150 @@ java -jar tika-app.jar [option...] [file|port...]
|Option |Description
|`-x` or `--xml`
-|Output XHTML (default)
+|Output XHTML content (default)
|`-h` or `--html`
-|Output HTML
+|Output HTML content
|`-t` or `--text`
-|Output plain text
+|Output plain text content (body)
|`--md`
-|Output Markdown
+|Output Markdown content (body)
+
+|`-T` or `--text-main`
+|Output plain text — main content only, via the boilerpipe handler
+
+|`-A` or `--text-all`
+|Output all text content
|`-m` or `--metadata`
|Output metadata only
|`-j` or `--json`
-|Output JSON metadata
+|Output metadata in JSON
+
+|`-y` or `--xmp`
+|Output metadata in XMP
+
+|`-J` or `--jsonRecursive`
+|Output metadata and content from all embedded files. Combine with `-x`/`-h`/`-t`/`-m` to choose the content type (default: `-x`).
+
+|`-r` or `--pretty-print`
+|For JSON, XML, and XHTML output, add newlines and whitespace for readability.
+
+|`-e<X>` or `--encoding=<X>`
+|Use output encoding `<X>` (e.g. `UTF-8`).
+|===
+
+=== Detection and Language
+
+[cols="1,3"]
+|===
+|Option |Description
+
+|`-d` or `--detect`
+|Detect the document type and print the media type.
+
+|`-l` or `--language`
+|Detect and print only the language.
+|===
+
+=== Content Options
+
+[cols="1,3"]
+|===
+|Option |Description
+
+|`-p<X>` or `--password=<X>`
+|Use document password `<X>` (for encrypted PDFs, OOXML, etc.).
+
+|`--digest=<X>`
+|Include a digest of the parsed bytes. Supported: `md2`, `md5`, `sha1`, `sha256`, `sha384`, `sha512`, `sha3_256`, `sha3_384`, `sha3_512`. See xref:configuration/digesters.adoc[Digesters] for the underlying providers.
Review Comment:
This option table claims `--digest` supports `sha3_*`, but tika-app's
`--digest` implementation uses `CommonsDigesterFactory` and will throw for
`SHA3_*`. Please remove SHA3 algorithms here (or qualify them as requiring a
bouncy-castle digester configured via JSON).
##########
tika-parsers/tika-parsers-ml/tika-vlm/src/test/resources/config-examples/claude-vlm-basic.json:
##########
@@ -0,0 +1,10 @@
+{
+ "parsers": [
+ {
+ "claude-vlm-parser": {
+ "apiKey": "sk-ant-your-key-here",
+ "model": "claude-sonnet-4-20250514"
Review Comment:
This example uses an API key placeholder that matches a real Anthropic key
prefix (`sk-ant-...`). This can trigger secret scanning and is easy to mistake
for an actual credential. Use a clearly non-key placeholder (e.g.,
`YOUR_ANTHROPIC_API_KEY`).
##########
tika-parsers/tika-parsers-ml/tika-vlm/src/test/resources/config-examples/claude-vlm-full.json:
##########
@@ -0,0 +1,20 @@
+{
+ "parsers": [
+ {
+ "claude-vlm-parser": {
+ "baseUrl": "https://api.anthropic.com",
+ "model": "claude-sonnet-4-20250514",
+ "prompt": "Extract all visible text from this image. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the image. Only return the extracted text.",
+ "maxTokens": 4096,
+ "timeoutSeconds": 300,
+ "apiKey": "sk-ant-your-key-here",
+ "inlineContent": true,
Review Comment:
This example uses an API key placeholder that matches a real Anthropic key
prefix (`sk-ant-...`). This can trigger secret scanning and is easy to mistake
for an actual credential. Use a clearly non-key placeholder (e.g.,
`YOUR_ANTHROPIC_API_KEY`).
##########
tika-parsers/tika-parsers-ml/tika-vlm/src/test/resources/config-examples/vlm-pdf-parsing.json:
##########
@@ -0,0 +1,16 @@
+{
+ "parsers": [
+ {
+ "default-parser": {
+ "exclude": ["pdf-parser"]
+ }
+ },
+ {
+ "claude-vlm-parser": {
+ "apiKey": "sk-ant-your-key-here",
+ "model": "claude-sonnet-4-20250514",
+ "prompt": "Extract all text from this document. Return the text in markdown format, preserving the original structure (headings, lists, tables, paragraphs). Do not describe the document. Only return the extracted text."
Review Comment:
This example uses an API key placeholder that matches a real Anthropic key
prefix (`sk-ant-...`). This can trigger secret scanning and is easy to mistake
for an actual credential. Use a clearly non-key placeholder (e.g.,
`YOUR_ANTHROPIC_API_KEY`).
##########
tika-plugins-core/src/main/java/org/apache/tika/plugins/ThreadSafeUnzipper.java:
##########
@@ -71,6 +71,19 @@ public static void unzipPlugin(Path source) throws IOException {
return;
}
+ // Destination exists but has no completion marker. Possible causes:
+ // a previous extraction was killed mid-stream, the marker was deleted
+ // out from under us, or something other than our extractor put files
+ // there. Without this cleanup the subsequent Files.move() below will
+ // fail with DirectoryNotEmptyException on every run until a human
+ // manually removes the directory. Treat the half-extracted state as
+ // garbage and rebuild.
+ if (Files.exists(destination)) {
+ LOG.warn("destination {} exists without a completion marker; "
+ + "treating as stale partial extraction and removing", destination);
+ deleteRecursively(destination);
+ }
Review Comment:
This new stale-extraction cleanup behavior (deleting an existing destination
directory when the completion marker is missing) is a significant behavioral
change and isn't covered by existing tests in `tika-plugins-core`. Adding a
unit test that creates a destination dir without the marker and asserts
`unzipPlugin()` cleans it up and successfully re-extracts would help prevent
regressions (especially around Windows/DirectoryNotEmptyException scenarios).
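
To illustrate the scenario such a test would cover, here is a standalone sketch of the stale-destination check. It does not call `ThreadSafeUnzipper`; the class `StaleExtractionSketch`, the marker name `.extraction-complete`, and the assumption that the marker lives inside the destination directory are all hypothetical and may differ from the real implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Standalone sketch of the stale-extraction cleanup; not the Tika implementation. */
final class StaleExtractionSketch {

    // Hypothetical marker name; the real marker in ThreadSafeUnzipper may differ.
    static final String MARKER = ".extraction-complete";

    /** Deletes a directory tree, children before parents. */
    static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts deeper paths first, so directories are empty
            // by the time they are deleted.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    /** Removes the destination if it exists without a marker; returns true if removed. */
    static boolean removeIfStale(Path destination) throws IOException {
        if (Files.exists(destination) && !Files.exists(destination.resolve(MARKER))) {
            deleteRecursively(destination);
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Path parent = Files.createTempDirectory("stale-extraction");
        Path destination = parent.resolve("plugin");

        // Simulate a half-extracted plugin: files present, no completion marker.
        Files.createDirectories(destination.resolve("lib"));
        Files.write(destination.resolve("lib").resolve("partial.jar"), new byte[] {1, 2, 3});

        System.out.println("removed stale: " + removeIfStale(destination));
        System.out.println("still exists: " + Files.exists(destination));

        deleteRecursively(parent);
    }
}
```

A real test in `tika-plugins-core` would go one step further and assert that `unzipPlugin()` re-extracts the plugin successfully after the cleanup.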
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.