Copilot commented on code in PR #2844:
URL: https://github.com/apache/tika/pull/2844#discussion_r3319658781


##########
docs/modules/ROOT/pages/pipes/troubleshooting.adoc:
##########
@@ -112,6 +143,55 @@ When the watcher fires, the child exits via `System.exit`, which runs
 `AbstractExternalProcessParser`'s shutdown hook and cleans up any
 in-flight external subprocesses.
 
+== Log levels and sensitive data
+
+Tika Pipes treats `FetchKey` and `EmitKey` values as potentially sensitive --
+they typically contain file paths, URLs, object-store keys, or other identifiers
+that may be private to the data owner. The convention across pipes core and the
+bundled plugins is:
+
+[cols="1,3"]
+|===
+|Level |What is logged
+
+|`ERROR` / `WARN`
+|Failures, exceptions, and configuration problems. *Never* the literal
+ `fetchKey`/`emitKey` or any file content. When a failure refers to a
+ specific document, it is identified by the non-sensitive `FetchEmitTuple.id`
+ (e.g. `parse exception: id=abc-123`).
+
+|`INFO`
+|Lifecycle events -- server start/stop, plugin start/stop, mode banners,
+ restart events. Per-document or per-request lines have been demoted from
+ INFO to DEBUG so production logs stay quiet.
+
+|`DEBUG`
+|Per-document progress and aggregated counts (e.g. `pipesClientId=2,
+ status=PARSE_SUCCESS`, `successfully emitted N docs`). Safe to enable in
+ production for troubleshooting; correlation is by `FetchEmitTuple.id` only.
+
+|`TRACE`
+|Verbose per-fetch and per-emit detail including the literal
+ `fetchKey`/`emitKey` (URL, S3 key, blob path, etc.). Enable only when you
+ need to correlate a Tika log line back to a specific resource, and accept
+ that those keys will appear in the log destination.
+|===
+
+The fetcher and emitter SPIs (`Fetcher.fetch`, `Emitter.emit`,
+`StreamEmitter.emit`) receive the literal key but not the tuple id, so
+plugin code can only log the literal key. Keeping that at TRACE keeps it
+out of any log destination that is configured at DEBUG or higher.

Review Comment:
   This documentation currently overstates the new logging convention: several 
bundled emitters still log literal output paths/keys at DEBUG (for example 
`S3Emitter.emit` logs `path` at DEBUG), so telling users DEBUG is safe and only 
TRACE contains literal keys is inaccurate. Either demote the remaining DEBUG 
key/path logs to TRACE or qualify this guidance until the code matches it.
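
To illustrate the first option (demoting key/path logs), here is a minimal, self-contained sketch. It is not the Tika code: it uses `java.util.logging` as a dependency-free stand-in, with `FINE` playing the role of DEBUG and `FINEST` the role of TRACE, and the class `KeyLogLevelSketch`, its methods, and the sample key are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class KeyLogLevelSketch {

    /** Collects messages in memory so the effect of the threshold is visible. */
    static final class CapturingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (isLoggable(record)) {
                messages.add(record.getLevel() + " " + record.getMessage());
            }
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    /** Hypothetical emit step: key-free progress at FINE, the literal key only at FINEST. */
    static void emit(Logger log, String emitKey) {
        log.log(Level.FINE, "emit succeeded");
        log.log(Level.FINEST, "emitted key=" + emitKey);
    }

    /** Runs emit() with both the logger and the handler set to the given threshold. */
    static List<String> capture(Level threshold, String emitKey) {
        Logger log = Logger.getAnonymousLogger();
        log.setUseParentHandlers(false); // keep output out of the console handler
        log.setLevel(threshold);
        CapturingHandler handler = new CapturingHandler();
        handler.setLevel(threshold);
        log.addHandler(handler);
        emit(log, emitKey);
        return handler.messages;
    }

    public static void main(String[] args) {
        String key = "s3://bucket/private/report.pdf"; // made-up key
        System.out.println(capture(Level.FINE, key));   // DEBUG analogue
        System.out.println(capture(Level.FINEST, key)); // TRACE analogue
    }
}
```

With the threshold at the DEBUG analogue, only the key-free progress line is captured; the literal key appears only once the threshold is lowered to the TRACE analogue. That is the behavior the documentation describes, and it holds only if every key/path log statement in the bundled plugins is actually at TRACE.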
   



##########
docs/modules/ROOT/pages/pipes/troubleshooting.adoc:
##########
@@ -112,6 +143,55 @@ When the watcher fires, the child exits via `System.exit`, which runs
 `AbstractExternalProcessParser`'s shutdown hook and cleans up any
 in-flight external subprocesses.
 
+== Log levels and sensitive data
+
+Tika Pipes treats `FetchKey` and `EmitKey` values as potentially sensitive --
+they typically contain file paths, URLs, object-store keys, or other identifiers
+that may be private to the data owner. The convention across pipes core and the
+bundled plugins is:
+
+[cols="1,3"]
+|===
+|Level |What is logged
+
+|`ERROR` / `WARN`
+|Failures, exceptions, and configuration problems. *Never* the literal
+ `fetchKey`/`emitKey` or any file content. When a failure refers to a
+ specific document, it is identified by the non-sensitive `FetchEmitTuple.id`

Review Comment:
   The `WARN` guidance is not accurate yet: existing bundled HTTP-style 
fetchers can still include the target URL/fetch key in warning messages (for 
example connection-shutdown warnings in `HttpFetcher`/`AtlassianJwtFetcher`). 
Please either update those remaining logs or soften the “Never” claim so 
operators do not rely on a guarantee the code does not currently provide.


