Hello,
I am new to Tika, not a developer, and not subscribed to this list
(please keep me on CC), but I think I may have found a bug that I'd
like to share. I only have access to Tika 2.9.1, so this may be an old
issue, in which case I apologise up front.
The issue is with the attached email. If processed as is, Tika
correctly classifies it as `message/rfc822`. However, if I add
another `x` to the `X-Test` header, or change the name of the header
to e.g. `X-Test2`, the message will be classified as `text/html`,
which is wrong.
It turns out that `X-` headers (yes, I know they are deprecated,
but…) throw off the detection algorithm if the header name and its
contents together exceed 138 characters.
Note that it does not matter whether I fold the header across multiple
lines or not, so for simplicity, the attached message just has it on
a single line (which technically goes against an RFC SHOULD).
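To make the 138-character threshold described above easier to reproduce, the following is a minimal Python sketch that generates pairs of test messages just below and just above the limit. It is not the attached message: the `From`, `To`, `Subject` and `Content-Type` headers, the HTML body, and the helper names `build_message` and `write_variants` are all hypothetical placeholders, and the sketch assumes that the "sum" counts only the header name plus its value, without the `: ` separator.

```python
from pathlib import Path

# Hypothetical placeholder body; the real attached message may differ.
BODY = "<html><body>This is a DMS test mail</body></html>"


def build_message(header_name: str, total_len: int) -> str:
    """Build a minimal mail where len(header name) + len(value) == total_len.

    Assumption: the ': ' separator is not counted towards the total.
    """
    value_len = total_len - len(header_name)
    if value_len < 1:
        raise ValueError("total_len must exceed the header name length")
    lines = [
        "From: sender@example.org",        # placeholder address
        "To: recipient@example.org",       # placeholder address
        "Subject: DMS test",               # placeholder subject
        "Content-Type: text/html; charset=utf-8",
        f"{header_name}: {'x' * value_len}",  # unfolded, on a single line
    ]
    # Headers and body are separated by one empty line, CRLF line endings.
    return "\r\n".join(lines) + "\r\n\r\n" + BODY + "\r\n"


def write_variants(out_dir, header_name="X-Test", lengths=(138, 139)):
    """Write one .eml file per requested total length; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n in lengths:
        path = out / f"{header_name.lower()}-{n}.eml"
        path.write_bytes(build_message(header_name, n).encode("ascii"))
        paths.append(path)
    return paths
```

Calling `write_variants("out")` would produce `out/x-test-138.eml` and `out/x-test-139.eml`, which can then be passed to Tika's detector one by one (for example via the tika-app jar's `--detect` option) to see where the reported type flips from `message/rfc822` to `text/html`. Whether these synthetic messages trigger the same behaviour as the attached one is untested.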
I hope this is a trivial bug that is easy to write a test for and to track down.
Thank you for your work on Tika!
Best,
--
martin krafft | https://matrix.to/#/#madduck:madduck.net
if you are walking on thin ice, you might as well dance!
--- Begin Message ---
This is a DMS test mail
--
Dr. Martin Krafft
Projektleiter (EDV)
--- End Message ---