Hello,

I am new to Tika, not a developer, and not subscribed to this list (please keep me on CC), but I think I may have found a bug that I'd like to share. I only have access to Tika 2.9.1, so this may be an old issue, in which case I apologise up front.

The issue is with the attached email. If processed as is, Tika correctly classifies it as `message/rfc822`. However, if I add another `x` to the `X-Test` header, or change the name of the header to e.g. `X-Test2`, the message will be classified as `text/html`, which is wrong.

It turns out that the `X-Headers` (yes, I know they are deprecated, but…) throw off the detection algorithm if the header name and the contents together exceed 138 characters.

Note that it does not matter whether I fold the header across multiple lines or not, so for simplicity, the attached message just has it on a single line (which technically goes against the RFC's SHOULD).
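For anyone who wants to probe the threshold without hand-editing the attachment, here is a minimal sketch in Python that builds two otherwise identical messages, one at 138 and one at 139 characters. Everything in it is an assumption for illustration: the function names, the `From`/`To`/`Subject` values, and the choice to count the length as header name plus value (without the colon and space). It does not reproduce the headers of the attached message, so it may or may not trigger the same behaviour; substituting the real headers is the safer route.

```python
from pathlib import Path


def build_message(total_len: int, name: str = "X-Test") -> bytes:
    """Build a minimal single-part mail whose X- header name plus value
    is exactly total_len characters (colon and space not counted)."""
    if total_len <= len(name):
        raise ValueError("total_len must exceed the header name length")
    value = "x" * (total_len - len(name))
    lines = [
        "From: sender@example.org",      # hypothetical address
        "To: recipient@example.org",     # hypothetical address
        "Subject: DMS test",             # hypothetical subject
        f"{name}: {value}",              # kept on one line, not folded
        "",                              # blank line ends the header block
        "This is a DMS test mail",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def write_pair(directory: Path, threshold: int = 138) -> list[Path]:
    """Write one message at the threshold and one a single character above it."""
    paths = []
    for total in (threshold, threshold + 1):
        path = directory / f"x-test-{total}.eml"
        path.write_bytes(build_message(total))
        paths.append(path)
    return paths
```

Calling `write_pair(Path("."))` leaves `x-test-138.eml` and `x-test-139.eml` in the current directory. Each file can then be passed to Tika's type detection (for example the `--detect` option of the tika-app jar) to compare the reported types.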

Hope this is a trivial bug, easy to write a test for and track down. Thank you for your work on Tika!

Best,

--
martin krafft | https://matrix.to/#/#madduck:madduck.net
if you are walking on thin ice, you might as well dance!
spamtraps: madduck.bo...@madduck.net
--- Begin Message ---
This is a DMS test mail

--
Dr. Martin Krafft Projektleiter (EDV)
--- End Message ---
