On 2022-03-02 at 17:58:50 UTC-0500 (Wed, 2 Mar 2022 17:58:50 -0500)
Ricky Boone <ricky.bo...@gmail.com>
is rumored to have said:
If this is the wrong forum to report this, let me know.
This is fine. I've also documented the fix in our Bugzilla at
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7960
If you're running the 'trunk' version out of svn, the fix is in there.
We do not even have a target date for the next release, but we generally
do not break 'trunk' if you're feeling adventurous.
If you're a different sort of adventurous, willing to hack on your local
copy of the code, the fix is to remove these lines (~223-224) which skip
lines based on an antique assumption:
- # lines containing high bytes will have no data we need, so save some cycles
- next if ($line =~ /[\x80-\xff]/);
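To illustrate why that check discards the metadata, here is a minimal sketch in Python (not the plugin's Perl). It applies the same byte class, [\x80-\xff], to a metadata line. The helper name would_be_skipped and the ASCII comparison value are hypothetical; the "Qt 4.8.7" string comes from the sample PDF quoted in this thread.

```python
import re

# Python equivalent of the removed Perl test /[\x80-\xff]/, applied to raw bytes.
HIGH_BYTE = re.compile(rb"[\x80-\xff]")

# A /Producer line whose value is UTF-16BE with a leading BOM (0xFE 0xFF),
# built from the "Qt 4.8.7" value shown in the sample PDF.
producer_line = b"/Producer (" + b"\xfe\xff" + "Qt 4.8.7".encode("utf-16-be") + b")"

# A plain ASCII metadata line for comparison (hypothetical value).
ascii_line = b"/Producer (Example Producer 1.0)"


def would_be_skipped(line: bytes) -> bool:
    """Return True if the old high-byte check would have skipped this line."""
    return HIGH_BYTE.search(line) is not None


if __name__ == "__main__":
    print(would_be_skipped(producer_line))
    print(would_be_skipped(ascii_line))
```

The two BOM bytes alone are enough to trigger the match, so the whole line is dropped before the field value is ever examined.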
Thank you very much for the detailed analysis. I had seen this problem
on some PDFs but have not had the time to dive into the issue. You
vastly reduced the pain of fixing it.
I'm trying to create a couple rules to identify questionable PDFs
(phishing, etc.). While evaluating the debug output from spamassassin
for the pdfinfo plugin, I noticed that some of the test file attributes
aren't being populated correctly, when comparing against exiftool,
Adobe Reader, Firefox, etc. The producer and creator fields,
specifically, appear to be left as unknown.
Compared against other emails and PDFs, I get similar results, so I
suspect it's an issue with the plugin or how it is parsing the PDF. I
do have this example available, however it is malicious (it links to a
phishing site), so I wouldn't want to link to it directly in this
thread.
For example:
$ less Invoice0098539.pdf
%PDF-1.4
1 0 obj
<<
/Title (<FE><FF>)
/Creator (<FE><FF>^@w^@k^@h^@t^@m^@l^@t^@o^@p^@d^@f^@ ^@0^@.^@1^@2^@.^@5)
/Producer (<FE><FF>^@Q^@t^@ ^@4^@.^@8^@.^@7)
There's the cause. Apparently the use of UTF-16BE encoding with a
leading BOM for metadata was not so common when that plugin was written.
It saw the BOM and assumed the line was binary data.
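For anyone who wants to read such fields by hand, a rough sketch of decoding a PDF text string in Python follows. This is an illustration only, not what the plugin does: the function name decode_pdf_text_string is hypothetical, Latin-1 is used as an approximate stand-in for PDFDocEncoding (the two differ at some code points), and escape sequences inside the parentheses are ignored.

```python
import codecs


def decode_pdf_text_string(raw: bytes) -> str:
    """Decode the bytes found between the parentheses of a PDF text string.

    A leading UTF-16BE BOM (0xFE 0xFF) marks the rest as UTF-16BE.
    Otherwise fall back to Latin-1, which is only an approximation
    of PDFDocEncoding (assumption made for this sketch).
    """
    bom = codecs.BOM_UTF16_BE  # b"\xfe\xff"
    if raw.startswith(bom):
        return raw[len(bom):].decode("utf-16-be")
    return raw.decode("latin-1")


# Byte values as shown by less in the sample: <FE><FF> followed by
# NUL-prefixed ASCII characters (^@ is a 0x00 byte).
creator_raw = codecs.BOM_UTF16_BE + "wkhtmltopdf 0.12.5".encode("utf-16-be")
producer_raw = b"\xfe\xff\x00Q\x00t\x00 \x004\x00.\x008\x00.\x007"

if __name__ == "__main__":
    print(decode_pdf_text_string(creator_raw))
    print(decode_pdf_text_string(producer_raw))
```

A /Title consisting of nothing but the BOM, as in the sample, decodes to an empty string.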
--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire