On 2022-03-02 at 17:58:50 UTC-0500 (Wed, 2 Mar 2022 17:58:50 -0500)
Ricky Boone <ricky.bo...@gmail.com>
is rumored to have said:

> If this is the wrong forum to report this, let me know.

This is fine. I've also documented the fix in our Bugzilla at https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7960

If you're running the 'trunk' version out of svn, the fix is in there. We do not even have a target date for the next release, but we generally do not break 'trunk' if you're feeling adventurous.

If you're a different sort of adventurous, willing to hack on your local copy of the code, the fix is to remove these lines (~223-224) which skip lines based on an antique assumption:

- # lines containing high bytes will have no data we need, so save some cycles
-      next if ($line =~ /[\x80-\xff]/);

Thank you very much for the detailed analysis. I had seen this problem on some PDFs but have not had the time to dive into the issue. You vastly reduced the pain of fixing it.


> I'm trying to create a couple of rules to identify questionable PDFs (phishing, etc.). While evaluating the debug output from spamassassin for the pdfinfo plugin, I noticed that some of the test file attributes aren't being populated correctly when compared against exiftool, Adobe Reader, Firefox, etc. The producer and creator fields, specifically, appear to be left as unknown.
>
> Compared against other emails and PDFs, I get similar results, so I suspect it's an issue with the plugin or how it is parsing the PDF. I do have this example available; however, it is malicious (it links to a phishing site), so I wouldn't want to link to it directly in this thread.
>
> For example:
>
> $ less Invoice0098539.pdf
> %PDF-1.4
> 1 0 obj
> <<
> /Title (<FE><FF>)
> /Creator (<FE><FF>^@w^@k^@h^@t^@m^@l^@t^@o^@p^@d^@f^@ ^@0^@.^@1^@2^@.^@5)
> /Producer (<FE><FF>^@Q^@t^@ ^@4^@.^@8^@.^@7)

There's the cause. Apparently the use of UTF-16BE encoding with a leading BOM for metadata was not so common when that plugin was written. The BOM bytes (0xFE 0xFF) are both high bytes, so the plugin matched them, assumed the line was binary data, and skipped it.
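
To illustrate the point, here is a minimal sketch in Python (not the plugin's Perl, and not the committed fix). It rebuilds the /Producer line from the 'less' output above, applies the same high-byte test the old code used, and decodes the value when it starts with the UTF-16BE BOM. The names HIGH_BYTE, decode_pdf_text and producer_line are made up for this example, the latin-1 fallback is only a stand-in for PDFDocEncoding, and escape sequences inside PDF strings are ignored.

```python
import re

# Same test the old code applied to each line: any byte in 0x80-0xff.
HIGH_BYTE = re.compile(rb"[\x80-\xff]")

UTF16BE_BOM = b"\xfe\xff"


def decode_pdf_text(raw: bytes) -> str:
    """Decode the bytes found between the parentheses of a PDF literal string.

    Hypothetical helper for illustration; it does not handle backslash escapes.
    """
    if raw.startswith(UTF16BE_BOM):
        # Strip the BOM, then decode the rest as big-endian UTF-16.
        return raw[len(UTF16BE_BOM):].decode("utf-16-be")
    # No BOM: assume single-byte text (latin-1 stands in for PDFDocEncoding).
    return raw.decode("latin-1")


# Rebuilt from the 'less' output: <FE><FF> is the BOM, each ^@ is a NUL byte.
producer_value = UTF16BE_BOM + "Qt 4.8.7".encode("utf-16-be")
producer_line = b"/Producer (" + producer_value + b")"

# The BOM bytes are high bytes, so the old check matches and the line is skipped.
skipped_by_old_check = HIGH_BYTE.search(producer_line) is not None

# With the skip removed, the value can be decoded normally.
producer = decode_pdf_text(producer_value)
```

Here skipped_by_old_check is true and producer is the string "Qt 4.8.7", which matches the symptom reported: the metadata was present in the file, but the line carrying it was never examined.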


--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
