Sorry for the late reply, crazy week. Honestly, I wasn't expecting such a quick and relevant response, so thanks and kudos for that. :)
I'm not currently using trunk, so I will try to patch in the changes described during a quiet period over the weekend. It does look like that should do the trick, though.

On Thu, Mar 3, 2022 at 1:48 AM Bill Cole <sausers-20150...@billmail.scconsult.com> wrote:

> On 2022-03-02 at 17:58:50 UTC-0500 (Wed, 2 Mar 2022 17:58:50 -0500)
> Ricky Boone <ricky.bo...@gmail.com>
> is rumored to have said:
>
> > If this is the wrong forum to report this, let me know.
>
> This is fine. I've also documented the fix in our Bugzilla at
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7960
>
> If you're running the 'trunk' version out of svn, the fix is in there.
> We do not even have a target date for the next release, but we generally
> do not break 'trunk' if you're feeling adventurous.
>
> If you're a different sort of adventurous, willing to hack on your local
> copy of the code, the fix is to remove these lines (~223-224) which skip
> lines based on an antique assumption:
>
> -    # lines containing high bytes will have no data we need, so save some cycles
> -    next if ($line =~ /[\x80-\xff]/);
>
> Thank you very much for the detailed analysis. I had seen this problem
> on some PDFs but have not had the time to dive into the issue. You
> vastly reduced the pain of fixing it.
>
> > I'm trying to create a couple rules to identify questionable PDFs
> > (phishing, etc.). While evaluating the debug output from spamassassin for
> > the pdfinfo plugin, I noticed that some of the test file attributes aren't
> > being populated correctly, when comparing against exiftool, Adobe Reader,
> > Firefox, etc. The producer and creator fields, specifically, appear to be
> > left as unknown.
> >
> > Compared against other emails and PDFs, I get similar results, so I suspect
> > it's an issue with the plugin or how it is parsing the PDF. I do have this
> > example available, however it is malicious (it links to a phishing site),
> > so I wouldn't want to link to it directly in this thread.
> >
> > For example:
> >
> > $ less Invoice0098539.pdf
> > %PDF-1.4
> > 1 0 obj
> > <<
> > /Title (<FE><FF>)
> > /Creator (<FE><FF>^@w^@k^@h^@t^@m^@l^@t^@o^@p^@d^@f^@ ^@0^@.^@1^@2^@.^@5)
> > /Producer (<FE><FF>^@Q^@t^@ ^@4^@.^@8^@.^@7)
>
> There's the cause. Apparently the use of UTF-16BE encoding with a
> leading BOM for metadata was not so common when that plugin was written.
> It saw the BOM and assumed the line was binary data.
>
> --
> Bill Cole
> b...@scconsult.com or billc...@apache.org
> (AKA @grumpybozo and many *@billmail.scconsult.com addresses)
> Not Currently Available For Hire
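
For anyone following along, here is a rough sketch of the failure mode described in the quoted message. It is written in Python rather than the plugin's Perl, and it is not the plugin's actual code. The names HIGH_BYTE, decode_pdf_text and extract_field are made up for illustration, and the Latin-1 fallback is an assumption chosen for simplicity. It only shows why a line starting its value with the UTF-16BE BOM trips a high-byte check, and how the same line decodes once it is no longer skipped.

```python
import re

# Python analogue of the removed Perl check:
#     next if ($line =~ /[\x80-\xff]/);
HIGH_BYTE = re.compile(rb"[\x80-\xff]")

UTF16BE_BOM = b"\xfe\xff"


def decode_pdf_text(raw: bytes) -> str:
    """Decode the body of a PDF metadata string (hypothetical helper)."""
    if raw.startswith(UTF16BE_BOM):
        # A leading FE FF marks the rest of the string as UTF-16BE.
        return raw[len(UTF16BE_BOM):].decode("utf-16-be", errors="replace")
    # Assumption: everything else is treated as Latin-1 for this sketch.
    return raw.decode("latin-1")


def extract_field(line: bytes, key: bytes):
    """Return the decoded value of '/Key (value)' on one line, or None.

    Simplification: escaped parentheses and multi-line strings are ignored.
    """
    match = re.search(rb"/" + re.escape(key) + rb"\s*\((.*)\)", line)
    return decode_pdf_text(match.group(1)) if match else None


# Rebuild a line like the /Producer entry shown in the `less` output.
producer_line = b"/Producer (" + UTF16BE_BOM + "Qt 4.8.7".encode("utf-16-be") + b")"

# The old check matches because of the BOM bytes, so the line was skipped.
skipped_by_old_check = HIGH_BYTE.search(producer_line) is not None

# Without the skip, the value can be decoded normally.
producer = extract_field(producer_line, b"Producer")

print(skipped_by_old_check, repr(producer))
```

The BOM bytes FE and FF both fall in the \x80-\xff range, which is why the whole line was treated as binary and the producer and creator fields stayed unknown.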