I also want to mirror Bill's comment: that was a very detailed bug report.

On Fri, Mar 4, 2022, 18:05 Ricky Boone <ricky.bo...@gmail.com> wrote:
> Sorry for the late reply, crazy week.
>
> Honestly, I wasn't expecting such a quick and relevant response, so thanks
> and kudos for that. :)
>
> I'm not currently using trunk, so I will try to patch in the changes
> described during a quiet period over the weekend. It does look like that
> should do the trick, though.
>
> On Thu, Mar 3, 2022 at 1:48 AM Bill Cole <
> sausers-20150...@billmail.scconsult.com> wrote:
>
>> On 2022-03-02 at 17:58:50 UTC-0500 (Wed, 2 Mar 2022 17:58:50 -0500)
>> Ricky Boone <ricky.bo...@gmail.com>
>> is rumored to have said:
>>
>> > If this is the wrong forum to report this, let me know.
>>
>> This is fine. I've also documented the fix in our Bugzilla at
>> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7960
>>
>> If you're running the 'trunk' version out of svn, the fix is in there.
>> We do not even have a target date for the next release, but we generally
>> do not break 'trunk' if you're feeling adventurous.
>>
>> If you're a different sort of adventurous, willing to hack on your local
>> copy of the code, the fix is to remove these lines (~223-224) which skip
>> lines based on an antique assumption:
>>
>> - # lines containing high bytes will have no data we need, so save some cycles
>> - next if ($line =~ /[\x80-\xff]/);
>>
>> Thank you very much for the detailed analysis. I had seen this problem
>> on some PDFs but have not had the time to dive into the issue. You
>> vastly reduced the pain of fixing it.
>>
>>
>> > I'm trying to create a couple rules to identify questionable PDFs
>> > (phishing, etc.). While evaluating the debug output from spamassassin
>> > for the pdfinfo plugin, I noticed that some of the test file attributes
>> > aren't being populated correctly, when comparing against exiftool,
>> > Adobe Reader, Firefox, etc. The producer and creator fields,
>> > specifically, appear to be left as unknown.
>> >
>> > Compared against other emails and PDFs, I get similar results, so I
>> > suspect it's an issue with the plugin or how it is parsing the PDF. I
>> > do have this example available, however it is malicious (it links to a
>> > phishing site), so I wouldn't want to link to it directly in this
>> > thread.
>> >
>> > For example:
>> >
>> > $ less Invoice0098539.pdf
>> > %PDF-1.4
>> > 1 0 obj
>> > <<
>> > /Title (<FE><FF>)
>> > /Creator (<FE><FF>^@w^@k^@h^@t^@m^@l^@t^@o^@p^@d^@f^@ ^@0^@.^@1^@2^@.^@5)
>> > /Producer (<FE><FF>^@Q^@t^@ ^@4^@.^@8^@.^@7)
>>
>> There's the cause. Apparently the use of UTF-16BE encoding with a
>> leading BOM for metadata was not so common when that plugin was written.
>> It saw the BOM and assumed the line was binary data.
>>
>>
>> --
>> Bill Cole
>> b...@scconsult.com or billc...@apache.org
>> (AKA @grumpybozo and many *@billmail.scconsult.com addresses)
>> Not Currently Available For Hire
>>
>
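
For anyone reading this thread later, here is a rough illustration of the cause Bill describes. It is a Python sketch, not the plugin's Perl code: the helper name pdf_text_string and the sample line are made up for the example, and the Latin-1 fallback is only a stand-in for PDFDocEncoding. It shows that a /Producer line starting with the UTF-16BE BOM (FE FF) trips a high-byte check like the one that was removed, even though the value decodes to ordinary text.

```python
import re

# Same byte range as the check removed from the plugin: any byte >= 0x80.
HIGH_BYTE = re.compile(rb"[\x80-\xff]")


def pdf_text_string(raw: bytes) -> str:
    """Decode a PDF metadata string (hypothetical helper for this sketch).

    A leading FE FF byte-order mark means the rest is UTF-16BE.
    Otherwise fall back to Latin-1, a simplification of PDFDocEncoding.
    """
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")


# Rebuild a line shaped like the /Producer entry quoted above.
producer_value = b"\xfe\xff" + "Qt 4.8.7".encode("utf-16-be")
line = b"/Producer (" + producer_value + b")"

# The old check sees the BOM bytes and would have skipped the whole line.
skipped_by_old_check = HIGH_BYTE.search(line) is not None

# Without that check, the value between the parentheses is recoverable.
match = re.search(rb"/Producer \((.*)\)", line, re.DOTALL)
producer = pdf_text_string(match.group(1))

print(skipped_by_old_check, producer)
```

The sample ignores escaped parentheses and hex strings, which a real parser would have to handle.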