Sorry for the late reply, crazy week.

Honestly, I wasn't expecting such a quick and relevant response, so thanks
and kudos for that.  :)

I'm not currently using trunk, so I will try to patch in the changes
described during a quiet period over the weekend.  It does look like that
should do the trick, though.

On Thu, Mar 3, 2022 at 1:48 AM Bill Cole <
sausers-20150...@billmail.scconsult.com> wrote:

> On 2022-03-02 at 17:58:50 UTC-0500 (Wed, 2 Mar 2022 17:58:50 -0500)
> Ricky Boone <ricky.bo...@gmail.com>
> is rumored to have said:
>
> > If this is the wrong forum to report this, let me know.
>
> This is fine. I've also documented the fix in our Bugzilla at
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7960
>
> If you're running the 'trunk' version out of svn, the fix is in there.
> We do not even have a target date for the next release, but we generally
> do not break 'trunk' if you're feeling adventurous.
>
> If you're a different sort of adventurous, willing to hack on your local
> copy of the code, the fix is to remove these lines (~223-224) which skip
> lines based on an antique assumption:
>
> -      # lines containing high bytes will have no data we need, so save some cycles
> -      next if ($line =~ /[\x80-\xff]/);
>
> Thank you very much for the detailed analysis. I had seen this problem
> on some PDFs but have not had the time to dive into the issue. You
> vastly reduced the pain of fixing it.
>
>
> > I'm trying to create a couple rules to identify questionable PDFs
> > (phishing, etc.).  While evaluating the debug output from spamassassin
> > for the pdfinfo plugin, I noticed that some of the test file attributes
> > aren't being populated correctly, when comparing against exiftool, Adobe
> > Reader, Firefox, etc.  The producer and creator fields, specifically,
> > appear to be left as unknown.
> >
> > Compared against other emails and PDFs, I get similar results, so I
> > suspect it's an issue with the plugin or how it is parsing the PDF.  I do
> > have this example available, however it is malicious (it links to a
> > phishing site), so I wouldn't want to link to it directly in this thread.
> >
> > For example:
> >
> > $ less Invoice0098539.pdf
> > %PDF-1.4
> > 1 0 obj
> > <<
> > /Title (<FE><FF>)
> > /Creator (<FE><FF>^@w^@k^@h^@t^@m^@l^@t^@o^@p^@d^@f^@ ^@0^@.^@1^@2^@.^@5)
> > /Producer (<FE><FF>^@Q^@t^@ ^@4^@.^@8^@.^@7)
>
> There's the cause. Apparently the use of UTF-16BE encoding with a
> leading BOM for metadata was not so common when that plugin was written.
> It saw the BOM and assumed the line was binary data.
>
>
> --
> Bill Cole
> b...@scconsult.com or billc...@apache.org
> (AKA @grumpybozo and many *@billmail.scconsult.com addresses)
> Not Currently Available For Hire
>
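
For anyone following the thread, here is a rough sketch of why the old
high-byte check dropped these fields and what decoding them involves. It is
written in Python purely for illustration and is not the plugin's Perl code.
The names HIGH_BYTE, decode_pdf_text and extract_field are made up for this
example. It also makes simplifying assumptions: the whole field sits on one
line, backslash escapes inside the PDF string are ignored, and strings
without a BOM are read as Latin-1, which only approximates PDFDocEncoding.

```python
import re

# Same byte class as the check that was removed from the plugin.
HIGH_BYTE = re.compile(rb"[\x80-\xff]")


def decode_pdf_text(raw: bytes) -> str:
    """Decode the bytes found between the parentheses of a PDF text string."""
    if raw.startswith(b"\xfe\xff"):
        # A leading FE FF byte-order mark means the rest is UTF-16BE.
        return raw[2:].decode("utf-16-be", errors="replace")
    # Assumption: treat everything else as Latin-1 (close to, but not
    # identical with, PDFDocEncoding).
    return raw.decode("latin-1")


def extract_field(line: bytes, name: bytes):
    """Naive, illustration-only extraction of '/Name (value)' from one line."""
    match = re.search(rb"/" + re.escape(name) + rb"\s*\((.*)\)", line, re.DOTALL)
    if match is None:
        return None
    return decode_pdf_text(match.group(1))


# Rebuild the two metadata lines from the sample PDF quoted above.
producer_line = b"/Producer (\xfe\xff" + "Qt 4.8.7".encode("utf-16-be") + b")"
creator_line = b"/Creator (\xfe\xff" + "wkhtmltopdf 0.12.5".encode("utf-16-be") + b")"

# The BOM bytes are >= 0x80, so the old check would have skipped both lines.
print(HIGH_BYTE.search(producer_line) is not None)  # True
print(extract_field(producer_line, b"Producer"))     # Qt 4.8.7
print(extract_field(creator_line, b"Creator"))       # wkhtmltopdf 0.12.5
```

The point is only that the BOM itself trips the [\x80-\xff] test, so a line
carrying perfectly readable metadata was thrown away before any field
matching happened.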
