I also want to echo Bill's comment: that was a very detailed bug report.

On Fri, Mar 4, 2022, 18:05 Ricky Boone <ricky.bo...@gmail.com> wrote:

> Sorry for the late reply, crazy week.
>
> Honestly, I wasn't expecting such a quick and relevant response, so thanks
> and kudos for that.  :)
>
> I'm not currently using trunk, so I will try to patch in the changes
> described during a quiet period over the weekend.  It does look like that
> should do the trick, though.
>
> On Thu, Mar 3, 2022 at 1:48 AM Bill Cole <
> sausers-20150...@billmail.scconsult.com> wrote:
>
>> On 2022-03-02 at 17:58:50 UTC-0500 (Wed, 2 Mar 2022 17:58:50 -0500)
>> Ricky Boone <ricky.bo...@gmail.com>
>> is rumored to have said:
>>
>> > If this is the wrong forum to report this, let me know.
>>
>> This is fine. I've also documented the fix in our Bugzilla at
>> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7960
>>
>> If you're running the 'trunk' version out of svn, the fix is in there.
>> We do not even have a target date for the next release, but we generally
>> do not break 'trunk' if you're feeling adventurous.
>>
>> If you're a different sort of adventurous, willing to hack on your local
>> copy of the code, the fix is to remove these lines (~223-224) which skip
>> lines based on an antique assumption:
>>
>> -      # lines containing high bytes will have no data we need, so save some cycles
>> -      next if ($line =~ /[\x80-\xff]/);
>>
>> Thank you very much for the detailed analysis. I had seen this problem
>> on some PDFs but have not had the time to dive into the issue. You
>> vastly reduced the pain of fixing it.
>>
>>
>> > I'm trying to create a couple rules to identify questionable PDFs
>> > (phishing, etc.).  While evaluating the debug output from spamassassin
>> > for the pdfinfo plugin, I noticed that some of the test file attributes
>> > aren't being populated correctly, when comparing against exiftool, Adobe
>> > Reader, Firefox, etc.  The producer and creator fields, specifically,
>> > appear to be left as unknown.
>> >
>> > Compared against other emails and PDFs, I get similar results, so I
>> > suspect it's an issue with the plugin or how it is parsing the PDF.  I do
>> > have this example available, however it is malicious (it links to a
>> > phishing site), so I wouldn't want to link to it directly in this thread.
>> >
>> > For example:
>> >
>> > $ less Invoice0098539.pdf
>> > %PDF-1.4
>> > 1 0 obj
>> > <<
>> > /Title (<FE><FF>)
>> > /Creator (<FE><FF>^@w^@k^@h^@t^@m^@l^@t^@o^@p^@d^@f^@ ^@0^@.^@1^@2^@.^@5)
>> > /Producer (<FE><FF>^@Q^@t^@ ^@4^@.^@8^@.^@7)
>>
>> There's the cause. Apparently the use of UTF-16BE encoding with a
>> leading BOM for metadata was not so common when that plugin was written.
>> It saw the BOM and assumed the line was binary data.
>>
>>
>> --
>> Bill Cole
>> b...@scconsult.com or billc...@apache.org
>> (AKA @grumpybozo and many *@billmail.scconsult.com addresses)
>> Not Currently Available For Hire
>>
>
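For anyone following along, here is a rough sketch of why the removed check
dropped those lines. It is written in Python rather than the plugin's Perl,
purely as an illustration. The helper names (skipped_by_old_check,
decode_pdf_text) are made up for this example and are not part of
SpamAssassin, and treating non-BOM strings as Latin-1 is a simplifying
assumption. The Producer bytes are rebuilt from the "Qt 4.8.7" value shown in
the less output above.

```python
import re

# Python rendering of the removed Perl test: next if ($line =~ /[\x80-\xff]/);
HIGH_BYTE = re.compile(rb"[\x80-\xff]")


def skipped_by_old_check(line: bytes) -> bool:
    """Return True if the old heuristic would have discarded this line."""
    return HIGH_BYTE.search(line) is not None


def decode_pdf_text(raw: bytes) -> str:
    """Decode the body of a PDF text string (hypothetical helper).

    A leading FE FF byte order mark means the rest is UTF-16BE.
    Anything else is treated as Latin-1 here, which is a simplification.
    """
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")


# Producer value as in the sample: BOM followed by "Qt 4.8.7" in UTF-16BE.
producer = b"\xfe\xff" + "Qt 4.8.7".encode("utf-16-be")
line = b"/Producer (" + producer + b")"

print(skipped_by_old_check(line))  # True: the BOM bytes are both >= 0x80
print(decode_pdf_text(producer))   # Qt 4.8.7
```

The BOM alone is enough to trip the high-byte test, so the metadata line was
thrown away before the plugin ever looked for the Producer or Creator value.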
