John Rudd wrote:
> [EMAIL PROTECTED] wrote:
>>> Actually, it didn't.  The assertion is that if someone else hadn't seen 
>>> this exact message first, then SA wouldn't have caught it.
>> No, the assertion is that if someone else hadn't seen prior abuse from
>> the sending host first (not this exact message), then SA wouldn't have
>> caught that particular message. That assertion happens to be true for
>> the blacklists, and true for BAYES as well, since it would have had to
>> have seen headers (the payload being vastly different) resembling this
>> sending host's in the recent past and been told they were spam.
> 
> Your assertion about bayes is not well supported.  It might have been 
> flagged by bayes for reasons that have _NOTHING_ to do with the received 
> headers.

Follow along on this thought experiment. We have a spam run in progress
from fresh zombies that have never sent to your system before. The
payload of the spam run is new and matches no historical spam because
it's just a base64-encoded attachment (nothing for Bayes to tokenize in
the body of the message). The only Bayes fodder in such a message is the
received headers, From/To/Subject, and the MIME boundaries, assuming the
bot adds no headers beyond the minimum (in other words, no User-Agent,
X-headers, etc.). You will *not* be getting a BAYES_90 or BAYES_99 from
that.
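To make the thought experiment concrete, here's a rough sketch in plain Python. This is NOT SpamAssassin's actual Bayes tokenizer, and every address and header value below is made up; it just illustrates how little usable token material such a message offers:

```python
import base64
import re

def crude_tokens(text):
    """Crude word tokenizer -- an illustration only, NOT SpamAssassin's
    real Bayes tokenizer."""
    return set(re.findall(r"[A-Za-z0-9]{3,15}", text))

# The minimal headers a zombie might emit (all values hypothetical).
headers = (
    "Received: from pc123.dynamic.example.net ([192.0.2.45])\n"
    "From: jdoe@example.org\n"
    "To: victim@example.com\n"
    "Subject: Your report\n"
)

# The body is nothing but a base64-encoded attachment: opaque,
# effectively random character runs from a tokenizer's point of view.
body = base64.b64encode(bytes(range(256)) * 8).decode()

# Tokens a Bayes database might have learned from prior spam bodies.
known_spam_tokens = crude_tokens("viagra stock alert pump dump cheap meds buy now")

body_overlap = crude_tokens(body) & known_spam_tokens
print("header tokens:", sorted(crude_tokens(headers)))
print("body tokens matching prior spam:", len(body_overlap))
```

The base64 payload yields tokens that essentially never coincide with anything in the learned corpus, so classification falls back on the headers, which this fresh zombie has never shown your system before.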

>>> The PBL (which isn't spamtrap fed, it's collected from ISP published 
>>> and/or contributed data) would have caught this based upon issues that 
>>> have nothing at all to do with this message, and most likely nothing at 
>>> all to do with this current round of spam.  It would be based upon the 
>>> host provider's policy that this host shouldn't send email to the internet.
>> Which means, some time, in the past, for whatever reasons that
>> particular IP address did something against someone's policy to end up
>> on that list. The important part being "in the past".
> 
> No, it means that the ISP, or possibly the net block user, told Spamhaus 
> "it's an end user IP address, and not a mail server".  There might be 
> _NO_ previous abuse from that IP address, and they'll still be listed. 
> The "policy" here is NOT the recipient's policy, but the sending network 
> owner's policy.

I think you're missing the point of my saying "in the past" in relation
to scoring against blacklists. It doesn't matter why an IP is listed; at
some point in the past that IP had to have been added before you can
match against it.

In the case of the PBL, it contains more than just organization-submitted
netblocks; it also includes ranges added by hand by Spamhaus that appear
to be end-user IP space they have received spam from, which IMO is
probably the majority of the list at this point. Which is why I assert
that hitting the PBL is based on prior abuse (spam sent to the Spamhaus
PBL maintainers) more often than not. Fortunately, IP addresses already
in the PBL tend to stay there since there's no automatic expiration, but
it is still a reactionary list.
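For what it's worth, the lookup mechanics themselves are trivial. A sketch (the IP is from the example/documentation range, and the return-code meanings are from Spamhaus's published documentation) of how a query against zen.spamhaus.org is formed:

```python
def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Reverse the IPv4 octets and append the DNSBL zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone

# For ZEN, an A-record answer of 127.0.0.10 or 127.0.0.11 means a PBL hit.
name = dnsbl_query_name("192.0.2.45")
print(name)  # 45.2.0.192.zen.spamhaus.org

# An actual check would resolve the name, e.g.:
#   try:
#       answer = socket.gethostbyname(name)
#       pbl_hit = answer in ("127.0.0.10", "127.0.0.11")
#   except socket.gaierror:
#       pbl_hit = False   # NXDOMAIN: not listed
```

The point stands either way: the answer you get back reflects decisions and reports made some time before your query.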

> It could have been recent abuse from an entirely different message 
> batch.  In other words, maybe that IP sent a standard stock scam 
> yesterday, and today it sent the pdf spam ... and this person was the 
> first one to receive that pdf spam message.  No previous recipient of 
> the same message.  But they'll still be listed at spamcop.

And? That's still a late-receiver effect: this particular message scored
X points because of what host Y did Z minutes ago, where Z could amount
to days, weeks, or months' worth of minutes.

>> Explain how BAYES will have any matching tokens to work on if it's from
>> a fresh, never before seen by your system, zombie and there's no message
>> body other than the attachment? All you have to work with are headers
>> which you've never seen before and MIME boundaries which you've never
>> seen before.
> 
> There are more headers than just the received headers.  And, I honestly 
> don't know whether or not an attachment's raw data is analyzed by bayes 
> or not.  My assumption is that it is.

Yes, there are From/To/Subject and the MIME boundary. To: is ambiguous
since it will match both ham and spam. From:, Subject:, and the MIME
boundary can be ambiguous, or worse, they'll match common ham in the
case of site-wide Bayes. No other headers need be added. So you're left
with the received headers to make a spam/ham choice from, and your Bayes
database has never seen this host before; good luck with that.

And since I mentioned site-wide Bayes: as I noted more than a year ago,
the practice of spammers using ham (e.g. from common mailing lists) from
the prior week as filler is becoming more and more common. The only
Bayes-side solution to that is per-user Bayes, which only works if your
users have a method of training and actually use it correctly (unlike
AOL users who use the junk button to "unsubscribe"); that pretty much
excludes the option for large installations.

>>> Just resting upon BAYES, BOTNET, and PBL, you're not "lucky to have 
>>> caught the message because you're a late receiver".  You've caught the 
>>> message due to a combination of policy, misuse, and historical 
>>> characteristics of spam in general being used to train your system.
>> All of which needs prior examples/reporting of messages similar to the
>> one you're trying to detect, that's what "historical characteristics of
>> spam" means.
> 
> BOTNET does _NOT_ need prior reporting.  And the prior reporting the PBL 
> requires has nothing to do with abuse.  Further, BAYES does not depend 
> upon the received headers.  But even if you're right about bayes, your 
> claim that "all of which needs prior..." is at least 2/3 wrong, if not 
> 3/3 wrong.

Yes, I failed to exclude BOTNET from that; it's the only score from the
message that started this thread that is solid. The reason is that
BOTNET is proactive, while all the others are either 100% reactionary or
nearly so (the PBL).

Really, the only point I'm trying to make here is that saying "well, X
spam is caught here..." is rather pointless if all it's hitting on your
system is blacklists, message-hash systems (DCC/Razor/Pyzor/etc.), and
Bayes, and it wouldn't have been marked if you excluded them. If you
have a spam like that, it should really be examined for some method of
detection that doesn't rely on those reactionary methods. Those methods
are only good at detecting "old" spam and have a limited probability of
matching "new" spam as the mutation rate increases, which, in case you
hadn't noticed, it is.

Fortunately for this particular spam there is an existing proactive
plugin, FuzzyOCR, that can be massaged to deal with PDF stock image spam.
