Re: [spamassassin] Re: How to report 120,000 spams

mouss Sun, 09 Mar 2008 19:02:33 -0700

Tuc at T-B-O-H.NET wrote:

If you are proposing some kind of checksums or other types of 'message
identifying' techniques on the messages,  those few mistyped addresses
could certainly make a difference for your site.   What if bongo's mom
mistypes to bungo, realizes her mistake and resends it to bongo a few
minutes later.  It is quite likely that the valid message will be
rejected now since it's (almost) identical to the one your proposed
system just marked as spam.  What if bongo signs up for a mailing
list and mistypes his own email address (yes, this happens).  Now your
system marks all list mailings as spam, so everyone using your system
starts losing their copies of the mailing list messages too?

        Bango said that if his mom can't spell his name right, he doesn't
care if he gets her emails. :)

fair enough (he can also discard delivered mail anyway). but I've seen
a lot of people subscribing to services with a mistyped address (their
own) and then calling us to complain that they didn't get the
confirmation request...

anyway, your "corpus" is probably usable provided one uses heuristics to
avoid hitting possible ham (for example, by computing a distance between
the recipient address and your valid addresses to make sure the
recipient address is not mistyped, ... etc). but I still believe it
should be "reduced" by rejecting mail at smtp time and only keeping some
selected "trap" addresses (for example /[EMAIL PROTECTED]/ to catch
attempts to use a phone-like address).
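
The distance heuristic mentioned above could look roughly like the
following. This is only a sketch: the thread does not say which distance
to use, so plain Levenshtein edit distance is assumed here, and the names
edit_distance, near_valid_address and max_distance, as well as the
example.com addresses, are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def near_valid_address(recipient: str, valid_addresses, max_distance: int = 1) -> bool:
    """True if the recipient's local part is within max_distance edits of
    a valid address in the same domain (assumed threshold, tune per site)."""
    local, _, domain = recipient.lower().partition("@")
    for valid in valid_addresses:
        v_local, _, v_domain = valid.lower().partition("@")
        if v_domain == domain and edit_distance(local, v_local) <= max_distance:
            return True
    return False


valid = ["bongo@example.com"]
# A likely typo of a real user: keep it out of the spam corpus.
print(near_valid_address("bungo@example.com", valid))
# Far from any real user: a more plausible trap candidate.
print(near_valid_address("sales123@example.com", valid))
```

Mail whose recipient is "near" a real address would be left out of the
corpus, since it may be ham sent to a mistyped address.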

        I'm not proposing anything. I originally wanted to see if there
was some way that these 120,000 emails that don't go to a valid/usable
end user could be used to help the community out in some way. I had 2
filtering systems agree to do something with them, but for reasons I'd
rather not share, neither one worked out. (One may still; I'm not
sure, and I'm waiting to hear back.)

        We also don't do sitewide Bayes/etc. We do it per received user.
For this domain, it just happens that all 4 users of the domain
constitute a single received user. I realize that collectively this
list could propose well over 5000 reasons that make sense why "good"
mail could be part of that 120,000. I just didn't think the ever so
insignificant percentage mattered. For as much as spam gets through,
and good mail gets marked bad also, I thought this was "acceptable".

I think you have good intentions but the source of your data is flawed
for anything but maybe limited statistical training.  Unfortunately it
probably is not great for that either, since the mail you are seeing
for non existent users is probably not at all similar to the mix of
spam you get to real accounts.  The scanner would end up biased
towards whatever junk the spammers desperate enough to use
dictionaries send, which would drown out the stats from those spams
that are actually difficult to detect.

        Ok, very valid point that makes a lot of sense. Thank you.

Why do you accept messages for non existent accounts?  You're wasting
bandwidth, regardless of what you do or don't do with the junk after
you accept it.  From the sound of it you could reduce your mail
bandwidth to a tiny fraction of what it is now by just refusing this
stuff (which is what most everyone else does, AFAIK).

        How do you do it on MX hosts? I realize that if I stop
the wildcard acceptance and stop copying errors to postmaster that
I can do it on the destination server. However, due to circumstances
out of my control for the next few months, all email arrives to the
main mail server via MXs ONLY.

                Thanks, Tuc
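
On the question of refusing unknown recipients on the MX hosts: in
general the MX needs its own copy of the list of valid recipients so it
can answer at RCPT TO time instead of accepting and relaying everything.
The sketch below only illustrates that decision. The names rcpt_reply
and VALID_RECIPIENTS and the example.com addresses are hypothetical, how
the list gets synced to the MX is left out, and addresses are compared
case-insensitively, which is an assumption about local policy.

```python
# Hypothetical recipient list, assumed to be exported from the main
# mail server and copied to each MX host.
VALID_RECIPIENTS = {"bongo@example.com", "postmaster@example.com"}


def rcpt_reply(recipient: str, valid_recipients=VALID_RECIPIENTS):
    """Return the (code, text) an MX could give to RCPT TO.

    250 accepts the recipient; 550 with enhanced status 5.1.1 refuses it
    during the SMTP dialogue, so the message body is never transferred.
    """
    if recipient.strip().lower() in valid_recipients:
        return 250, "2.1.5 Recipient OK"
    return 550, "5.1.1 User unknown"


print(rcpt_reply("Bongo@example.com"))
print(rcpt_reply("bungo@example.com"))
```

With something like this in place, the wildcard acceptance is no longer
needed on the MX side, and mail for non existent users is refused before
it uses any bandwidth beyond the envelope.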
