Tuc at T-B-O-H.NET wrote:
If you are proposing some kind of checksums or other types of 'message
identifying' techniques on the messages,  those few mistyped addresses
could certainly make a difference for your site.  What if bongo's mom
mistypes it as bungo, realizes her mistake, and resends it to bongo a few
minutes later?  It is quite likely that the valid message will now be
rejected, since it's (almost) identical to the one your proposed
system just marked as spam.  What if bongo signs up for a mailing
list and mistypes his own email address (yes, this happens)?  Now your
system marks all list mailings as spam, so everyone using your system
starts losing their copies of the mailing list messages too.

        Bango said that if his mom can't spell his name right, he doesn't
care if he gets her emails. :)

Fair enough (he can discard delivered mail anyway). But I've seen a lot of people subscribe to services with a mistyped address (their own) and then call us to complain that they didn't get the confirmation request...

Anyway, your "corpus" is probably usable provided one uses heuristics to avoid hitting possible ham (for example, by computing a distance between the recipient address and your valid addresses to make sure the recipient address is not simply mistyped, etc.). But I still believe it should be "reduced" by rejecting mail at SMTP time and only keeping some selected "trap" addresses (for example /[EMAIL PROTECTED]/ to catch attempts to use a phone-like address).
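
To illustrate the distance heuristic, here is a minimal sketch in Python. It assumes Levenshtein (edit) distance on the local part of the address and a threshold of 2; the thread names neither a metric nor a threshold, and the function names and example.com addresses are made up for illustration.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def looks_mistyped(recipient: str, valid_users: Iterable[str],
                   max_distance: int = 2) -> bool:
    """True if the local part is close to, but not equal to, a valid user.

    The threshold is an assumption; tune it against real traffic.
    """
    local = recipient.split("@", 1)[0].lower()
    return any(0 < edit_distance(local, user.lower()) <= max_distance
               for user in valid_users)


if __name__ == "__main__":
    users = ["bongo"]
    # A near miss should be kept out of the spam corpus.
    print(looks_mistyped("bungo@example.com", users))
    # A dictionary-style guess is far from every valid user.
    print(looks_mistyped("sales12345@example.com", users))
```

Mail to an address that looks mistyped would then be excluded from the corpus rather than treated as spam.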

        I'm not proposing anything. I originally wanted to see if there
was some way that these 120,000 emails that don't go to a valid/usable
end user could be used to help the community out in some way. I had 2
filtering systems agree to do something with them, but for reasons I'd
rather not share, neither one worked out. (One still may; I'm not
sure, and I'm waiting to hear back.)

        We also don't do sitewide Bayes/etc. We do it per received user.
For this domain, it just happens that all 4 users of the domain
constitute a single received user. I realize that collectively this
list could propose well over 5000 sensible reasons why "good" mail
could be part of that 120,000. I just didn't think the ever so
insignificant percentage mattered. Given how much spam gets through,
and how much good mail gets marked bad too, I thought this was "acceptable".

I think you have good intentions but the source of your data is flawed
for anything but maybe limited statistical training.  Unfortunately it
probably is not great for that either, since the mail you are seeing
for nonexistent users is probably not at all similar to the mix of
spam you get to real accounts.  The scanner would end up biased
towards whatever junk the spammers desperate enough to use
dictionaries send, which would drown out the stats from those spams
that are actually difficult to detect.

        Ok, very valid point that makes a lot of sense. Thank you.

Why do you accept messages for nonexistent accounts?  You're wasting
bandwidth, regardless of what you do or don't do with the junk after
you accept it.  From the sound of it you could reduce your mail
bandwidth to a tiny fraction of what it is now by just refusing this
stuff (which is what most everyone else does, AFAIK).
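
As a rough sketch of what refusing at SMTP time amounts to, the decision at RCPT TO could look like the Python below. It assumes the host answering SMTP has a copy of the valid recipient list, which may not hold for every MX setup. The function name, the user sets, and the trap address are hypothetical and do not come from any particular mail server.

```python
from typing import Set, Tuple


def rcpt_reply(recipient: str, valid_users: Set[str],
               trap_users: Set[str]) -> Tuple[int, str]:
    """Pick an SMTP reply for one RCPT TO address.

    valid_users and trap_users hold lowercase local parts; both are
    assumed to be available on the host that answers SMTP.
    """
    local = recipient.rsplit("@", 1)[0].lower()
    if local in valid_users:
        return 250, "2.1.5 Ok"
    if local in trap_users:
        # Deliberately accepted so the message can feed a spam corpus.
        return 250, "2.1.5 Ok"
    # Everything else is refused before the message body is transferred.
    return 550, "5.1.1 User unknown"


if __name__ == "__main__":
    valid = {"bongo"}
    traps = {"trap-address"}  # hypothetical trap local part
    for rcpt in ("bongo@example.com", "trap-address@example.com",
                 "sales12345@example.com"):
        print(rcpt, rcpt_reply(rcpt, valid, traps))
```

Because the 550 is sent in reply to RCPT TO, the sender never gets to transmit the message data, which is where the bandwidth saving comes from.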

        How do you do it on MX hosts? I realize that if I stop
the wildcard acceptance and stop copying errors to postmaster that
I can do it on the destination server. However, due to circumstances
out of my control for the next few months, all email arrives at the
main mail server via MXs ONLY.

                Thanks, Tuc
