It looks like what my suggested test is actually finding is physical sites
which tend to host large numbers of virtually hosted domains on their web
servers.  Spammers are merely a subset of this group - but the subset I look
at the most.  Jdow's point about very long chains of subdomains is real - it
is too bad that there is no common syntax for "allow anything 1 or N levels
deep", just the "allow anything" case.  Also, Keith "said" subdomains in a
context where it seemed that hosts would be more appropriate (though maybe
he did mean that his users get subdomains, not just virtual hosts - it
certainly takes a few tricks not directly built into Apache to do that, but
it is possible).
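
        For reference, the kind of probe I have in mind can be sketched in a
few lines of Python with the dnspython module (only a sketch - the
random-label trick and the domain name are illustrative assumptions, and the
real test would of course live inside SA itself):

    import secrets

    import dns.exception
    import dns.resolver  # dnspython (2.x API)

    def looks_wildcarded(domain):
        """Guess whether `domain` carries a wildcard record by resolving a
        random label that was almost certainly never explicitly created."""
        probe = secrets.token_hex(10) + "." + domain
        try:
            dns.resolver.resolve(probe, "A")
            return True    # a made-up name resolved: likely a wildcard
        except dns.exception.DNSException:
            return False   # NXDOMAIN, timeout, etc.: no wildcard observed

    print(looks_wildcarded("example.com"))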

        Obviously, I've been biased by looking at more spam domains than clean
ones.  Still, the inherent flaw that a wildcard allows unlimited levels of
indirection is only one argument against its use - the simplicity of sharing
zone files is the best argument for its use in the cases that have come up.

        A few interesting points:  It seems that some of the cases mentioned
may be relying on what the BIND9 documentation describes as bugs in BIND8 and
earlier (e.g. subdomains sharing 'NS' records with parents).  Also, BIND9
makes clear that "wildcards" *cannot* be used with DNSSEC secure zones.  While
no one expects spammers to switch to DNSSEC, it seems that, longer term, all
the cited "legitimate" cases (so far, virtual hosting of large numbers of
domains and/or hosts) would be better served by using "nsupdate" to add hosts
instead of unlimited subdomains, *or* an LDAP interface like "slapd", *or* the
BIND9 database capability.  Also, wildcards are disallowed for "link-local
multicast" (think wireless and/or cheap IPv6 link-local-only devices).  Worth
mentioning is the historic "bug" in the resolver when a wildcarded CNAME was
used in the domain edu.com and all communication between any ".com" and any
".edu" domain was suddenly broken.

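        To illustrate the "nsupdate" alternative, here is a rough sketch of an
RFC 2136 dynamic update done with the dnspython module (the zone name, record
data, and server address are placeholders, and TSIG authentication is left
out):

    import dns.query
    import dns.update  # dnspython

    # Build a dynamic update that adds one explicit host record to the zone,
    # rather than relying on a catch-all wildcard.
    update = dns.update.Update("example.com")
    update.add("newhost", 300, "A", "192.0.2.10")

    # Send it to the zone's primary server.
    response = dns.query.tcp(update, "192.0.2.1", timeout=5)
    print(response.rcode())  # 0 (NOERROR) on success
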
        I guess my only remaining point is this (even though everyone affected
will dislike it): since the majority of all email is spam, and an untested but
likely scenario is that the majority of spam includes wildcarded domains while
an unknown amount of ham does as well (an amount I believe is significant, but
relatively small by comparison), the question becomes not whether the test is
valid - *IT IS* - but what its FP rate is, and what weight should be assigned
to it (clearly it is not going to be in the class of SURBLs, but it would seem
that the amount of email mentioning blog domains at large virtually hosted
sites is vanishingly small).  I would wager that the FP rate is lower than
that of the DNS_FROM_RFC_ABUSE or DNS_FROM_RFC_POST rules, one or the other of
which hits most "free mail" (with the added tag line of something like "Get
your Free email at XYZ.tld") and nearly every cable Internet operator in the
US.  BTW, locally I lower the scores for these (and similar local URI rules),
then use meta rules to recognize when more than one is hit and assign a
slightly higher than default value in those cases - maybe something similar
is appropriate here (e.g. if Bayes > 60% *and* wildcards are used, add X
points, but wildcards alone only score Y points).  I actually use many rules
like this and think that, given time and a larger corpus to check against,
the default SA system should do likewise.
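
        As a toy illustration of that combined scoring idea (the 60% threshold
and the point values X and Y below are made up, and the real thing would be
an SA meta rule rather than Python):

    def wildcard_points(bayes_prob, uri_domain_wildcarded, x=2.0, y=0.5):
        """Wildcarded URI domains alone earn a small score (y), but combined
        with a high Bayes probability they earn more (x)."""
        if not uri_domain_wildcarded:
            return 0.0
        return x if bayes_prob > 0.60 else y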

        Anyway, we have clearly found a common case (physically large sites
with large numbers of virtually hosted domains/sites) which will FP on a rule
such as the one I originally proposed.  Still, I don't think it comes anywhere
close to the "Middle Initial" rule - a test mass-check run would quickly show
whether there is any merit.  I would expect a very high SPAM%, a small but
significant HAM%, and an S/O ratio we can only guess at.  I would also expect
a low "overlap" rate, which would argue for the value of meta-rules to reduce
the cost of FPs.


        Paul Shupak
        [EMAIL PROTECTED]
