On Friday, June 18, 2004, 5:13:27 AM, Markus wrote:

MG> Maybe Pete can provide some tips what would be good combinations.
MG> Like IP4R + SNIFFER = good because SNIFFER makes no DNS lookups
MG> But not FILTERX + SNIFFER because SNIFFER checks for this already.

That's a tough one. SNIFFER is intended to be comprehensive and continues to grow in that way every day. Another way to say that is that SNIFFER already has a lot of overlap with the tests that are available in Declude and other packages. (Our goal is for SNIFFER to be as comprehensive, efficient, and dynamic as possible.)

For example, a recent R&D process that has been added to SNIFFER is cross-referencing with SBL/ROKSO. As a result, when a spam hits our spamtraps and matches SBL/ROKSO we will encode that IP range rather than the single IP. This gives the rulebase some efficiency because a handful of rules covers many hundreds (or thousands) of IPs. Note: this does not mean that SNIFFER is any substitute for SBL or any other list - far from it. We use many resources when researching and developing our rulebase, and we do not aggregate other services (at least not yet).

The best way to look at SNIFFER's overlaps with other resources is as an additional vote of confidence - this is how we add value. A match that overlaps another service indicates that not only did that service's R&D develop that rule, but SNIFFER's R&D did also. The overlaps are therefore a stronger indication than where there are no overlaps.

Another example of an overlap is bogons (IPs that are not usable). There are lists and tests for these. Most of the known bogons are included in SNIFFER.

Another example is the basic IP rule process - each message that hits a spamtrap is generally coded for content rules and also for its individual source IP. As a result we encode a large volume of zombies in near real-time. (Note that nothing gets coded without review - so in order for an IP rule to be coded it must at least hit a clean spamtrap and be recognized as spam, and in general it will also match one or more DNSBLs.)
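To illustrate why one range rule is more efficient than a batch of single-IP rules, here is a minimal Python sketch. The network and addresses are made up for illustration - this is not SNIFFER's actual rule format.

```python
# One range rule vs. 256 single-IP rules covering the same addresses.
# The 192.0.2.0/24 block is a documentation range, used here as a stand-in.
import ipaddress

range_rule = ipaddress.ip_network("192.0.2.0/24")  # a single range rule
single_rules = {ipaddress.ip_address("192.0.2.%d" % i) for i in range(256)}

probe = ipaddress.ip_address("192.0.2.77")
print(probe in range_rule)       # membership via one rule
print(probe in single_rules)     # same answer via 256 separate rules
print(range_rule.num_addresses)  # 256 addresses covered by the one rule
```

Both checks return True; the point is that the single range entry covers every address the 256 individual rules do.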
There are R&D processes for broken or spamware-generated headers such as the recent 9[2 variety. ... I could go on for quite a while, but the point is that there is a lot of overlap with other tests, and the overlap is likely to continue to grow over time. After all, people are constantly finding new ways to identify spam... and so are we... so we're going to land on the same ground quite a bit.

A case in point - the recent development of SURBL is based on a premise that has been at the core of SNIFFER since it began many years ago. While SURBL cannot capture variations and URI patterns the way SNIFFER does, there is clearly a lot of overlap. While SNIFFER is able to capture a much broader spectrum of URIs than SURBL, there are still many cases where SURBL might detect the URI before we do.

With SURBL and all other lists you should make an effort to determine if the test provides sufficient benefit for your system. In many cases, SNIFFER will be strong enough that the other test is not needed. You should always try things and see how they perform on your system before making a decision.

As you point out - one key piece of the equation will be performance. For example, it has been reported that SNIFFER's accuracy is comparable to SpamAssassin "right out of the box". SNIFFER typically scans a message in 100ms or so, while there are frequent reports on the SA list of SA requiring on the order of 10 seconds or more! This is largely due to SA's heavy use of DNSBLs, but also to the fact that SNIFFER's pattern matching engine is superior to SA's.

If you find that a DNSBL test has a high overlap with SNIFFER then it's probably a good decision to drop the DNSBL test in exchange for better performance. If you have a number of content rules and you use SNIFFER then you should strongly consider moving those rules into SNIFFER (let us know and we can code them for you).
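One way to make that "high overlap" judgment concrete is to tally, from your own logs, how often a DNSBL fires on messages SNIFFER already caught. The sketch below assumes a hypothetical per-message log line of "msgid TEST TEST ..." - the format and test names are illustrative, not Declude's actual log layout.

```python
# Estimate how often a DNSBL test only confirms SNIFFER vs. adds unique hits.
# Log format is hypothetical: "msgid TESTNAME TESTNAME ..." per message.

def overlap_report(lines, primary="SNIFFER", other="SOMEDNSBL"):
    both = only_other = 0
    for line in lines:
        tests = set(line.split()[1:])
        if other in tests:
            if primary in tests:
                both += 1          # DNSBL hit that SNIFFER also caught
            else:
                only_other += 1    # DNSBL hit SNIFFER missed
    total = both + only_other
    return both, only_other, (both / total if total else 0.0)

log = [
    "m1 SNIFFER SOMEDNSBL",
    "m2 SNIFFER",
    "m3 SOMEDNSBL",
    "m4 SNIFFER SOMEDNSBL",
]
both, unique, ratio = overlap_report(log)
print(both, unique, round(ratio, 2))  # 2 1 0.67
```

If the overlap ratio runs close to 1.0 over a good sample, the DNSBL is mostly confirming what SNIFFER already caught and is a candidate for dropping on performance grounds.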
Just keep in mind that, in general, the current design of SNIFFER is a collection of white/black rules - there's not a lot of room for "fuzzy" rules and there is no internal weighting system (that's in the next version).

-- So what to recommend? I think that the best approach is to use SNIFFER as a strongly weighted opinion, and to use other tests to finally tip the balance. Many of our users tell us that SNIFFER plus any other test = spam on their system. *

Even better, and strongly dependent upon your own system's requirements, you may find that many rule groups in SNIFFER are sufficient to hold or even delete a message, while others may require the addition of another test to push a message over the edge. The other test could be a DNSBL or a specialized content filter. Since SNIFFER is very efficient (now scanning messages in under 200ms consistently) and very accurate (most report > 95% accuracy) you might implement parts of SNIFFER so that if a match is found no other tests are run - or at least not expensive ones. If you leave the riskier SNIFFER tests out of this policy then you can have a very effective system that can handle a very high volume.

-- I think that what Matt does is very interesting. The content rules that he codes cover some current weaknesses in SNIFFER. For example, the current pattern matching engine in SNIFFER doesn't have a strong mechanism for counterbalancing one pattern against another. (An upcoming version solves this problem.) So Matt's rules tend to cover this ground very well by playing tricks with Declude's weighting system.

Another example is header analysis. SNIFFER is good at recognizing when something is there, but the current version doesn't "know" or can't "see" when something is missing. Some of Declude's tests cover this ground. There are other counterbalancing mechanisms that are not yet included in SNIFFER.
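The "strongly weighted opinion plus one more vote" policy above can be sketched in a few lines. The weights and thresholds here are illustrative only - they are not recommended Declude settings, just numbers chosen so that SNIFFER alone delivers while SNIFFER plus any other test crosses the hold line.

```python
# Sketch of a weighted-vote policy: SNIFFER is a strong vote that needs
# one more test to tip the balance. All weights/thresholds are made up.

WEIGHTS = {"SNIFFER": 8, "SOMEDNSBL": 5, "HEADERTEST": 4}
HOLD_AT, DELETE_AT = 10, 15

def score(hits):
    return sum(WEIGHTS.get(t, 0) for t in hits)

def action(hits):
    s = score(hits)
    if s >= DELETE_AT:
        return "delete"
    if s >= HOLD_AT:
        return "hold"
    return "deliver"

print(action({"SNIFFER"}))                             # deliver (8 < 10)
print(action({"SNIFFER", "SOMEDNSBL"}))                # hold (13 >= 10)
print(action({"SNIFFER", "SOMEDNSBL", "HEADERTEST"}))  # delete (17 >= 15)
```

In practice you would tune the weights so that no single test can delete on its own, which is exactly the "vote of confidence" framing above.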
For example, automatic white-listing is a strong addition that can mitigate false positives - particularly on systems that are not business oriented... (Sometimes, friends and lovers talk dirty to each other - and SNIFFER is likely to capture that, since the majority of SNIFFER's users are business oriented.)

--- There are a lot of folks reporting extremely strong results with SNIFFER. The vast majority of them employ as many tests as they can get and balance them off each other. Each implementation is a bit different - so there does not seem to be a single "best practice" for this. There are even some systems that use SNIFFER almost exclusively - though these tend to be very specialized. I think the way I would go with Declude is to find an example configuration that comes from a system that is close to your own and then tweak it for best performance. That seems to work well, and it seems to be the Declude way.

--- * There is an emerging challenge with our Experimental rule group. The Experimental rule group contains Experimental Received IPs. Early in this group's development there were quite a few false positives. While the false positive rate for this group has fallen off quite a bit, there are still a few from time to time. However, this group has become quite strong lately - especially at identifying new zombies. We frequently identify new zombies before they are listed in other services. Unfortunately, since it is typical for at least one other service to list an IP before that hit can be trusted, a great deal of spam may get through before blocking becomes effective.

Another challenge with this group is that all broad and unusual heuristics are coded there. These rules can be very effective at capturing spam and malware-generated probes, etc... however, they are "risky" because they do not match precise patterns that can be verified in any corpus. This creates a challenge when setting an appropriate weight for this group.
On the one hand, these can be some of the most effective rules in the system. On the other hand, they can also be a high FP risk due to something unknown. ++

An example is a recent rule that matched messages from a popular Mac email client. The rule was coded for a number of spam that showed what appeared to be an obvious forgery in the Received: headers. When we tested the rule on our corpus it looked good. Unfortunately we didn't have any messages from this Mac client in our ham corpus - and the rule began to produce a very unusual and focused string of false positives. We detected the problem and removed the rule quickly (as always), but the false positives did occur nonetheless. This is a pretty good real-world representation of the risk involved in these kinds of rules and why we code them in the Experimental group.

Generally, it is time to increase the weight of the Experimental rule group since the FP rate has fallen quite a bit... but finding the correct balance is going to require some trial and error, and will depend on each system's ability to absorb some risk. Systems that hold messages for a period of time have a much easier time with this kind of risk. For example, holding messages based on the Experimental group is probably reasonable for many systems, since the holding bin provides a good recovery mechanism if an FP problem is detected.

--- ++ One of the weird problems we've been facing lately is that SNIFFER is so accurate it tends to be weighted very highly. As a result, our customers have become extremely sensitive to any errors that might occur. For example, it's common discussion that SpamCop might list a major ISP... but since this kind of error is relatively common, SpamCop is generally weighted so that the error can be tolerated. Since this kind of thing is extremely rare with SNIFFER, a similar error - no matter how short lived - can cause significant problems.
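The Mac-client incident above is the argument for vetting every candidate rule against a ham corpus as well as a spam corpus before it ships. A minimal sketch of that check, with a made-up pattern and tiny made-up corpora - real rulebase vetting is of course far more involved:

```python
# Vet a candidate pattern against spam AND ham corpora before deployment.
# Pattern and corpora are illustrative; real vetting uses large corpora.
import re

def vet_rule(pattern, spam_corpus, ham_corpus):
    rx = re.compile(pattern)
    spam_hits = sum(1 for msg in spam_corpus if rx.search(msg))
    ham_hits = sum(1 for msg in ham_corpus if rx.search(msg))
    # Reject any rule that fires on ham, no matter how well it does on spam.
    return {"spam_hits": spam_hits, "ham_hits": ham_hits,
            "accept": ham_hits == 0 and spam_hits > 0}

spam = ["Received: from forged.example by x",
        "Received: from forged.example by y"]
ham = ["Received: from mail.example by z"]
print(vet_rule(r"forged\.example", spam, ham))
```

The catch, as the anecdote shows, is that this check is only as good as the ham corpus: a message type absent from the ham set can slip past it, which is exactly why broad heuristics live in the Experimental group.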
On the other side of the coin, if SNIFFER stops working for some reason (such as a bad, unchecked rulebase) then the resulting torrent of spam can bring a system to its knees! We have largely eliminated the possibility of this kind of failure through strong integrity checking tools and upgrades to our servers, but the underlying message cannot be overstated, and it is one of the core design strategies in Declude: strength through diversity.

--- Sorry for the length... Hope this helps,

_M

--- [This E-mail was scanned for viruses by Declude Virus (http://www.declude.com)]

--- This E-mail came from the Declude.JunkMail mailing list. To unsubscribe, just send an E-mail to [EMAIL PROTECTED], and type "unsubscribe Declude.JunkMail". The archives can be found at http://www.mail-archive.com.
