Hi Gary,

If I may ask, is this set in a user_prefs file for a specific user, or is it domain-wide? Where is the conf file located? Sorry for the direct mail.
* ----------------------------------------- *
* Steve Schofield
* [EMAIL PROTECTED]
*
* Microsoft MVP - ASP.NET
* http://www.aspfree.com
* ----------------------------------------- *

----- Original Message -----
From: "Gary Funck" <[EMAIL PROTECTED]>
To: "Spamassassin List" <[EMAIL PROTECTED]>
Sent: Tuesday, May 27, 2003 10:54 PM
Subject: [SAtalk] Some observations on doing more with less

>
> I've been experimenting with the effectiveness of limiting the amount of
> data that SA uses to make its decisions - by cutting back on the message
> size fed to SA. The results of those experiments are given below.
>
> Given a sample of 1543 bona fide spam messages that have been trapped
> over the past week, the message sizes, including headers, are distributed
> as follows:
>
>     0%      774
>    10%     1481
>    20%     1806
>    30%     2132
>    40%     2560
>    50%     3040
>    60%     3687
>    70%     4168
>    80%     5655
>    90%     8723
>   100%   236579
>
> Thus 90% of the messages are less than or equal to 8723 bytes in length.
> The 95th percentile is about 15000 bytes.
>
> On the first trial, I arbitrarily decided to set the message limit to 15000,
> thus truncating only the top 5% of the message sample. Further, I fed
> only the *first* 15000 bytes through to SA. In procmail, something like
> this:
>
> # If the message is less than 15000 bytes in length,
> # handle it the regular way.
>
> :0fw
> * < 15000
> | spamassassin
>
> # If the message length is >= 15000, feed the first 15000 bytes
> # to SA. The "i" qualifier is necessary, to tell procmail it's
> # okay that the entire message is not passed through the filter.
> # The echo is there to make sure that last line has a newline.
> # (note the E qualifier means "else".)
>
> :0Efiw
> | (head -c 15000; echo "") | spamassassin
>
>
> The script above is for normal message processing/delivery.
> For the experiment, the actual script used only local checks, with no
> auto-whitelisting, to remove those as variable factors, and then deposited
> any false negatives into a mailbox:
>
> # Remove any SA headers/reports
> :0fw
> | spamassassin -d
>
> :0:
> * > 14999
> * ? (head -c 15000; echo "") | spamassassin -L -e
> big-spam-missed
>
> The resulting file, big-spam-missed, was compared against a baseline file
> where a similar script was run, but without any message truncation:
>
> # Remove any SA headers/reports
> :0fw
> | spamassassin -d
>
> :0:
> * > 14999
> * ? spamassassin -L -e
> big-spam-all-missed
>
> The baseline case has a few false negatives, because we're running with only
> local checks and no auto-whitelisting, while the original sample was
> compiled with those options enabled.
>
> After running this initial test, I found that truncating the large
> messages to 15000 bytes and passing only those bytes to Spamassassin
> produced some *additional* false negatives.
>
> Thus, if we truncate the message and pass only the initial 15000 bytes, we
> lose vital information, and we generate some false negatives.
>
> After thinking this over a bit, I decided that perhaps the situation might
> be improved by taking the *first* 7500 bytes and the *last* 7500 bytes of
> the message and passing the result to SA. The thinking here is that some
> spam may contain detectable spam content at the *end* of the message, as
> well as the beginning.
>
> The filter looks like this:
>
> :0:
> * > 14999
> * ? (head -c 7500; echo ""; tail -c 7500) | spamassassin -L -e
> big-spam-missed
>
> (It's probably important to add the newline to the beginning text in
> order to keep from generating a spurious "long line" diagnostic.)
>
> By using excerpts from both the first and last parts of the message,
> *no additional false negatives* were generated above the baseline
> case.
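
For anyone who wants to try the head-plus-tail excerpt outside of procmail, here is a minimal Python sketch of the same idea. It is only an illustration: the function name head_tail_excerpt and its limit parameter are made up for this example and are not part of SA or spamc. It assumes the whole message is already in memory as bytes.

```python
def head_tail_excerpt(message: bytes, limit: int = 15000) -> bytes:
    """Return a head-plus-tail excerpt of a mail message.

    Messages shorter than `limit` bytes are returned unchanged.
    Otherwise the result is the first limit//2 bytes, a newline,
    and the last limit//2 bytes, mirroring
    (head -c 7500; echo ""; tail -c 7500) for limit=15000.
    """
    if limit < 2:
        # With limit < 2 each half would be empty, so refuse.
        raise ValueError("limit must be at least 2")
    if len(message) < limit:
        return message
    half = limit // 2
    # The newline keeps the head and tail from joining into one long line.
    return message[:half] + b"\n" + message[-half:]
```

The output could then be piped to something like "spamassassin -L -e", as in the recipes above. Note that the excerpt is one byte longer than the limit because of the added newline.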
>
> The same (correct) results were also achieved with a 5000 byte cutoff, which
> was about the 80% mark on the spam sample that I was using. However, the
> remaining 20% of the sample represented 40% of the total bytes.
>
> I haven't tried this test on much smaller message excerpts but will do so.
> Given the encouraging results at the 5000 byte cut off, it might be possible
> for SA to make correct determinations on even smaller message excerpts.
>
> I was a little concerned that chopping up the message and
> feeding it to SA might cause problems because attachment boundaries might
> be missing, and/or might not match. Also, Content-Length is clearly going
> to be wrong, and so is the LINES: descriptor. Still, even at the 5K cutoff,
> Spamassassin gave correct results, and didn't crash on a malformed mail
> message.
>
> Conclusion: It is possible to feed SA only part of a long message,
> and have it make an accurate determination of the content, but an
> excerpt must contain both the first part and last part of the message.
> Passing only the first part of the message loses
> vital information and causes SA to make incorrect decisions.
>
> Suggestion: change the behavior of "spamc -s msg_size" so that it
> sends msg_size/2 (with a newline added if necessary) from the beginning
> of the file, concatenated with the last msg_size/2 bytes of the file.
> Also, add a similar -s option to Spamassassin, to make it easier
> and more transparent to feed SA an excerpt from the message.
>
>
> -------------------------------------------------------
> This SF.net email is sponsored by: ObjectStore.
> If flattening out C++ or Java code to make your application fit in a
> relational database is painful, don't do it! Check out ObjectStore.
> Now part of Progress Software.
> http://www.objectstore.net/sourceforge
> _______________________________________________
> Spamassassin-talk mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/spamassassin-talk