I've been experimenting with how well SpamAssassin (SA) performs when the amount of data it sees is limited by truncating the message before it is fed to SA. The results of those experiments are given below.
Given a sample of 1543 bona fide spam messages trapped over the past week, the message sizes (including headers) are distributed as follows:

    Percentile   Size (bytes)
        0%            774
       10%           1481
       20%           1806
       30%           2132
       40%           2560
       50%           3040
       60%           3687
       70%           4168
       80%           5655
       90%           8723
      100%         236579

Thus 90% of the messages are at most 8723 bytes long. The 95th percentile is about 15000 bytes.

On the first trial, I arbitrarily set the message limit to 15000 bytes, thus truncating only the top 5% of the message sample. Further, I fed only the *first* 15000 bytes through to SA. In procmail, something like this:

    # If the message is less than 15000 bytes in length,
    # handle it the regular way.
    :0fw
    * < 15000
    | spamassassin

    # If the message length is >= 15000, feed the first 15000 bytes
    # to SA. The "i" qualifier is necessary, to tell procmail it's
    # okay that the entire message is not passed through the filter.
    # The echo is there to make sure that last line has a newline.
    # (note the E qualifier means "else".)
    :0Efiw
    | (head -c 15000; echo "") | spamassassin

The script above is for normal message processing and delivery. For the experiment, the script actually used ran only local checks, with no auto-whitelisting, to remove those as variable factors, and then deposited any false negatives into a mailbox:

    # Remove any SA headers/reports
    :0fw
    | spamassassin -d

    :0:
    * > 14999
    * ? (head -c 15000; echo "") | spamassassin -L -e
    big-spam-missed

The resulting file, big-spam-missed, was compared against a baseline file produced by a similar script, but without any message truncation:

    # Remove any SA headers/reports
    :0fw
    | spamassassin -d

    :0:
    * > 14999
    * ? spamassassin -L -e
    big-spam-all-missed

The baseline case has a few false negatives, because it runs with only local checks and no auto-whitelisting, while the original sample was compiled with those options enabled.
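For anyone wanting to build a similar size table for their own spam sample, here is a rough sketch. It is not the method used for the numbers above (that isn't recorded here); it assumes one message per file in a directory, uses the nearest-rank definition of a percentile, and the function name size_deciles is made up for illustration.

```shell
#!/bin/sh
# size_deciles DIR: print "<percentile>% <size in bytes>" for the 0%, 10%, ...,
# 100% points of the file sizes in DIR.
# Assumptions (not from the original post): one message per file,
# nearest-rank percentiles.
size_deciles() {
    dir=$1
    for f in "$dir"/*; do
        # Skip anything that is not a regular file
        [ -f "$f" ] && wc -c < "$f"
    done | sort -n | awk '
        { size[NR] = $1 }
        END {
            if (NR == 0) exit 1
            for (p = 0; p <= 100; p += 10) {
                # Nearest-rank index: ceil(p * NR / 100), clamped to at least 1
                i = int((p * NR + 99) / 100)
                if (i < 1) i = 1
                printf "%d%% %d\n", p, size[i]
            }
        }'
}
```

Called as `size_deciles /path/to/spam-sample`, it prints eleven lines in the same "percentile size" layout as the table. A sample stored as a single mbox file would have to be split into individual messages first.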
After running this initial test, I found that truncating the large messages to 15000 bytes and passing only those bytes to SpamAssassin produced some *additional* false negatives. Thus, if we truncate the message and pass only the initial 15000 bytes, we lose vital information and generate some false negatives.

After thinking this over a bit, I decided that the situation might be improved by taking the *first* 7500 bytes and the *last* 7500 bytes of the message and passing the result to SA. The thinking here is that some spam may contain detectable spam content at the *end* of the message as well as at the beginning. The filter looks like this:

    :0:
    * > 14999
    * ? (head -c 7500; echo ""; tail -c 7500) | spamassassin -L -e
    big-spam-missed

(It's probably important to add the newline after the beginning text, to keep from generating a spurious "long line" diagnostic.)

By using excerpts from both the first and last parts of the message, *no additional false negatives* were generated above the baseline case. The same (correct) results were also achieved with a 5000-byte cutoff, which was at about the 80% mark of the spam sample I was using. However, the remaining 20% of the sample represented 40% of the total bytes.

I haven't tried this test on much smaller message excerpts, but will do so. Given the encouraging results at the 5000-byte cutoff, it might be possible for SA to make correct determinations on even smaller excerpts.

I was a little concerned that chopping up the message and feeding it to SA might cause problems, because attachment boundaries might be missing or might not match. Also, Content-Length is clearly going to be wrong, and so is the Lines: header. Still, even at the 5K cutoff, SpamAssassin gave correct results and didn't crash on a malformed mail message.
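One simple way to put a number on "additional false negatives above the baseline" is to count the messages in the two mailboxes and subtract. The sketch below assumes both files are in mbox format with the usual "From " separator lines (body lines beginning with "From " are normally escaped by the program writing the mbox); the function names count_mbox and extra_false_negatives are made up for illustration.

```shell
#!/bin/sh
# count_mbox FILE: print the number of messages in an mbox file by counting
# "From " separator lines. A missing file counts as zero messages.
count_mbox() {
    if [ -f "$1" ]; then
        # grep -c prints 0 but exits non-zero when nothing matches
        grep -c '^From ' "$1" || true
    else
        echo 0
    fi
}

# extra_false_negatives TRUNCATED BASELINE: print how many more messages
# were missed in the truncated run than in the baseline run.
extra_false_negatives() {
    truncated=$(count_mbox "$1")
    baseline=$(count_mbox "$2")
    echo $((truncated - baseline))
}
```

With the file names used here, that would be `extra_false_negatives big-spam-missed big-spam-all-missed`. A result of 0 only shows that the counts match; checking that the same messages were missed in both runs would need a comparison of Message-ID headers or similar.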
Conclusion: It is possible to feed SA only part of a long message and still have it make an accurate determination of the content, but the excerpt must contain both the first part and the last part of the message. Passing only the first part of the message loses vital information and causes SA to make incorrect decisions.

Suggestion: change the behavior of "spamc -s msg_size" so that it sends the first msg_size/2 bytes of the file (with a newline added if necessary), concatenated with the last msg_size/2 bytes of the file. Also, add a similar -s option to spamassassin, to make it easier and more transparent to feed SA an excerpt from the message.
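To make the suggested excerpting behavior concrete, here is a minimal shell sketch of it. This is not spamc code and not an existing option; excerpt_message is a hypothetical name, and spooling the message to a temporary file is an assumption made so that head and tail each see the whole message regardless of how a particular head implementation buffers a pipe.

```shell
#!/bin/sh
# excerpt_message SIZE: read a message on stdin and write to stdout either
# the whole message (if it is no longer than SIZE bytes) or its first SIZE/2
# bytes followed by its last SIZE/2 bytes.
excerpt_message() {
    size=$1
    half=$((size / 2))
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    # Arithmetic expansion strips any padding that wc may print
    total=$(($(wc -c < "$tmp")))
    if [ "$total" -le "$size" ]; then
        cat "$tmp"
    else
        head -c "$half" "$tmp"
        # Add a newline only if the first part does not already end with one
        # (command substitution yields an empty string for a lone newline)
        if [ -n "$(head -c "$half" "$tmp" | tail -c 1)" ]; then
            echo
        fi
        tail -c "$half" "$tmp"
    fi
    rm -f "$tmp"
}
```

Saved as a script, it could stand in for the `(head -c 7500; echo ""; tail -c 7500)` pipeline, for example `excerpt_message 15000 | spamassassin`. Unlike that pipeline, it adds the separating newline only when needed, so the output can be one byte longer than SIZE but never more.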