Re: [SAtalk] Some observations on doing more with less

Steve Schofield Tue, 27 May 2003 21:25:03 -0700

Hi Gary,

If I may ask, is this set in a user_prefs file for a specific user, or
domain-wide? Where is the conf file located?
Sorry for the direct mail.


*  ----------------------------------------- *
*  Steve Schofield
*  [EMAIL PROTECTED]
*
*  Microsoft MVP - ASP.NET
*  http://www.aspfree.com
*  ----------------------------------------- *
----- Original Message ----- 
From: "Gary Funck" <[EMAIL PROTECTED]>
To: "Spamassassin List" <[EMAIL PROTECTED]>
Sent: Tuesday, May 27, 2003 10:54 PM
Subject: [SAtalk] Some observations on doing more with less


>
> I've been experimenting with the effectiveness of limiting the amount of
> data that SA uses to make its decisions, by cutting back on the message
> size fed to SA. The results of those experiments are given below.
>
> Given a sample of 1543 bona fide spam messages that have been trapped
> over the past week, the message sizes, including headers, are distributed
> as follows (percentile, size in bytes):
>
>        0%    774
>       10%   1481
>       20%   1806
>       30%   2132
>       40%   2560
>       50%   3040
>       60%   3687
>       70%   4168
>       80%   5655
>       90%   8723
>      100% 236579
>
> Thus 90% of the messages are less than or equal to 8723 bytes in length.
> The 95th percentile is about 15000 bytes.
>
> On the first trial, I arbitrarily decided to set the message limit to
> 15000, thus truncating only the top 5% of the message sample. Further, I
> fed only the *first* 15000 bytes through to SA. In procmail, something
> like this:
>
> # If the message is less than 15000 bytes in length,
> # handle it the regular way.
>
> :0fw
> * < 15000
> | spamassassin
>
> # If the message length is >= 15000, feed the first 15000 bytes
> # to SA. The "i" qualifier is necessary, to tell procmail it's
> # okay that the entire message is not passed through the filter.
> # The echo is there to make sure that last line has a newline.
> # (note the E qualifier means "else".)
>
> :0Efiw
> | (head -c 15000; echo "") | spamassassin
>
>
> The script above is for normal message processing/delivery. For the
> experiment, the actual script used only local checks, with no
> auto-whitelisting, to remove those as variable factors, and then
> deposited any false negatives into a mailbox:
>
> # Remove any SA headers/reports
> :0fw
> | spamassassin -d
>
> :0:
> * > 14999
> * ? (head -c 15000; echo "") | spamassassin -L -e
> big-spam-missed
>
> The resulting file, big-spam-missed, was compared against a baseline file
> where a similar script was run, but without any message truncation:
>
> # Remove any SA headers/reports
> :0fw
> | spamassassin -d
>
> :0:
> * > 14999
> * ? spamassassin -L -e
> big-spam-all-missed
>
> The baseline case has a few false negatives, because we're running with
> only local checks and no auto-whitelisting, while the original sample was
> compiled with those options enabled.
>
> After running this initial test, I found that truncating the large
> messages to 15000 bytes and passing only those bytes to Spamassassin
> produced some *additional* false negatives.
>
> Thus, if we truncate the message and pass only the initial 15000 bytes,
> we lose vital information, and we generate some false negatives.
>
> After thinking this over a bit, I decided that perhaps the situation might
> be improved by taking the *first* 7500 bytes and the *last* 7500 bytes of
> the message and passing the result to SA. The thinking here is that some
> spam may contain detectable spam content at the *end* of the message, as
> well as at the beginning.
>
> The filter looks like this:
>
> :0:
> * > 14999
> * ? (head -c 7500; echo ""; tail -c 7500) | spamassassin -L -e
> big-spam-missed
>
> (It's probably important to append the newline to the leading excerpt, in
> order to keep from generating a spurious "long line" diagnostic.)
>
> By using excerpts from both the first and last parts of the message,
> *no additional false negatives* were generated above the baseline
> case.
>
> The same (correct) results were also achieved with a 5000 byte cutoff,
> which was about the 80% mark on the spam sample that I was using. However,
> the remaining 20% of the sample represented 40% of the total bytes.
>
> I haven't tried this test on much smaller message excerpts but will do so.
> Given the encouraging results at the 5000 byte cutoff, it might be
> possible for SA to make correct determinations on even smaller message
> excerpts.
>
> I was a little concerned that chopping up the message and feeding it to
> SA might cause problems because attachment boundaries might be missing,
> and/or might not match. Also, Content-Length is clearly going to be wrong,
> and so is the LINES: descriptor. Still, even at the 5K cutoff,
> Spamassassin gave correct results, and didn't crash on a malformed mail
> message.
>
> Conclusion: It is possible to feed SA only part of a long message,
> and have it make an accurate determination of the content, but an
> excerpt must contain both the first part and last part of the message.
> Passing only the first part of the message loses
> vital information and causes SA to make incorrect decisions.
>
> Suggestion: change the behavior of "spamc -s msg_size" so that it
> sends msg_size/2 (with a newline added if necessary) from the beginning
> of the file, concatenated with the last msg_size/2 bytes of the file.
> Also, add a similar -s option to Spamassassin, to make it easier
> and more transparent to feed SA an excerpt from the message.
>
>
>
>
>
>
> -------------------------------------------------------
> This SF.net email is sponsored by: ObjectStore.
> If flattening out C++ or Java code to make your application fit in a
> relational database is painful, don't do it! Check out ObjectStore.
> Now part of Progress Software. http://www.objectstore.net/sourceforge
> _______________________________________________
> Spamassassin-talk mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
>
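For anyone who wants to try the head-plus-tail excerpt from the quoted message outside of procmail, here is a minimal sketch as a shell function. The name sa_excerpt and its single size argument are hypothetical; this is not an existing spamc or spamassassin option. It spools stdin to a temporary file first (an assumption made so that head and tail each see the whole message), passes short messages through untouched, and otherwise prints the first half, a newline, and the last half.

```shell
#!/bin/sh
# sa_excerpt LIMIT  (hypothetical helper, not part of SpamAssassin)
# Reads a message on stdin. If it is larger than LIMIT bytes, prints the
# first LIMIT/2 bytes, a newline, and the last LIMIT/2 bytes; otherwise
# prints the message unchanged.
sa_excerpt() {
    limit=${1:-15000}              # default mirrors the 15000-byte trial
    half=$((limit / 2))
    tmp=$(mktemp) || return 1
    cat > "$tmp"                   # spool stdin so it can be read twice
    size=$(($(wc -c < "$tmp")))    # arithmetic strips any padding from wc
    if [ "$size" -le "$limit" ]; then
        cat "$tmp"
    else
        head -c "$half" "$tmp"
        echo ""                    # avoid fusing the two halves into one long line
        tail -c "$half" "$tmp"
    fi
    rm -f "$tmp"
}
```

Saved as a standalone script, it could stand in for the inline pipeline in a test recipe, for example `* ? sa_excerpt 15000 | spamassassin -L -e`. Like the original recipes, it relies on head -c, which is widely available but not required by POSIX.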


