I've been experimenting with limiting the amount of data that SA uses
to make its decisions, by cutting back on the size of the message fed
to SA. The results of those experiments are given below.

Given a sample of 1543 bona fide spam messages trapped over the past
week, the message sizes (in bytes, including headers) are distributed
as follows:

       0%    774
      10%   1481
      20%   1806
      30%   2132
      40%   2560
      50%   3040
      60%   3687
      70%   4168
      80%   5655
      90%   8723
     100% 236579

Thus 90% of the messages are at most 8723 bytes long.
The 95th percentile is about 15000 bytes.
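
For anyone who wants to produce a similar table for their own spam
sample, here is a minimal sketch. It assumes each message is stored as
a separate file in one directory, and it uses the nearest-rank
definition of a percentile, so its numbers may differ slightly from a
tool that interpolates. The function name size_percentiles is just an
illustrative choice.

```shell
# Print nearest-rank size percentiles (in bytes) for a directory of messages.
# Assumption: each message is a separate regular file in the directory "$1".
size_percentiles() {
    for f in "$1"/*; do
        if [ -f "$f" ]; then
            wc -c < "$f"
        fi
    done | sort -n | awk '
        { size[NR] = $1 }
        END {
            if (NR == 0) exit 1
            for (p = 0; p <= 100; p += 10) {
                # nearest rank: ceil(p * NR / 100), with a floor of 1
                i = int((p * NR + 99) / 100)
                if (i < 1) i = 1
                printf "%4d%% %7d\n", p, size[i]
            }
        }'
}
```

It would be called as: size_percentiles /path/to/spam-dir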

On the first trial, I arbitrarily decided to set the message limit to 15000,
thus truncating only the top 5% of the message sample. Further, I fed
only the *first* 15000 bytes through to SA. In procmail, something like
this:

# If the message is less than 15000 bytes in length,
# handle it the regular way.

:0fw
* < 15000
| spamassassin

# If the message length is >= 15000, feed the first 15000 bytes
# to SA. The "i" qualifier is necessary, to tell procmail it's
# okay that the entire message is not passed through the filter.
# The echo is there to make sure that last line has a newline.
# (note the E qualifier means "else".)

:0Efiw
| (head -c 15000; echo "") | spamassassin


The script above is for normal message processing/delivery. For the
experiment, the actual script used only local checks, with no
auto-whitelisting, to remove those as variable factors, and then
deposited any false negatives into a mailbox:

# Remove any SA headers/reports
:0fw
| spamassassin -d

:0:
* > 14999
* ? (head -c 15000; echo "") | spamassassin -L -e
big-spam-missed

The resulting file, big-spam-missed, was compared against a baseline file
where a similar script was run, but without any message truncation:

# Remove any SA headers/reports
:0fw
| spamassassin -d

:0:
* > 14999
* ? spamassassin -L -e
big-spam-all-missed

The baseline case has a few false negatives of its own, because we're
running with only local checks and no auto-whitelisting, while the
original sample was compiled with those options enabled.
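
The comparison between the two mailboxes comes down to counting
messages. A minimal sketch, assuming both files are in mbox format
(where each message starts with a "From " separator line); the helper
names count_msgs and extra_false_negatives are made up for
illustration:

```shell
# Count messages in an mbox file by counting "From " separator lines.
# Assumption: the file is in mbox format; a missing file counts as 0.
count_msgs() {
    if [ -f "$1" ]; then
        # grep -c prints 0 but exits non-zero when nothing matches
        grep -c '^From ' "$1" || true
    else
        echo 0
    fi
}

# Additional false negatives caused by truncation:
# (misses in the truncated run) minus (misses in the baseline run).
extra_false_negatives() {
    echo $(( $(count_msgs "$1") - $(count_msgs "$2") ))
}
```

It would be called as:
extra_false_negatives big-spam-missed big-spam-all-missed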

After running this initial test, I found that truncating the large
messages to 15000 bytes and passing only those bytes to SpamAssassin
produced some *additional* false negatives beyond the baseline.

Thus, if we truncate the message and pass only the initial 15000 bytes,
we lose vital information, and we generate some false negatives.

After thinking this over a bit, I decided that the situation might be
improved by taking the *first* 7500 bytes and the *last* 7500 bytes of
the message and passing the result to SA. The thinking here is that
some spam may contain detectable spam content at the *end* of the
message, as well as at the beginning.

The filter looks like this:

:0:
* > 14999
* ? (head -c 7500; echo ""; tail -c 7500) | spamassassin -L -e
big-spam-missed

(It's probably important to add the newline after the beginning text in
order to keep from generating a spurious "long line" diagnostic.)
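
One caveat with the head/tail pipeline: when its input is a pipe, head
may read more bytes than it prints, so the tail that follows can see
less of the message than expected. Here is a minimal sketch of the same
first-half/last-half idea that buffers the message in a temporary file
first. The function name excerpt and its SIZE argument are hypothetical
and not part of SpamAssassin or spamc.

```shell
# Write the first and last SIZE/2 bytes of a message to stdout.
# Usage: excerpt SIZE < message
# Hypothetical helper; messages of at most SIZE bytes pass through unchanged.
excerpt() {
    size=$1
    half=$((size / 2))
    tmp=$(mktemp) || return 1
    cat > "$tmp"                        # buffer stdin so it can be read twice
    total=$(( $(wc -c < "$tmp") ))      # arithmetic strips any padding from wc
    if [ "$total" -le "$size" ]; then
        cat "$tmp"
    else
        head -c "$half" "$tmp"
        echo ""                         # end the line that head may have cut
        tail -c "$half" "$tmp"
    fi
    rm -f "$tmp"
}
```

If this were saved as a script on the PATH, the condition line could
become something like: * ? excerpt 15000 | spamassassin -L -e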

By using excerpts from both the first and last parts of the message,
*no additional false negatives* were generated above the baseline
case.

The same (correct) results were also achieved with a 5000-byte cutoff,
which was about the 80% mark on the spam sample that I was using. Note
that the remaining 20% of the sample, the largest messages, represented
40% of the total bytes.

I haven't tried this test on much smaller message excerpts but will do so.
Given the encouraging results at the 5000-byte cutoff, it might be possible
for SA to make correct determinations on even smaller message excerpts.

I was a little concerned that chopping up the message and feeding it to
SA might cause problems, because attachment boundaries might be missing
and/or might not match. Also, Content-Length is clearly going to be
wrong, and so is the Lines: header. Still, even at the 5K cutoff,
SpamAssassin gave correct results and didn't crash on a malformed mail
message.

Conclusion: It is possible to feed SA only part of a long message,
and have it make an accurate determination of the content, but the
excerpt must contain both the first part and the last part of the
message. Passing only the first part of the message loses vital
information and causes SA to make incorrect decisions.

Suggestion: change the behavior of "spamc -s msg_size" so that it
sends the first msg_size/2 bytes of the file (with a newline added if
necessary), concatenated with the last msg_size/2 bytes of the file.
Also, add a similar -s option to the spamassassin command, to make it
easier and more transparent to feed SA an excerpt from the message.





