At 04:29 PM 12/16/2004 +0530, Rakesh wrote:
did a sa-learn --dump data and got an output of the following kind. Can any one please help me understand the output.


The dump output is pretty simple.. The token format gets a bit complicated, but even that isn't too bad.

As for the dump output..
0.000 0 108 1103190407 N:H*i:sk:NNfNNNc


The first column is a spam probability.. 0.999 means 99.9% probability of this token appearing in spam, 0.000 means 0 percent. (note: this is just probability for ONE token. SA does a chi-squared combine of these numbers to figure out the overall probability of the whole message)
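
To make the "chi-squared combine" step a bit more concrete, here is a minimal Python sketch of Fisher-style chi-squared combining, the approach popularized by SpamBayes. It is an illustration only: the names `chi2q` and `combine` and the 0.001/0.999 clamping bounds are assumptions made for this example, and SA's real code may differ in its details.

```python
import math

def chi2q(x2, dof):
    """Upper-tail probability of a chi-squared value; dof must be even."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spam probabilities into one message-level score."""
    # Clamp so a rounded 0.000 or 1.000 never reaches log(0) (assumed bounds).
    probs = [min(max(p, 0.001), 0.999) for p in probs]
    n = len(probs)
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # 0.5 means "no idea"; values near 1 are spam, values near 0 are nonspam.
    return (spamminess - hamminess + 1.0) / 2.0

print(combine([0.978, 0.958, 0.958]))  # three strongly spammy tokens
print(combine([0.2, 0.8]))             # evidence that cancels out
```

The point is that no single token decides anything; a pile of tokens that all lean the same way pushes the combined score toward 0 or 1.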

The second column is the number of times bayes has been trained on a spam message containing the token.

The third column is the number of times bayes has been trained on a nonspam message containing the token.

The fourth column is a Unix timestamp: the last time the token was accessed (its "atime"). The fifth is the token itself. SA uses some "prefix" characters for encoding things, but without any prefix, a token is a word in the body of the message.
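
If you want to pick these lines apart in a script, here is a minimal sketch that splits one dump line into its five whitespace-separated fields. The names `DumpEntry` and `parse_dump_line` are hypothetical, and reading the timestamp field as seconds since the epoch in UTC is an assumption.

```python
from datetime import datetime, timezone
from typing import NamedTuple

class DumpEntry(NamedTuple):
    prob: float        # spam probability for this one token
    spam_count: int    # spam messages trained containing the token
    ham_count: int     # nonspam messages trained containing the token
    atime: datetime    # timestamp field, read as epoch seconds (assumption)
    token: str         # the token, prefixes included

def parse_dump_line(line):
    # Split on runs of whitespace so column-aligned output parses too.
    prob, nspam, nham, atime, token = line.split(None, 4)
    return DumpEntry(
        float(prob),
        int(nspam),
        int(nham),
        datetime.fromtimestamp(int(atime), tz=timezone.utc),
        token.strip(),
    )

entry = parse_dump_line("0.000 0 108 1103190407 N:H*i:sk:NNfNNNc")
print(entry.token, entry.spam_count, entry.ham_count, entry.atime.date())
```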

This leads into how all these prefixes work and what they mean...


First, some token format prefixes:
-----------------
N: means there are numbers in the token represented by N's (thus allowing match of anything 0-9)
sk: means "skip", ie: the token can have other characters leading up to it, and does not need to be the start of a "word"
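
As a rough illustration of the N: idea, here is a small Python sketch that assumes the N: form is made by swapping every digit in a word for a literal N. The function name `collapse_digits` is hypothetical, and this is not SA's actual tokenizer code.

```python
import re

def collapse_digits(word):
    """Return the "N:" form of a word, or None if it contains no digits."""
    if not re.search(r"[0-9]", word):
        return None
    # Every digit becomes a literal N, so differing numbers share one token.
    return "N:" + re.sub(r"[0-9]", "N", word)

print(collapse_digits("12f345c"))   # same token as "00f999c" would give
print(collapse_digits("Unix"))      # no digits, so no N: form
```

This is why one N: token can stand for a whole family of strings that differ only in their digits.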


Now some "where the token must appear" prefixes:
-------------------------
U* indicates it's the username part of an email address
D* indicates it's the domain part of an email address

H indicates the token must be in a header. It can be followed by a literal header name (ie: HTo:), or one of the following "short cuts"

%HEADER_NAME_COMPRESSION = (
  'Message-Id'          => '*m',
  'Message-ID'          => '*M',
  'Received'            => '*r',
  'User-Agent'          => '*u',
  'References'          => '*f',
  'In-Reply-To'         => '*i',
  'From'                => '*F',
  'Reply-To'            => '*R',
  'Return-Path'         => '*p',
  'X-Mailer'            => '*x',
  'X-Authentication-Warning' => '*a',
  'Organization'        => '*o',
  'Organisation'        => '*o',
  'Content-Type'        => '*c',
);
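
Putting the "where" prefixes together, here is a minimal Python sketch of a decoder. The function name `locate` and the wording of its results are hypothetical, and the rule "starts with H and contains a colon" is only a heuristic for this example (a body word that happens to look like that would be misread). Strip any format prefix such as N: before calling it.

```python
# Reverse of the header shortcut table above.
HEADER_SHORTCUTS = {
    "*m": "Message-Id", "*M": "Message-ID", "*r": "Received",
    "*u": "User-Agent", "*f": "References", "*i": "In-Reply-To",
    "*F": "From", "*R": "Reply-To", "*p": "Return-Path",
    "*x": "X-Mailer", "*a": "X-Authentication-Warning",
    "*o": "Organization",  # the same shortcut also covers "Organisation"
    "*c": "Content-Type",
}

def locate(token):
    """Return (where, rest) for a token, following the prefixes described above."""
    if token.startswith("U*"):
        return "username of an email address", token[2:]
    if token.startswith("D*"):
        return "domain of an email address", token[2:]
    if token.startswith("H") and ":" in token:
        # Either a shortcut such as "*M" or a literal header name such as "To".
        name, rest = token[1:].split(":", 1)
        return HEADER_SHORTCUTS.get(name, name) + " header", rest
    return "body", token

for tok in ("H*M:OEBfa62", "HTo:U*Jesrine", "D*ms52.hinet.net", "Tins"):
    print(tok, "->", locate(tok))
```

Note that the remainder can itself carry a prefix (HTo:U*Jesrine is a username inside the To header), so you can call `locate` again on the second element if you want to unwrap it further.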



So let's translate a few.


0.000 0 108 1103190407 N:H*i:sk:NNfNNNc

Probability 0%, if the "In-Reply-To" header contains a numeric pattern NNfNNNc (ie: 00f000c thru 99f999c). The token may be in the middle of a "word" and does not need to have whitespace or a word boundary before it.


0.978 2 0 1103188668 UNLIKE

97.8% chance of spam if the all-caps word "UNLIKE" appears in the body.

0.009 0 6 1102997003 U*sambalpur

0.9% chance of spam if there is an email address with the username "sambalpur@" in the message


0.958 1 0 1103003309 H*M:OEBfa62

95.8% chance of spam if the Message-ID header has a word starting with OEBfa62


0.958 1 0 1103171817 Tins

95.8% chance of spam if the word "Tins" appears in the body.


etc...

0.049          0          1 1102985500  D*ms52.hinet.net
0.013        219      25539 1103193138  H*r:Unix
0.027         31       1717 1103192325  N:HX-Qmail-Scanner:N.NNNNNN
0.467        123        219 1103186329  PERSONAL
0.013          0          4 1103027319  HTo:U*Jesrine
0.985          3          0 1103099578  backfiring
0.017          0          3 1103031379  YÓk
0.049          0          1 1102972766  Wspecial
0.958          1          0 1102981540  sk:QHKBAZC


