I did a sa-learn --dump data and got output of the following kind. Can anyone please help me understand the output?
The dump output is pretty simple.. The token format gets a bit complicated, but even that isn't too bad.
As for the dump output..
0.000 0 108 1103190407 N:H*i:sk:NNfNNNc
The first column is a spam probability.. 0.999 means 99.9% probability of this token appearing in spam, 0.000 means 0 percent. (note: this is just probability for ONE token. SA does a chi-squared combine of these numbers to figure out the overall probability of the whole message)
The second column is the number of times bayes has been trained on a spam message containing the token.
The third column is the number of times bayes has been trained on a nonspam message containing the token.
The fourth column is a Unix timestamp: the last time bayes saw the token (its "atime").
The fifth is the token itself. SA uses some "prefix" characters for encoding things, but without any prefix, a token is a word in the body of the message.
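To make the column layout concrete, here is a minimal Python sketch that splits one dump line into its fields. It assumes the layout seen in the sample line (probability, spam count, ham count, a numeric column that looks like a Unix timestamp, then the token). The names DumpEntry and parse_dump_line are made up for this example and are not part of SA.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class DumpEntry(NamedTuple):
    """One line of dump output (hypothetical helper type, not from SA)."""
    prob: float       # per-token spam probability, 0.0 to 1.0
    spam_count: int   # times trained on spam containing the token
    ham_count: int    # times trained on nonspam containing the token
    atime: int        # numeric column, assumed to be a Unix timestamp
    token: str        # the token, including any prefixes


def parse_dump_line(line: str) -> DumpEntry:
    # Split on whitespace at most 4 times so the token is kept whole
    # even if it happens to contain spaces.
    prob, nspam, nham, atime, token = line.strip().split(None, 4)
    return DumpEntry(float(prob), int(nspam), int(nham), int(atime), token)


if __name__ == "__main__":
    entry = parse_dump_line("0.000 0 108 1103190407 N:H*i:sk:NNfNNNc")
    print(entry)
    # Show the timestamp column as a readable UTC date.
    print(datetime.fromtimestamp(entry.atime, tz=timezone.utc).isoformat())
```

Feeding each line of the dump through parse_dump_line gives you something you can sort or filter, for example by prob or by ham_count.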
Now this leads into how do all these prefixes work and what do they mean...
First, some token format prefixes:
-----------------
N: means there are numbers in the token represented by N's (thus allowing match of anything 0-9)
sk: means "skip", i.e. the token can have other characters leading up to it, and does not need to be the start of a "word"
Now some "where the token must appear" prefixes:
-------------------------
U* indicates it's the username part of an email address
D* indicates it's the domain part of an email address
H indicates the token must be in a header. It can be followed by a literal header name (e.g. HTo:), or one of the following "shortcuts":
%HEADER_NAME_COMPRESSION = (
  'Message-Id' => '*m',
  'Message-ID' => '*M',
  'Received' => '*r',
  'User-Agent' => '*u',
  'References' => '*f',
  'In-Reply-To' => '*i',
  'From' => '*F',
  'Reply-To' => '*R',
  'Return-Path' => '*p',
  'X-Mailer' => '*x',
  'X-Authentication-Warning' => '*a',
  'Organization' => '*o',
  'Organisation' => '*o',
  'Content-Type' => '*c',
);
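The prefix rules above can be sketched as a small Python decoder. This is only an illustration built from the prefixes described in this message (N:, sk:, U*, D*, H plus a header name or shortcut). The function name decode_token and the returned dict layout are made up, and the prefix order it expects is an assumption taken from the sample tokens, not from SA's tokenizer.

```python
# Shortcut -> header name, inverted from the table above.
# '*o' stands for both Organization and Organisation; one spelling is kept.
HEADER_SHORTCUTS = {
    "*m": "Message-Id", "*M": "Message-ID", "*r": "Received",
    "*u": "User-Agent", "*f": "References", "*i": "In-Reply-To",
    "*F": "From", "*R": "Reply-To", "*p": "Return-Path",
    "*x": "X-Mailer", "*a": "X-Authentication-Warning",
    "*o": "Organization", "*c": "Content-Type",
}


def decode_token(token: str) -> dict:
    """Describe a dump token (hypothetical helper, not part of SA)."""
    info = {"numeric": False, "header": None, "skip": False,
            "kind": "word", "text": token}
    rest = token
    if rest.startswith("N:"):            # digits were replaced by N's
        info["numeric"] = True
        rest = rest[2:]
    if rest.startswith("H*") and len(rest) >= 4 and rest[3] == ":":
        code = rest[1:3]                 # e.g. "*i"
        info["header"] = HEADER_SHORTCUTS.get(code, code)
        rest = rest[4:]
    elif rest.startswith("H") and ":" in rest[1:]:
        # Literal header name, e.g. "HTo:". Heuristic: a body word that
        # starts with H and contains a colon would be misread here.
        name, _, rest = rest[1:].partition(":")
        info["header"] = name
    if rest.startswith("sk:"):           # may match mid-word
        info["skip"] = True
        rest = rest[3:]
    if rest.startswith("U*"):            # username part of an address
        info["kind"] = "username"
        rest = rest[2:]
    elif rest.startswith("D*"):          # domain part of an address
        info["kind"] = "domain"
        rest = rest[2:]
    info["text"] = rest
    return info


if __name__ == "__main__":
    for tok in ("N:H*i:sk:NNfNNNc", "HTo:U*Jesrine", "D*ms52.hinet.net"):
        print(tok, "->", decode_token(tok))
```

Any prefix not covered in this message is left alone and ends up in "text" unchanged.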
So let's translate a few.
0.000 0 108 1103190407 N:H*i:sk:NNfNNNc
Probability 0%, if the "In-Reply-To" header contains a numeric pattern NNfNNNc (i.e. 00f000c thru 99f999c). The token may be in the middle of a "word" and does not need to have whitespace or a word boundary before it.
0.978 2 0 1103188668 UNLIKE
97.8% chance of the all-caps word "UNLIKE" appearing in the body of spam.
0.009 0 6 1102997003 U*sambalpur
0.9% chance of spam if there is an email address with the username "sambalpur@" in the message
0.958 1 0 1103003309 H*M:OEBfa62
95.8% chance of spam if the Message-ID header has a word starting with OEBfa62
0.958 1 0 1103171817 Tins
95.8% chance of spam if the word "Tins" appears in the body.
etc...
0.049 0 1 1102985500 D*ms52.hinet.net
0.013 219 25539 1103193138 H*r:Unix
0.027 31 1717 1103192325 N:HX-Qmail-Scanner:N.NNNNNN
0.467 123 219 1103186329 PERSONAL
0.013 0 4 1103027319 HTo:U*Jesrine
0.985 3 0 1103099578 backfiring
0.017 0 3 1103031379 YÓk
0.049 0 1 1102972766 Wspecial
0.958 1 0 1102981540 sk:QHKBAZC
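As for the note earlier about SA doing a chi-squared combine of the per-token numbers: the Python sketch below shows the general Fisher-style chi-squared combining used by Bayesian mail filters, to give a feel for how several token probabilities turn into one message score. It is not SA's own code, and SA's implementation may differ in details such as which tokens it picks and how it clamps values. The names chi2q and combine, and the eps clamp, are assumptions made for this example.

```python
import math


def chi2q(x2: float, dof: int) -> float:
    """Chi-squared survival function P(X >= x2), for even dof only."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.
    return min(total, 1.0)


def combine(probs, eps: float = 1e-6) -> float:
    """Combine per-token spam probabilities into one score in [0, 1]."""
    # Clamp so that log(0) never happens for 0.000 or 1.000 tokens
    # (eps is an arbitrary choice for this sketch).
    clamped = [min(max(p, eps), 1.0 - eps) for p in probs]
    n = len(clamped)
    # Evidence for spam: large when many probabilities are near 1.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in clamped), 2 * n)
    # Evidence for nonspam: large when many probabilities are near 0.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in clamped), 2 * n)
    # 0.5 means "no idea", near 1 means spam, near 0 means nonspam.
    return (s - h + 1.0) / 2.0


if __name__ == "__main__":
    print(combine([0.978, 0.958, 0.985]))  # spammy tokens from the dump
    print(combine([0.009, 0.013, 0.017]))  # hammy tokens from the dump
```

The point is that no single token decides anything; a handful of strong tokens pointing the same way pushes the combined score close to 0 or 1, while mixed evidence lands near 0.5.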