Michael Parker wrote:
On Mon, May 02, 2005 at 03:44:25PM -0500, Stuart Johnston wrote:

Bookworm wrote:

I've read through the archives several times, and hoped that over the last year or so someone would build the functionality, or at least mention it one way or another - I haven't seen it.

Is there any way to take an already trained Mozilla bayes structure and hand it directly off to SpamAssassin? For me, at least, that would eliminate almost all of the spam my server is receiving - Mozilla spots it instantly, but SpamAssassin is missing at least half.

Here is a project that will export the Mozilla Bayes tokens which would at least be the first step. I'm not sure how hard it would be to then import them into SA.


http://bayesjunktool.mozdev.org/



The bayes backup/restore format is fairly stable and it is pretty easy
to create a restore file from alternate sources (that is one of the
reasons it was written).  It's possibly not documented as well as it
should be, but no one has ever asked before so....

You will need the following bits of information:

1) The Raw Token (which needs to be turned into an SHA1 and then into
a hex representation, which is probably too simple of an explanation
for what is actually going on, so probably needs some more detail and
maybe a helper function in the SA code for those that might want to
attempt such a thing, not to mention a period in this sentence
somewhere.)

2) The atime value for that token - SA bayes works off access times
   for tokens, so you need to know the last time it was useful. In a
   pinch you can use the current time, but that is not optimal.

3) The ham count for the token

4) The spam count for the token

5) Number of spam msgs learned

6) Number of ham msgs learned

7) List of msg ids and whether they were learned as ham or spam (this
   can be optional, but that is not optimal, since leaving it out
   would allow for re-learning of msgs, which could throw off your
   spam/ham counts)
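
For item 1, a minimal Python sketch of the "SHA1, then hex" step. The
helper name hash_token is hypothetical, and two things here are
assumptions that are not spelled out above: that the token is encoded
as UTF-8 before hashing, and that SA keeps only the trailing 5 bytes of
the digest (keep_bytes=5). Check both against the SA Bayes source
before relying on them.

```python
import hashlib


def hash_token(raw_token: str, keep_bytes: int = 5) -> str:
    """Return the hex form of a raw token's SHA1 digest.

    keep_bytes is an ASSUMPTION: SA is believed to keep only the
    trailing bytes of the 20-byte digest. Pass keep_bytes=20 to get
    the full digest instead.
    """
    # ASSUMPTION: tokens are hashed as UTF-8 bytes.
    digest = hashlib.sha1(raw_token.encode("utf-8")).digest()
    return digest[-keep_bytes:].hex()


if __name__ == "__main__":
    # "$" is the token from the Mozilla sample quoted in this thread.
    print(hash_token("$"))
```

With the default setting the result is always 10 hex characters,
whatever the length of the raw token.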

Once you have all that, you throw it into a formatted restore file and
then run sa-learn --restore and you are all set.
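
A rough Python sketch of building such a restore file from the seven
pieces of information listed above. The line layout (tab-separated "v",
"t" and "s" records, with db_version first) is an assumption modelled
on what sa-learn --backup appears to emit; it is not described in this
message, so compare it with a real backup from your own installation
first. The function name build_restore_lines and its parameters are
hypothetical, as is the 5-byte truncation of the SHA1 digest.

```python
import hashlib
import time


def build_restore_lines(tokens, nspam, nham, seen=(), atime=None, db_version=3):
    """Return the lines of a restore file as a list of strings.

    tokens: iterable of (raw_token, spam_count, ham_count)
    seen:   iterable of (msgid, flag), flag "s" for spam or "h" for ham
    The record layout is an ASSUMPTION based on sa-learn --backup output.
    """
    if atime is None:
        # Fallback only: real per-token access times are preferable.
        atime = int(time.time())
    lines = [
        f"v\t{db_version}\tdb_version # this must be the first line!!!",
        f"v\t{nspam}\tnum_spam",
        f"v\t{nham}\tnum_nonspam",
    ]
    for raw, spam_count, ham_count in tokens:
        # ASSUMPTION: token key is the last 5 bytes of the SHA1 digest, in hex.
        hexed = hashlib.sha1(raw.encode("utf-8")).digest()[-5:].hex()
        lines.append(f"t\t{spam_count}\t{ham_count}\t{atime}\t{hexed}")
    for msgid, flag in seen:
        lines.append(f"s\t{flag}\t{msgid}")
    return lines


if __name__ == "__main__":
    # Counts taken from the Mozilla sample in this thread: bad = spam, good = ham.
    out = build_restore_lines([("$", 18, 4)], nspam=320, nham=38)
    with open("restore.txt", "w", encoding="utf-8") as fh:
        fh.write("\n".join(out) + "\n")
```

The resulting restore.txt would then be handed to sa-learn --restore.
Since every token gets the same atime here, expiry will treat them all
as equally fresh, which is the "not optimal" case mentioned in item 2.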

If someone has a dump of one of these files, and it's got all the
required information I'd be happy to take a look to see how feasible
it would be.

There are some examples in XML format here:

http://bayesjunktool.mozdev.org/installation.html

Here's a sample:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE tokenfile SYSTEM "trainer_xml.dtd"><tokenfile>
        <good_msgs>38</good_msgs>
        <bad_msgs>320</bad_msgs>
        <token>
                <name>$</name>
                <good>4</good>
                <bad>18</bad>

        </token>
...


atimes and msgids are not included.
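
For what the XML does contain, a small Python sketch that pulls out the
message totals and per-token counts. It assumes that good/bad in the
Mozilla file map to ham/spam in SA terms, and that the truncated sample
above ends with a closing </tokenfile> tag; parse_tokenfile is a
hypothetical helper name. The output order (spam count before ham
count) is just a choice made for this sketch.

```python
import xml.etree.ElementTree as ET

# The sample from this message, closed off so that it parses.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE tokenfile SYSTEM "trainer_xml.dtd"><tokenfile>
        <good_msgs>38</good_msgs>
        <bad_msgs>320</bad_msgs>
        <token>
                <name>$</name>
                <good>4</good>
                <bad>18</bad>
        </token>
</tokenfile>
"""


def parse_tokenfile(data: bytes):
    """Return (nspam, nham, [(raw_token, spam_count, ham_count), ...])."""
    root = ET.fromstring(data)
    # ASSUMPTION: good_msgs = ham learned, bad_msgs = spam learned.
    nham = int(root.findtext("good_msgs"))
    nspam = int(root.findtext("bad_msgs"))
    tokens = [
        (tok.findtext("name"), int(tok.findtext("bad")), int(tok.findtext("good")))
        for tok in root.iter("token")
    ]
    return nspam, nham, tokens


if __name__ == "__main__":
    print(parse_tokenfile(SAMPLE))
```

That covers items 1 and 3 through 6 of the list; the atime (item 2) and
the msg id list (item 7) would still have to come from somewhere else
or be defaulted.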
