Replies to multiple folks below...

At 1:42 PM -0400 06/18/2013, Kris Deugau wrote:
Try opening the on-disk file with Notepad (or your favourite text editor
on *nix).  If you see the same thing you see when you hit the "blah blah
blah" button in Eudora, you should be OK.  If not...

I've done that and I think I see the same thing. Indeed, as I mentioned earlier, I use the on-disk file to pipe into sa-learn directly. However, I'm not quite sure that I trust Eudora's on-disk file to actually be what an mbox format is supposed to be, at this point. That is, I'm wondering if Eudora's on-disk storage is also somehow not correct, compared to what the incoming mail format is.

At 7:44 PM +0200 06/18/2013, Axb wrote:
 a simple, fast & cheap URI rule would catch these

and to make it even more efficient - reject anything with HELO or sender using .pw and save lots of cycles.

Sure, but my problem right now is not the .pw spams specifically. It's the fact that I'm getting different results when the spam is first processed compared to when I run it through spamc manually. In some cases I've done this literally within seconds of receiving the spam and STILL got different scores (see earlier email).

At 1:57 PM -0400 06/18/2013, Ben Johnson wrote:
For the sake of troubleshooting, can you try accessing the mail by some
other means, e.g., opening the file directly from the filesystem?

See reply to Kris above. I think mbox is plaintext, yes... but Eudora strips attachments and places them in separate directories so they are not in the monolithic mbox. That's ONE way Eudora is different than other clients... I'm wondering if there are yet more differences, which would explain why the message in the mbox is not identical to what originally was delivered by the MTA.

How would anything ever be flagged with a score higher than BAYES_00 if
this were to be the problem? Didn't you report a score of BAYES_99 in
one of your tests?

The BAYES_99 I reported earlier was from running it manually. However, yes, I do get high Bayes scores on auto-classified spam... I've been perusing my Spam folder (where the MTA dumps anything with X-Spam-Status: YES) and a number of the TP hits from SA do show BAYES_99 (though they often also show lower scores).

Clearly, sa-learn can parse the Eudora mbox format... I'm just wondering if there's something about it that makes it sufficiently different from the raw mail delivered by the MTA that is confusing Bayes.

How are you feeding the messages to sa-learn? Are you not just passing
the email file, e.g., /var/vmail/example.com/...? Why copy from Eudora
and paste into a temporary file when you can just point sa-learn
straight to the message on disk?

Eudora is run on my laptop; SA is run on the server (UNIX). Therefore, I can't point SA directly to the message on disk. The server also uses mbox, not maildir, so I can't point to individual messages, only whole mailboxes.
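
As an illustration only (not something from the setup described here), a single message can be pulled out of a whole mbox file with Python's standard mailbox module. The function name extract_message and the index-based lookup are hypothetical choices for this sketch; it assumes the file uses standard "From " separator lines with LF line endings.

```python
import mailbox


def extract_message(mbox_path: str, index: int) -> bytes:
    """Return the raw bytes of the message at position `index` in an mbox file.

    Hypothetical helper for illustration. The leading "From " separator
    line is not included in the result.
    """
    # create=False avoids silently creating an empty mailbox on a bad path.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        keys = box.keys()  # one key per message, in file order
        # get_bytes() returns the stored bytes without re-serializing
        # the message through the email parser.
        return box.get_bytes(keys[index])
    finally:
        box.close()
```

The returned bytes could be written to a file and fed to spamc on standard input, which avoids a copy/paste step through the mail client.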

I do the copy/paste when I want to run individual messages manually... not through sa-learn, but through spamc. This is to check why some messages seem to get low Bayes scores when delivered by the MTA... in many cases I get much higher scores when I run it manually, which is making me question the way my DB is getting trained. (For reference, both the MTA and my manual calls are running the message through spamc with the same user DB, so they _should_ return identical scores if the manually-fed message is identical... which is why I'm now starting to think it is NOT identical, and why I'm now questioning how the training is done.)
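
One way to test the "is it really identical?" question is to diff the two copies with invisible characters made visible. The following is a minimal sketch under assumptions: the function name diff_messages and the labels "delivered" and "pasted" are made up for illustration, and it presumes both copies of the message are available as raw bytes.

```python
import difflib


def diff_messages(delivered: bytes, pasted: bytes) -> list:
    """Return unified-diff lines between two raw copies of a message.

    Each line is shown via repr(), so differences in line endings (\\r, \\n),
    tabs, or trailing whitespace show up explicitly in the output.
    """
    def visible(data: bytes) -> list:
        # latin-1 maps every byte to a character, so decoding never fails.
        return [repr(line.decode("latin-1"))
                for line in data.splitlines(keepends=True)]

    return list(difflib.unified_diff(
        visible(delivered), visible(pasted),
        fromfile="delivered", tofile="pasted", lineterm=""))
```

An empty result means the two copies are byte-for-byte identical; any output points at the lines (headers or body) that differ between them.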

When I run the Eudora mbox through training, I literally just copy the mbox from my laptop to the server, then run:

sa-learn --no-sync --progress --spam --mbox Eudora_Junk

(The journal auto-syncs shortly afterwards.)

Actually, I do have one step in there: I have to change the CR (carriage return) line endings that Eudora uses into the LF (newline) line endings that UNIX uses... but that's the only change and it should be fully compliant with what's passing through the MTA.
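
For illustration, a line-ending conversion of this kind might look like the sketch below. The function name normalize_line_endings is hypothetical, and this is not necessarily the tool used for the step described above; it simply shows the transformation, with CRLF handled as well in case a file contains a mix.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF, leaving LF untouched."""
    # Replace CRLF first so a CRLF pair does not turn into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    sample = b"Subject: test\r\rbody line\r"
    print(normalize_line_endings(sample))
```

Working on bytes rather than decoded text keeps everything else in the message unchanged, which matters when the goal is to compare it against what the MTA delivered.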


Do you retain your training corpus? This may be one of those instances
in which the best way to debug the problem is to wipe and retrain Bayes.
Of course, that can be a nightmare if you don't retain the messages that
you've trained as ham and spam.

I don't have the corpus because the SA installation that I have is through Parallels Pro Control Panel. It initially ships completely untrained, and when I deployed it about 6 years ago, I didn't know much about SA nor the need for training. The DB has been autolearning over the past 5-6 years on its own. It is only within the last two months that I've been manually trying to teach it.

So, no, I don't have a corpus of spam and ham on which to train. I do have a Spam mailbox with about 1000 messages (both TPs and FNs), and of course I have my inbox with about 3300 messages in it (and tens of thousands more in archive folders)... in principle, I could probably train on these mboxes. However, if Eudora's mbox formatting is indeed the problem, it means I will need to change how I store things, like switching email clients (which I should probably do anyway given how ancient and unsupported Eudora is, but you know how hard it is to switch clients), or at the very least changing how the server is storing/delivering my mail.

I wonder if there's any better way to debug this.

Thanks for all the help so far.

                                                --- Amir
