Re: New rule for HTML spam, using comments?

Amir 'CG' Caspi Tue, 18 Jun 2013 11:20:52 -0700

Replies to multiple folks below...

At 1:42 PM -0400 06/18/2013, Kris Deugau wrote:

Try opening the on-disk file with Notepad (or your favourite text editor
on *nix).  If you see the same thing you see when you hit the "blah blah
blah" button in Eudora, you should be OK.  If not...

I've done that and I think I see the same thing. Indeed, as I mentioned earlier, I use the on-disk file to pipe into sa-learn directly. However, I'm not quite sure that I trust Eudora's on-disk file to actually be what an mbox format is supposed to be, at this point. That is, I'm wondering if Eudora's on-disk storage is also somehow not correct, compared to what the incoming mail format is.


At 7:44 PM +0200 06/18/2013, Axb wrote:

 a simple, fast & cheap URI rule would catch these
and to make it even more efficient - reject anything with HELO or sender using .pw and save lots of cycles.

Sure, but my problem right now is not the .pw spams specifically. It's the fact that I'm getting different results when the spam is first processed compared to when I run it through spamc manually. In some cases I've done this literally within seconds of receiving the spam and STILL got different scores (see earlier email).


At 1:57 PM -0400 06/18/2013, Ben Johnson wrote:

For the sake of troubleshooting, can you try accessing the mail by some
other means, e.g., opening the file directly from the filesystem?

See reply to Kris above. I think mbox is plaintext, yes... but Eudora strips attachments and places them in separate directories so they are not in the monolithic mbox. That's ONE way Eudora is different than other clients... I'm wondering if there are yet more differences, which would explain why the message in the mbox is not identical to what originally was delivered by the MTA.

How would anything ever be flagged with a score higher than BAYES_00 if
this were to be the problem? Didn't you report a score of BAYES_99 in
one of your tests?

The Bayes99 I reported earlier was from running it manually. However, yes, I do get high Bayes scores on auto-classified spam... I've been perusing my Spam folder (where the MTA dumps anything with X-Spam-Status: YES) and a number of the TP hits from SA do show Bayes99 (though they often also show lower scores).

Clearly, sa-learn can parse the Eudora mbox format... I'm just wondering if there's something about it that makes it sufficiently different from the raw mail delivered by the MTA that it is confusing Bayes.

How are you feeding the messages to sa-learn? Are you not just passing
the email file, e.g., /var/vmail/example.com/...? Why copy from Eudora
and paste into a temporary file when you can just point sa-learn
straight to the message on disk?

Eudora is run on my laptop; SA is run on the server (UNIX). Therefore, I can't point SA directly to the message on disk. The server also uses mbox, not maildir, so I can't point to individual messages, only whole mailboxes.
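
For pulling a single message out of a server-side mbox without going through the mail client, a minimal sketch using Python's standard `mailbox` module could look like the following. The function name `extract_message` and the index-based selection are illustrative assumptions, not something used in this thread. Note that `get_bytes` returns the stored bytes minus the leading "From " separator line, and any ">From " quoting inside the body is left as stored.

```python
import mailbox


def extract_message(mbox_path, index):
    """Return the raw bytes of the index-th message (0-based) in an mbox file.

    Hypothetical helper: the name and the index-based lookup are assumptions
    made for illustration only.
    """
    box = mailbox.mbox(mbox_path, create=False)
    try:
        keys = box.keys()
        # get_bytes() returns the message as stored, without re-serializing
        # it through the email package and without the "From " separator line.
        return box.get_bytes(keys[index])
    finally:
        box.close()
```

The returned bytes can be written to a file and fed to spamc, which avoids the copy/paste step through the client.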

I do the copy/paste when I want to run individual messages manually... not through sa-learn, but through spamc. This is to check why some messages seem to get low Bayes scores when delivered by the MTA... in many cases I get much higher scores when I run it manually, which is making me question the way my DB is getting trained. (For reference, both the MTA and my manual calls are running the message through spamc with the same user DB, so they _should_ return identical scores if the manually-fed message is identical... which is why I'm now starting to think it is NOT identical, and why I'm now questioning how the training is done.)
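
One way to test the "is it really identical?" question directly is to diff the two raw copies byte for byte. The sketch below is only an illustration: `diff_messages` is a hypothetical helper, and it assumes both copies (the one the MTA delivered and the one pasted from the client) have been saved as files and read in as bytes. Lines are shown via `repr()` so that invisible differences such as a stray `\r` or trailing whitespace show up in the output.

```python
import difflib


def diff_messages(delivered, manual):
    """Return unified-diff lines between two raw copies of one message.

    An empty list means the copies are byte-for-byte identical.
    Hypothetical helper for illustration; both arguments are bytes.
    """
    # Split on LF only and show each line with repr(), so CR bytes,
    # tabs and trailing spaces remain visible in the diff.
    a = [repr(line) for line in delivered.split(b"\n")]
    b = [repr(line) for line in manual.split(b"\n")]
    return list(difflib.unified_diff(a, b, "delivered", "manual", lineterm=""))
```

If this returns any lines at all, the two spamc runs were not scoring the same input, and differing scores would not by themselves point to a Bayes DB problem.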

When I run the Eudora mbox through training, I literally just copy the mbox from my laptop to the server, then run:


sa-learn --no-sync --progress --spam --mbox Eudora_Junk

(The journal auto-syncs shortly afterwards.)

Actually, I do have one step in there: I have to change the CR line endings that Eudora uses into the LF (newline) line endings that UNIX uses... but that's the only change and it should be fully compliant with what's passing through the MTA.
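
As a sketch of that conversion step (an illustration only, not the exact command used here), the function below turns both bare CR and CRLF line endings into LF. The name `normalize_newlines` is hypothetical.

```python
def normalize_newlines(data):
    """Convert CRLF and bare CR line endings in a bytes object to LF.

    Hypothetical helper for illustration.
    """
    # Replace CRLF first so that "\r\n" does not turn into two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Applied to the contents of the copied mailbox (for example the Eudora_Junk file above, opened in binary mode), this gives an LF-only file to hand to sa-learn.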

Do you retain your training corpus? This may be one of those instances
in which the best way to debug the problem is to wipe and retrain Bayes.
Of course, that can be a nightmare if you don't retain the messages that
you've trained as ham and spam.

I don't have the corpus because the SA installation that I have is through Parallels Pro Control Panel. It initially ships completely untrained, and when I deployed it about 6 years ago, I didn't know much about SA nor the need for training. The DB has been autolearning over the past 5-6 years on its own. It is only within the last two months that I've been manually trying to teach it.

So, no, I don't have a corpus of spam and ham on which to train. I do have a Spam mailbox with about 1000 messages (both TPs and FNs), and of course I have my inbox with about 3300 messages in it (and tens of thousands more in archive folders)... in principle, I could probably train on these mboxes. However, if Eudora's mbox formatting is indeed the problem, it means I will need to change how I store things, like switching email clients (which I should probably do anyway given how ancient and unsupported Eudora is, but you know how hard it is to switch clients), or at the very least changing how the server is storing/delivering my mail.


I wonder if there's any better way to debug this.

Thanks for all the help so far.

                                                --- Amir
