On 31.05.2016 at 15:21, Shivram Krishnan wrote:
Here is my scenario. I am using SA as an oracle/ground truth for a
research project. It is generally hard to get hold of a real-time mail
corpus

nope, just point a cheap domain at a mailserver that accepts all incoming mail and spread some hidden mail links to it

so I opted for a service provided by Mailinator. Mailinator is a
company which provides users with disposable email IDs, and it offers an
API to obtain the mails sent to those disposable IDs. Unfortunately it
provides the mail as JSON, and SA takes mail in RFC 2822 format.

which is a problem - you can't seriously rely on the classification of a message that would never appear in that form in a real mail flow

I have written a script which converts JSON to RFC 2822 (RFC 2822
specifies a lot of details, but I managed to capture most
of them, just so that SA has something to work with)

but the received headers are crap
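
A minimal sketch of such a conversion, using Python's standard email package. The JSON field names ("from", "to", "subject", "time", "body") are hypothetical placeholders, not Mailinator's actual schema, and the result still carries no Received headers, which is exactly the weakness pointed out above:

```python
import json
from email.message import EmailMessage
from email.utils import formatdate


def json_to_rfc2822(raw_json: str) -> str:
    """Build an RFC 2822 style message from a JSON document.

    ASSUMPTION: the keys "from", "to", "subject", "time" (Unix epoch
    seconds) and "body" are made up for illustration only.
    """
    data = json.loads(raw_json)
    msg = EmailMessage()
    msg["From"] = data["from"]
    msg["To"] = data["to"]
    msg["Subject"] = data["subject"]
    msg["Date"] = formatdate(data["time"])
    msg.set_content(data["body"])
    # as_string() emits the headers first, then one blank line, then the
    # body, so there is no stray newline before the first header.
    return msg.as_string()


if __name__ == "__main__":
    sample = json.dumps({
        "from": "Alice <alice@example.com>",
        "to": "bob@example.com",
        "subject": "Hello",
        "time": 1464700860,
        "body": "Just a test.",
    })
    print(json_to_rfc2822(sample))
```

Letting the email package serialize the message avoids hand-written header folding and encoding, but it cannot recover transport information that the JSON never contained.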

I have also trained SA using sa-learn on known public corpora such as
Enron etc.

I use SA to classify the converted mails from Mailinator as HAM or SPAM.

for example, if a mail is stored in the text file mail.txt I run

                        spamassassin mail.txt

This returns the necessary score for me to decide if it is SPAM or not.
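
One way to pull the score out of that output programmatically is sketched below. It assumes SA's default X-Spam-Status header layout ("Yes, score=... required=..."); a customized add_header setting would need a different pattern. The function names are illustrative only:

```python
import re
import subprocess
from email import message_from_string

# Matches the score and threshold in a default X-Spam-Status header.
_STATUS_RE = re.compile(
    r"score=(-?\d+(?:\.\d+)?)\s+required=(-?\d+(?:\.\d+)?)"
)


def parse_spam_status(marked_up: str):
    """Return (score, required, is_spam), or None if no status is found."""
    msg = message_from_string(marked_up)
    status = msg.get("X-Spam-Status")
    if status is None:
        return None
    status = " ".join(status.split())  # unfold continuation lines
    match = _STATUS_RE.search(status)
    if match is None:
        return None
    score, required = float(match.group(1)), float(match.group(2))
    return score, required, status.startswith("Yes")


def classify_file(path: str):
    """Run `spamassassin <path>` and parse the rewritten message."""
    result = subprocess.run(
        ["spamassassin", path],
        capture_output=True, text=True, errors="replace", check=True,
    )
    return parse_spam_status(result.stdout)
```

Calling classify_file("mail.txt") requires spamassassin on the PATH; parse_spam_status can be exercised on its own with saved output.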

What do you guys suggest I do in this case? Is there a better way to
do it?

FIRST strip out the newline at the beginning, which implies "end of headers", and at least generate usable Received headers

frankly i have no idea why bayes classification changes completely without useful Received headers - i started by stripping them all from our corpus with "formail" and got unpredictable, illogical results when doing a bayes-masstest on the corpus

by just stripping every Received header and adding a generic one at the beginning of the samples, things became predictable and as expected

since that day *all samples* have the header below, which also makes the bayes database more compressible (on a sane setup with no auto-expire the date doesn't matter at all)

Received: from mx.example.com (mx.example.com [91.119.73.19])
 for <m...@example.com>; Mon, 9 May 2016 19:20:00 +0200 (CEST)
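
The advice above (drop the leading blank line, remove the existing Received headers, prepend one generic header) can be sketched in Python as follows. This is not the formail-based procedure used on the corpus, only an illustration; the function name is made up, and the generic header is copied as it appears in this thread, including the address truncated by the list archive:

```python
# Generic header taken from the thread; the recipient address is
# truncated by the archive and is deliberately left as-is.
GENERIC_RECEIVED = (
    "Received: from mx.example.com (mx.example.com [91.119.73.19])\n"
    " for <m...@example.com>; Mon, 9 May 2016 19:20:00 +0200 (CEST)"
)


def normalize_sample(raw: str, received: str = GENERIC_RECEIVED) -> str:
    """Strip leading blank lines and Received headers, then prepend one."""
    # A blank line before the first header would be read as "end of
    # headers", so remove it first.
    text = raw.replace("\r\n", "\n").lstrip("\n")
    head, sep, body = text.partition("\n\n")

    kept = []
    skipping = False
    for line in head.split("\n"):
        if line[:1] in (" ", "\t"):
            # Continuation line: it belongs to the previous header.
            if not skipping:
                kept.append(line)
            continue
        skipping = line.lower().startswith("received:")
        if not skipping:
            kept.append(line)

    return "\n".join([received] + kept) + sep + body
```

Working on the raw text rather than re-serializing a parsed message keeps every other header byte-for-byte unchanged, so only the Received lines differ between the original and the normalized sample.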

On Tue, May 31, 2016 at 1:48 AM, Reindl Harald <h.rei...@thelounge.net> wrote:



    On 31.05.2016 at 08:18, Shivram Krishnan wrote:

        It is not on production. I am using this to evaluate spamassassin.


    how will you evaluate something when you cripple your setup that way?
