On Wed, Sep 16, 2009 at 8:16 PM, Daryl C. W. O'Shea <spamassas...@dostech.ca> wrote:
[snip]
> now hope to do this Thursday/Friday. I should be able to scan my
> million or so messages in a day on my cluster.

Wow, that makes me feel inadequate :) I'm struggling to clean up my little ham sample of 3600 messages, and looking at another couple thousand that I'll do if I've got time...

Also, I need some advice, if someone can provide it. I'm looking at a message (and I have several like this in my corpus at present) which generates the following log line:

. 1 /home/gems/ham//cur/n8500ejj019591:2,S MISSING_DATE,MISSING_HEADERS,MISSING_MID,T_FSL_HELO_NON_FQDN_2,__DKIM_DEPENDABLE,__DNS_FROM_RFC_ABUSE,__DOS_DIRECT_TO_MX,__DOS_HAS_ANY_URI,__DOS_RCVD_FRI,__DOS_SINGLE_EXT_RELAY,__HAS_ANY_EMAIL,__HAS_ANY_URI,__HAS_RCVD,__HAS_SUBJECT,__HAVE_BOUNCE_RELAYS,__LAST_EXTERNAL_RELAY_NO_AUTH,__LAST_UNTRUSTED_RELAY_NO_AUTH,__MISSING_REF,__MISSING_REPLY,__MISSING_THREAD,__NONEMPTY_BODY,__NUMBERS_IN_SUBJ,__RCVD_IN_2WEEKS,__RFC_IGNORANT_ENVFROM,__TO_NO_ARROWS_R,__TVD_BODY learn=ham,time=1252108840,scantime=1,format=f,reuse=no,set=1

It's clearly a poorly constructed message, but it's also clearly ham (it originated from an application that someone somewhere in my organization runs). It had one header: Subject. Then a body.

Should I leave stuff like this in? I mean, it is ham, but...

Thanks in advance for any guidance,
Austin.