On Thu, 8 Sep 2016, Chip M. wrote:
> On Sat, 3 Sep 2016, John Hardin wrote:
>> I've tweaked the FP avoidance a bit, maybe that will be enough
>> to get the S/O up high enough to publish it.
>
> John, do you have any detailed info about the Ham hits?

It's possible to look up what rules hit those messages, but to see the
content and judge what might need to be changed I'd have to get in touch
with the corpus owner and ask them about the messages - whether they were
correctly classified as ham or spam, and whether they'd be willing to
share them. That may not be possible as ham corpora are often private and
sensitive.
To view the rule hits in masscheck, assuming that's of interest:
1. go to the detail page for the rule you're interested in, e.g.:
http://ruleqa.spamassassin.org/20160907-r1759562-n/URI_DATA/detail
2. in the "set 0, broken down by contributor" section, click on any link
in the HAM% column.
You'll see something like:
. 1
/data/archive/ham-misc//1433183357.M606569P40031.fumail03.cleanmail.ch,S=39348,W=40036%3A2,S
HTML_MESSAGE,T_DKIM_INVALID,T_FSL_RCVD_EX_3,T_FSL_RCVD_TR_2,T_FSL_RCVD_UT_3,T_KAM_HTML_FONT_INVALID,T_NOT_A_PERSON,T_REMOTE_IMAGE,URI_DATA,URI_TRUNCATED,__ANY_TEXT_ATTACH,__ANY_TEXT_ATTACH_DOC,__BODY_TEXT_LINE,__BODY_TEXT_LINE,__BODY_TEXT_LINE,__BUGGED_IMG,__CT,__CTYPE_CHARSET_QUOTED,__CTYPE_HAS_BOUNDARY,__CTYPE_MULTIPART_ALT,__CTYPE_MULTIPART_ANY,__DKIM_EXISTS,__DOS_HAS_ANY_URI,__DOS_HAS_LIST_UNSUB,__DOS_RCVD_MON,__DOS_RCVD_SUN,__DOS_RELAYED_EXT,__FROM_ENCODED_QP,__FROM_FULL_NAME,__FROM_NEEDS_MIME,__FSL_COUNT_EXTERN,__FSL_COUNT_EXTERN,__FSL_COUNT_EXTERN,__FSL_COUNT_TRUST,__FSL_COUNT_TRUST,__FSL_COUNT_UNTRUST,__FSL_COUNT_UNTRUST,__FSL_COUNT_UNTRUST,__FSL_HAS_LIST_UNSUB,__HAS_ANY_EMAIL,__HAS_ANY_URI,__HAS_CAMPAIGN,__HAS_DATE,__HAS_DKIM_SIGHD,__HAS_DOMAINKEY_SIG,__HAS_FROM,__HAS_MESSAGE_ID,__HAS_MSGID,__HAS_RCVD,__HAS_REPLY_TO,__HAS_SUBJECT,__HAS_TO,__HAS_URI,__HAVE_BOUNCE_RELAYS,__HTML_LINK_IMAGE,__JM_REACTOR_DATE,__LAST_EXTERNAL_RELAY_NO_AUTH,__LAST_UNTRUSTED_RELAY_NO_AUTH,__LIST_PARTIAL,__LOCAL_PP_NONPPURL,__MIME_HTML,__MIME_VERSION,__MISSING_REF,__MISSING_REPLY,__MISSING_THREAD,__MSGID_OK_HOST,__NAKED_TO,__NONEMPTY_BODY,__NOT_A_PERSON,__RATWARE_0_TZ_DATE,__RCD_RDNS_MX_MESSY,__REMOTE_IMAGE,__REPLYTO_EXISTS,__SANE_MSGID,__SINGLE_WORD_LINE,__SINGLE_WORD_LINE,__SUBJ_2UPPER,__SUBJ_4LOWER,__SUBJ_HAS_WORDS,__SUBJ_NOT_SHORT,__TAG_EXISTS_BODY,__TAG_EXISTS_HEAD,__TAG_EXISTS_HTML,__TAG_EXISTS_META,__TOCC_EXISTS,__TO_NO_ARROWS_R,__TVD_BODY,__TVD_MIME_ATT_TP,__URI_DATA,__URI_DBL_DOM,__URI_MAILTO
time=1433136576,scantime=0,format=f,reuse=no,set=0
...which is the identification of the message in their corpus, and a list
of all the rules that hit it.
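
If you'd rather not read that by eye, here is a rough Python sketch that splits one such entry into its parts. It assumes the layout is exactly what the excerpt shows: five whitespace-separated fields (a marker, a number, the message path, the comma-separated rule names, and comma-separated key=value pairs). The names parse_hit and MasscheckHit are made up for this example and are not part of SpamAssassin; the meaning of the first two fields isn't spelled out here, so the sketch just carries them along. The sample path is a shortened, hypothetical one.

```python
from typing import Dict, List, NamedTuple


class MasscheckHit(NamedTuple):
    """One entry as shown on the ruleqa page (field names are assumptions)."""
    flag: str             # first field, "." in the excerpt
    score: float          # second field, "1" in the excerpt
    path: str             # message identifier within the contributor's corpus
    rules: List[str]      # distinct rule names that hit, sorted
    meta: Dict[str, str]  # trailing key=value pairs (time, scantime, ...)


def parse_hit(entry: str) -> MasscheckHit:
    # split() with no argument splits on any whitespace, so an entry that
    # the mail client wrapped between fields still parses.
    flag, score, path, rules, meta = entry.split()
    return MasscheckHit(
        flag=flag,
        score=float(score),
        path=path,
        # Some rule names repeat in the list; keep each one once.
        rules=sorted(set(rules.split(","))),
        meta=dict(pair.split("=", 1) for pair in meta.split(",")),
    )


# Hypothetical, shortened entry in the same layout as the excerpt.
sample = (". 1 /data/archive/ham-misc/example-message "
          "HTML_MESSAGE,URI_DATA,__BODY_TEXT_LINE,__BODY_TEXT_LINE "
          "time=1433136576,scantime=0,format=f,reuse=no,set=0")
hit = parse_hit(sample)
print(hit.path)
print(hit.rules)
```

An entry that has been broken in the middle of the rule list would need to be rejoined before parsing.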

> I just datamined my three best corpora, from the beginning of
> 2014 thru this weekend, and found zero FPs, except for two hits
> on that "img" test. My data does NOT prove it's impossible for
> anybody else, but it does seem odd, so I'm wondering if the
> SA MassCheck mechanism has some means for the contributor to
> pull out the corpses of specific hits.

Yes. Given that ID on the first line, the corpus owner can find the message
in question, review it, potentially fix misclassifications (that has
happened before), etc.
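
For a corpus owner who wants to pull all of those IDs for one rule in a single pass, something along these lines might do. It is only a sketch: it assumes the entries are available as text lines from a ham log in the same five-field layout as the excerpt above, and paths_hitting is a hypothetical helper, not an existing masscheck tool.

```python
from typing import Iterable, Iterator


def paths_hitting(log_lines: Iterable[str], rule: str) -> Iterator[str]:
    """Yield the message path of every entry whose rule list contains `rule`."""
    for line in log_lines:
        fields = line.split()
        # Skip blank lines, comments, and anything not in the assumed layout.
        if len(fields) != 5 or line.lstrip().startswith("#"):
            continue
        path, rules = fields[2], fields[3]
        # Compare whole rule names so URI_DATA does not match __URI_DATA.
        if rule in rules.split(","):
            yield path


# Hypothetical entries; real ones would come from the contributor's log.
lines = [
    "# comment line",
    ". 1 /corpus/ham/msg-a HTML_MESSAGE,URI_DATA,__URI_DATA time=1,set=0",
    ". 0 /corpus/ham/msg-b HTML_MESSAGE,__URI_DATA time=2,set=0",
]
to_review = list(paths_hitting(lines, "URI_DATA"))
print(to_review)
```

The resulting paths are the messages to open and check for misclassification.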
There's one more exclusion I can add that will take out the last of the
FPs in masscheck.

> If it doesn't, that would be a cool feature to add. :)

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
The Constitution is a written instrument. As such its meaning does
not alter. That which it meant when adopted, it means now.
-- U.S. Supreme Court
SOUTH CAROLINA v. US, 199 U.S. 437, 448 (1905)
-----------------------------------------------------------------------
9 days until the 229th anniversary of the signing of the U.S. Constitution