On Thu, 8 Sep 2016, Chip M. wrote:

On Sat, 3 Sep 2016, John Hardin wrote:
I've tweaked the FP avoidance a bit, maybe that will be enough
to get the S/O up high enough to publish it.

John, do you have any detailed info about the Ham hits?

It's possible to look up which rules hit those messages. To see the content and judge what might need to be changed, though, I'd have to get in touch with the corpus owner and ask about the messages: whether they were correctly classified as ham or spam, and whether the owner would be willing to share them. That may not be possible, as ham corpora are often private and sensitive.

To view the rule hits in masscheck, assuming that's of interest:
1. go to the detail page for the rule you're interested in, e.g.:
http://ruleqa.spamassassin.org/20160907-r1759562-n/URI_DATA/detail

2. in the "set 0, broken down by contributor" section, click on any link in the HAM% column.

You'll see something like:
. 1 /data/archive/ham-misc//1433183357.M606569P40031.fumail03.cleanmail.ch,S=39348,W=40036%3A2,S HTML_MESSAGE,T_DKIM_INVALID,T_FSL_RCVD_EX_3,T_FSL_RCVD_TR_2,T_FSL_RCVD_UT_3,T_KAM_HTML_FONT_INVALID,T_NOT_A_PERSON,T_REMOTE_IMAGE,URI_DATA,URI_TRUNCATED,__ANY_TEXT_ATTACH,__ANY_TEXT_ATTACH_DOC,__BODY_TEXT_LINE,__BODY_TEXT_LINE,__BODY_TEXT_LINE,__BUGGED_IMG,__CT,__CTYPE_CHARSET_QUOTED,__CTYPE_HAS_BOUNDARY,__CTYPE_MULTIPART_ALT,__CTYPE_MULTIPART_ANY,__DKIM_EXISTS,__DOS_HAS_ANY_URI,__DOS_HAS_LIST_UNSUB,__DOS_RCVD_MON,__DOS_RCVD_SUN,__DOS_RELAYED_EXT,__FROM_ENCODED_QP,__FROM_FULL_NAME,__FROM_NEEDS_MIME,__FSL_COUNT_EXTERN,__FSL_COUNT_EXTERN,__FSL_COUNT_EXTERN,__FSL_COUNT_TRUST,__FSL_COUNT_TRUST,__FSL_COUNT_UNTRUST,__FSL_COUNT_UNTRUST,__FSL_COUNT_UNTRUST,__FSL_HAS_LIST_UNSUB,__HAS_ANY_EMAIL,__HAS_ANY_URI,__HAS_CAMPAIGN,__HAS_DATE,__HAS_DKIM_SIGHD,__HAS_DOMAINKEY_SIG,__HAS_FROM,__HAS_MESSAGE_ID,__HAS_MSGID,__HAS_RCVD,__HAS_REPLY_TO,__HAS_SUBJECT,__HAS_TO,__HAS_URI,__HAVE_BOUNCE_RELAYS,__HTML_LINK_IMAGE,__JM_REACTOR_DATE,__LAST_EXTERNAL_RELAY_NO_AUTH,__LAST_UNTRUSTED_RELAY_NO_AUTH,__LIST_PARTIAL,__LOCAL_PP_NONPPURL,__MIME_HTML,__MIME_VERSION,__MISSING_REF,__MISSING_REPLY,__MISSING_THREAD,__MSGID_OK_HOST,__NAKED_TO,__NONEMPTY_BODY,__NOT_A_PERSON,__RATWARE_0_TZ_DATE,__RCD_RDNS_MX_MESSY,__REMOTE_IMAGE,__REPLYTO_EXISTS,__SANE_MSGID,__SINGLE_WORD_LINE,__SINGLE_WORD_LINE,__SUBJ_2UPPER,__SUBJ_4LOWER,__SUBJ_HAS_WORDS,__SUBJ_NOT_SHORT,__TAG_EXISTS_BODY,__TAG_EXISTS_HEAD,__TAG_EXISTS_HTML,__TAG_EXISTS_META,__TOCC_EXISTS,__TO_NO_ARROWS_R,__TVD_BODY,__TVD_MIME_ATT_TP,__URI_DATA,__URI_DBL_DOM,__URI_MAILTO time=1433136576,scantime=0,format=f,reuse=no,set=0

...which identifies the message in the contributor's corpus and lists all the rules that hit.
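
If you want to pick such a line apart with a script rather than by eye, here is a minimal Python sketch. It is not part of masscheck or ruleqa. It assumes only what the sample shows: five whitespace-separated fields (a flag, a number, the message path, a comma-separated rule list, and comma-separated key=value metadata). The function names are my own, and the sample line is abbreviated from the one above.

```python
from collections import Counter


def parse_hit_line(line):
    """Split one hit line into its five whitespace-separated fields.

    Field meanings are inferred from the sample line only (an assumption).
    The path itself contains commas, so split on whitespace first.
    """
    flag, number, path, rules, meta = line.split(None, 4)
    return {
        "flag": flag,
        "number": number,
        "path": path,
        "rules": rules.split(","),
        "meta": dict(item.split("=", 1) for item in meta.strip().split(",")),
    }


def scored_rules(hit):
    """Return the distinct rule names that are not "__" subrules, in order."""
    seen = []
    for name in hit["rules"]:
        if not name.startswith("__") and name not in seen:
            seen.append(name)
    return seen


# Abbreviated version of the sample line (most rule names dropped).
sample = (
    ". 1 /data/archive/ham-misc//1433183357.M606569P40031.fumail03."
    "cleanmail.ch,S=39348,W=40036%3A2,S "
    "HTML_MESSAGE,URI_DATA,__BODY_TEXT_LINE,__BODY_TEXT_LINE,__URI_DATA "
    "time=1433136576,scantime=0,format=f,reuse=no,set=0"
)

hit = parse_hit_line(sample)
print(hit["path"])
print(scored_rules(hit))
print(Counter(hit["rules"]).most_common(3))
```

Filtering out the "__" names leaves the rules that are not subrules, which is usually the short list you want when checking what else hit alongside URI_DATA.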

I just datamined my three best corpora, from the beginning of
2014 thru this weekend, and found zero FPs, except for two hits
on that "img" test.  My data does NOT prove it's impossible for
anybody else, but it does seem odd, so I'm wondering if the
SA MassCheck mechanism has some means for the contributor to
pull out the corpses of specific hits.

Yes. Given that ID on the first line, the corpus owner can find the message in question, review it, and potentially fix a misclassification (that has happened before).

There's one more exclusion I can add that will take out the last of the FPs in masscheck.

If it doesn't, that would be a cool feature to add. :)

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  The Constitution is a written instrument. As such its meaning does
  not alter. That which it meant when adopted, it means now.
                    -- U.S. Supreme Court
                       SOUTH CAROLINA v. US, 199 U.S. 437, 448 (1905)
-----------------------------------------------------------------------
 9 days until the 229th anniversary of the signing of the U.S. Constitution
