On Tue, 24 Jul 2018, Nick Bright wrote:

On 7/24/2018 9:58 AM, John Hardin wrote:
However, unless you *really* trust the people who are providing training data, you don't train on the submissions without first reviewing them.

Therefore, forwarding as an RFC-822 attachment isn't a deal-killer. You can review the submission and if you approve then save the attachment to the spam or ham training corpus (assuming your MUA allows you to do that).
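To make the review step concrete, here is a minimal sketch of pulling the forwarded original out of a submission so a reviewer can file it. It uses only Python's standard `email` package. The function names, the `spam`/`ham` labels, and the corpus directory layout are assumptions for illustration, not anything prescribed in this thread. Note that re-serializing a parsed message may re-wrap very long header lines, and attachments nested inside a forwarded message are extracted as well.

```python
import email
import hashlib
from email import policy
from pathlib import Path


def extract_forwarded(raw_bytes):
    """Return the bytes of every message/rfc822 attachment in a submission."""
    wrapper = email.message_from_bytes(raw_bytes, policy=policy.default)
    originals = []
    for part in wrapper.walk():
        if part.get_content_type() == "message/rfc822":
            # A message/rfc822 part carries its message as a one-item list.
            inner = part.get_payload(0)
            originals.append(inner.as_bytes())
    return originals


def save_to_corpus(message_bytes, corpus_root, label):
    """Write one reviewed message under corpus_root/<label>/ (hypothetical layout)."""
    if label not in ("spam", "ham"):
        raise ValueError("label must be 'spam' or 'ham'")
    target_dir = Path(corpus_root) / label
    target_dir.mkdir(parents=True, exist_ok=True)
    # Name the file by content hash so a resubmitted message is not duplicated.
    name = hashlib.sha256(message_bytes).hexdigest() + ".eml"
    target = target_dir / name
    target.write_bytes(message_bytes)
    return target
```

The reviewer still makes the spam/ham call; the code only removes the manual save-attachment step once that decision is made.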

I think this is the core of the issue I need to deal with. It looks like it's plausible to automate a training system in several ways, using IMAP folders and RFC-822 attachments, but in all cases it comes back to the quality of user submissions.

Exactly.

Since we are an ISP,

This detail wasn't clear up-front (but apologies if I missed it). There was a suggestion that the proper approach for an ISP is per-user Bayes, and the corollary to that is "let them train their Bayes into garbage if they wish to."

there is a wide variety of skill levels of end users, and relying on them to bring in quality training data is... probably not plausible.

I may simply have to hand off the task of reviewing training data to some of our customer care team, as I don't have time to do it myself.

As a potential middle ground for an ISP:

(1) Keep a hand-vetted training corpus.

(2) After initializing Bayes from that corpus, enable autolearning with conservative thresholds (i.e., the ham threshold more negative and the spam threshold more positive than the defaults). Run expiry as a scheduled task rather than letting it trigger during scanning, where it can cause scan timeouts.

(3) Continue ongoing vetted manual training of FPs and FNs, potentially from a smaller population of trusted users to manage the workload, and add the ongoing vetted training messages to the corpus in (1).

That way you get the benefits of autolearn, while managing the manual review workload and retaining the ability to wipe and retrain to a known-good state if autolearn goes off the rails for some reason.
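As a rough sketch of the "wipe and retrain to a known-good state" step, assuming the filter is SpamAssassin and its `sa-learn` tool is on the PATH: the helper below builds the commands to clear the Bayes database and retrain it from the vetted corpus in (1), plus the command a scheduled job could run for expiry. The function names and corpus paths are hypothetical, and the script defaults to a dry run that only prints what it would execute.

```python
import subprocess


def rebuild_commands(ham_dir, spam_dir):
    """Commands to wipe Bayes and retrain it from a vetted corpus."""
    return [
        ["sa-learn", "--clear"],          # wipe the existing Bayes database
        ["sa-learn", "--ham", ham_dir],   # relearn vetted ham
        ["sa-learn", "--spam", spam_dir], # relearn vetted spam
    ]


def expiry_command():
    """Command for a scheduled job that expires old tokens outside of scanning."""
    return ["sa-learn", "--force-expire"]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence if any step fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical corpus locations; adjust to wherever (1) is kept.
    run_all(rebuild_commands("/var/corpus/ham", "/var/corpus/spam"))
```

The conservative autolearn thresholds in (2) would be set separately in the SpamAssassin configuration (`bayes_auto_learn_threshold_nonspam` and `bayes_auto_learn_threshold_spam`), and `bayes_auto_expire 0` turns off opportunistic expiry so that only the scheduled job performs it.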


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Look at the people at the top of both efforts. Linus Torvalds is a
  university graduate with a CS degree. Bill Gates is a university
  dropout who bragged about dumpster-diving and using other peoples'
  garbage code as the basis for his code. Maybe that has something to
  do with the difference in quality/security between Linux and
  Windows.                           -- anytwofiveelevenis on Y! SCOX
-----------------------------------------------------------------------
 481 days since the first commercial re-flight of an orbital booster (SpaceX)
