On Tue, 24 Jul 2018, Nick Bright wrote:
> On 7/24/2018 9:58 AM, John Hardin wrote:
>> However, unless you *really* trust the people who are providing training
>> data, you don't train on the submissions without first reviewing them.
>> Therefore, forwarding as an RFC-822 attachment isn't a deal-killer. You can
>> review the submission and if you approve then save the attachment to the
>> spam or ham training corpus (assuming your MUA allows you to do that).
> I think this is the core of the issue I need to deal with. It looks like it's
> plausible to automate a training system in several ways, using IMAP folders
> and RFC-822 attachments, but in all cases it comes back to the quality of
> user submissions.
Exactly.
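For the review step, here is a minimal sketch of pulling the original message out of a forwarded-as-attachment submission and dropping it into a holding directory for a human to look at. It assumes Python's standard email package; the function names and the queue directory are hypothetical, and nothing here trains Bayes. Note that the attachment is re-serialized, so the saved copy may not be byte-identical to what the user received.

```python
import email
import hashlib
from pathlib import Path


def extract_forwarded(raw: bytes) -> list[bytes]:
    """Return the bytes of every message/rfc822 attachment in a submission."""
    outer = email.message_from_bytes(raw)
    originals = []
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            # A message/rfc822 part carries the attached message as the
            # single element of its payload list.
            inner = part.get_payload(0)
            originals.append(inner.as_bytes())
    return originals


def queue_for_review(raw: bytes, queue_dir: Path) -> list[Path]:
    """Write each attached message to queue_dir (hypothetical layout)."""
    queue_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for original in extract_forwarded(raw):
        # Name files by content hash so a resubmission overwrites itself.
        name = hashlib.sha256(original).hexdigest()[:16] + ".eml"
        path = queue_dir / name
        path.write_bytes(original)
        written.append(path)
    return written
```

A reviewer would then move approved files from the queue into the ham or spam corpus by hand; only that vetted corpus is ever fed to sa-learn.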
> Since we are an ISP,
This detail wasn't clear up-front (but apologies if I missed it). There
was a suggestion that the proper approach for an ISP is per-user Bayes,
and the corollary to that is "let them train their Bayes into garbage if
they wish to."
> there is a wide variety of skill
> levels of end users, and relying on them to bring in quality training data
> is... probably not plausible.
> I may simply have to source the task of reviewing training data to some of
> our customer care team, as I don't have time to do it myself.
As a potential middle ground for an ISP:
(1) Keep a hand-vetted training corpus.
(2) After initializing Bayes from that corpus, enable autolearning with
conservative thresholds (i.e. ham more-negative, spam more-positive than
the defaults). Run expiry as a scheduled task rather than during scanning,
so that an expiry run cannot cause a scan timeout.
(3) Continue ongoing vetted manual training of FPs and FNs, potentially
from a smaller population of trusted users to manage the workload, and add
the ongoing vetted training messages to the corpus in (1).
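As a rough sketch of (2), a local.cf fragment might look like the following. The option names are standard SpamAssassin Bayes settings; the threshold values are illustrative assumptions only (the stock defaults, as I recall, are 0.1 for nonspam and 12.0 for spam), so tune them against your own mail.

```
# Sketch only; threshold values are assumptions, not recommendations.
use_bayes 1
bayes_auto_learn 1
# Learn as ham only when the score is clearly negative.
bayes_auto_learn_threshold_nonspam -1.0
# Learn as spam only when the score is well above the default threshold.
bayes_auto_learn_threshold_spam 15.0
# Do not expire tokens opportunistically during a scan.
bayes_auto_expire 0
```

With bayes_auto_expire off, expiry would be run from cron (or similar) with "sa-learn --force-expire", as the user that owns the Bayes database.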
That way you get the benefits of autolearn, while managing the manual
review workload and retaining the ability to wipe and retrain to a
known-good state if autolearn goes off the rails for some reason.
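The wipe-and-retrain step could be scripted along these lines. This is a sketch under assumptions: the corpus is laid out as ham/ and spam/ subdirectories (a hypothetical layout), sa-learn is on the PATH, and the script runs as the user that owns the Bayes database. It defaults to a dry run that only prints the commands.

```python
import subprocess
from pathlib import Path


def rebuild_commands(corpus: Path) -> list[list[str]]:
    """Build the sa-learn invocations for a full retrain from the corpus."""
    return [
        ["sa-learn", "--clear"],                      # wipe the Bayes database
        ["sa-learn", "--ham", str(corpus / "ham")],   # relearn vetted ham
        ["sa-learn", "--spam", str(corpus / "spam")], # relearn vetted spam
    ]


def rebuild(corpus: Path, dry_run: bool = True) -> None:
    """Print the commands, or run them in order when dry_run is False."""
    for cmd in rebuild_commands(corpus):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the rebuild if any step fails.
            subprocess.run(cmd, check=True)
```

Calling rebuild(Path("/path/to/corpus")) prints the three commands; pass dry_run=False once the output looks right.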
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79