On 21.01.2016 at 20:38, RW wrote:
On Thu, 21 Jan 2016 08:53:10 -0800 (PST)
John Hardin wrote:


There was an improvement in FP and FN from two tokens. The marginal
improvement from three doesn't seem worth it.

The improvement from 2 to 3 is more substantial than from 1 to 2:

  287/160 = 1.79

  160/69  = 2.3

Whether any of this is worth it depends on a lot of things. I don't
think it's even obvious whether 3-word tokenization is more resource
intensive than 2-word. Clearly, in the limit where ntokens goes to
infinity, 3-word will outperform 2-word at the same database size,
which means that it can achieve the same level of performance with a
smaller database. I've no feeling for the value of ntokens at which
that switches around.
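
For readers unfamiliar with the idea, here is a minimal Python sketch of what "N-word tokens" could mean. It is only an illustration under the assumption that a multi-word token is a run of consecutive words; it is not SpamAssassin's actual tokenizer, and `multiword_tokens` is a hypothetical helper name.

```python
from typing import List


def multiword_tokens(words: List[str], max_words: int) -> List[str]:
    """Return every run of 1..max_words consecutive words, joined by a space."""
    tokens = []
    for n in range(1, max_words + 1):
        # Slide a window of n words across the message.
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


words = "buy cheap pills online now".split()
for max_words in (1, 2, 3):
    # Token count per message grows with max_words: 5, 9, 12 here.
    print(max_words, len(multiword_tokens(words, max_words)))
```

Each extra word of context adds more distinct tokens per message, which is why database size and accuracy have to be weighed against each other.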

If SA provided a parameter for this, e.g. "bayes_multiword_tokens <integer>", I could test it against 80000 messages with different <integer> values. There is also a 700-entry ignore list for our daily check, which could be tested automatically as well: do those entries swap over to BAYES_999 like the rest while all ham samples still get BAYES_00?

I run those tests every night against the whole corpus, with a report to detect mis-training when samples previously classified as BAYES_999 or BAYES_00 change their result.

That's done with a dedicated SA instance doing only the Bayes test and nothing else, fed by "spamc" with the output parsed; it takes around 1 hour on the current hardware.
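
A rough Python sketch of such a nightly check might look like the following. It is not the actual worker script: the function names are hypothetical, and it assumes that `spamc -y` prints the names of the tests hit on one comma-separated line.

```python
import re
import subprocess
from typing import Dict, List, Optional

# Matches a Bayes rule name such as BAYES_00 or BAYES_999.
BAYES_RE = re.compile(r"\bBAYES_\d+\b")


def bayes_rule(tests_line: str) -> Optional[str]:
    """Extract the BAYES_* rule name from a comma-separated list of tests hit."""
    match = BAYES_RE.search(tests_line)
    return match.group(0) if match else None


def classify(path: str) -> Optional[str]:
    """Feed one message to spamc and return its BAYES_* rule (assumes `spamc -y`)."""
    with open(path, "rb") as handle:
        result = subprocess.run(["spamc", "-y"], stdin=handle,
                                capture_output=True, text=True)
    return bayes_rule(result.stdout)


def changed(previous: Dict[str, Optional[str]],
            current: Dict[str, Optional[str]]) -> List[str]:
    """List samples that were BAYES_999 or BAYES_00 before and differ now."""
    report = []
    for name, old in sorted(previous.items()):
        new = current.get(name)
        if old in ("BAYES_999", "BAYES_00") and new != old:
            report.append(f"{name}: {old} -> {new}")
    return report


# Made-up sample names and results, purely for illustration.
previous = {"a.eml": "BAYES_999", "b.eml": "BAYES_00", "c.eml": "BAYES_50"}
current = {"a.eml": "BAYES_99", "b.eml": "BAYES_00", "c.eml": "BAYES_80"}
print(changed(previous, current))
```

Only samples that previously sat at one of the two extremes are reported, so ordinary drift in the middle of the range does not clutter the report.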
________________________

The exclude list can be checked in isolation with a parameter, and anything that has reached BAYES_999 is automatically removed. It looks like the output below (no, the worker scripts are not running as root).

So the first test would run that with 2-, 3- and 4-word tokens and look at how many samples change to BAYES_999 while no ham samples from the large tests lose their BAYES_00.
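
The comparison per token width could be tallied roughly as in this Python sketch. The data below is made up, and `evaluate` and `SweepResult` are hypothetical names; it only shows the two counts that matter: spam samples newly reaching BAYES_999 and ham samples losing BAYES_00.

```python
from typing import Dict, NamedTuple


class SweepResult(NamedTuple):
    spam_gained: int  # spam samples that newly reach BAYES_999
    ham_lost: int     # ham samples that no longer get BAYES_00


def evaluate(baseline_spam: Dict[str, str], baseline_ham: Dict[str, str],
             run_spam: Dict[str, str], run_ham: Dict[str, str]) -> SweepResult:
    """Compare one test run against the baseline classification."""
    gained = sum(1 for name, rule in run_spam.items()
                 if rule == "BAYES_999" and baseline_spam.get(name) != "BAYES_999")
    lost = sum(1 for name, rule in baseline_ham.items()
               if rule == "BAYES_00" and run_ham.get(name) != "BAYES_00")
    return SweepResult(gained, lost)


# Made-up results keyed by token width, purely for illustration.
baseline_spam = {"s1": "BAYES_80", "s2": "BAYES_50", "s3": "BAYES_99"}
baseline_ham = {"h1": "BAYES_00", "h2": "BAYES_00"}
runs = {
    2: ({"s1": "BAYES_999", "s2": "BAYES_80", "s3": "BAYES_99"},
        {"h1": "BAYES_00", "h2": "BAYES_00"}),
    3: ({"s1": "BAYES_999", "s2": "BAYES_999", "s3": "BAYES_999"},
        {"h1": "BAYES_00", "h2": "BAYES_05"}),
}
for width, (run_spam, run_ham) in sorted(runs.items()):
    print(width, evaluate(baseline_spam, baseline_ham, run_spam, run_ham))
```

A width would only be interesting if `ham_lost` stays at 0 while `spam_gained` goes up.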

I can clone that machine and rebuild the whole Bayes database from scratch within 15 minutes from the corpus files.
________________________

[root@mail-gw:~]$ corpus-stats ignored
NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-14-54-26-20d340f85ff0e415a34776f2ddac2f98.eml
1 / 639 (SPAM: 2016-01-20-14-54-26-20d340f85ff0e415a34776f2ddac2f98.eml)

NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-13-39-17-ecd6cd231935b352cd1c184224987b03.eml
2 / 639 (SPAM: 2016-01-20-13-39-17-ecd6cd231935b352cd1c184224987b03.eml)

NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-13-39-17-41624e1f3a9314bbf56fedfbc3e56e11.eml
3 / 639 (SPAM: 2016-01-20-13-39-17-41624e1f3a9314bbf56fedfbc3e56e11.eml)

NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-ada25ecf2eb04344e23d853bc59a85b2.eml
4 / 639 (SPAM: 2016-01-20-12-17-18-ada25ecf2eb04344e23d853bc59a85b2.eml)

NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-234598268235618c8167a1e9c93701c8.eml
5 / 639 (SPAM: 2016-01-20-12-17-18-234598268235618c8167a1e9c93701c8.eml)

NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-720e314e86bf81550966764d7fd8d802.eml
6 / 639 (SPAM: 2016-01-20-12-17-18-720e314e86bf81550966764d7fd8d802.eml)

NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-34df375a0ac059678e7d053bad31acdc.eml
7 / 639 (SPAM: 2016-01-20-12-17-18-34df375a0ac059678e7d053bad31acdc.eml)

