On 21.01.2016 at 20:38, RW wrote:
> On Thu, 21 Jan 2016 08:53:10 -0800 (PST) John Hardin wrote:
>> There was an improvement in FP and FN from two tokens. The marginal
>> improvement from three doesn't seem worth it.
>
> The improvement from 2 to 3 is more substantial than from 1 to 2
>
> 287/160 = 1.79
> 160/69 = 2.3
>
> Whether any of this is worth it depends on a lot of things. I don't
> think it's even obvious whether 3-word tokenization is more resource
> intensive than 2-word. Clearly in the limit where ntokens goes to
> infinity 3-word will outperform 2-word at the same database size, which
> means that it can achieve the same level of performance with a smaller
> database. I've no feeling for what value of ntokens that switches around
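To make the 1-, 2- and 3-word comparison concrete, here is a minimal sketch of multi-word tokenization. It is not SpamAssassin's tokenizer: the function name `multiword_tokens`, the whitespace splitting and the sample text are assumptions made only for illustration. It shows how the number of tokens per message grows with the maximum token length, which is what drives database size.

```python
def multiword_tokens(text, max_words):
    """Return every run of 1..max_words consecutive words as one token.

    Hypothetical helper for illustration only; real Bayes tokenizers
    also handle headers, case, punctuation and so on.
    """
    words = text.lower().split()
    tokens = []
    for n in range(1, max_words + 1):
        # Slide a window of n words across the message.
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


sample = "buy cheap pills online now"  # made-up sample text
for max_words in (1, 2, 3):
    print(max_words, len(multiword_tokens(sample, max_words)))
```

For a message of w words, going from a maximum of n-1 to n words adds w - n + 1 tokens, so each extra word of token length costs slightly less than the previous one per message, although the number of distinct tokens across a corpus can still grow quickly.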
if SA provided an additional param like "bayes_multiword_tokens <integer>" i could test it against 80000 messages with different <integer> values. there is also a 700-entry ignore list for our daily check, which could be tested automatically too: whether those samples move over to BAYES_999 like the rest while all ham samples still have BAYES_00
i run those tests every night against the whole corpus, with a report to detect mis-training when samples previously classified as BAYES_999 or BAYES_00 change their result
that's done with a dedicated SA instance doing only the bayes test and nothing else, fed by "spamc" with the output parsed; it takes around 1 hour on the current hardware
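The nightly mis-training check described above boils down to comparing two sets of verdicts. A rough sketch, assuming the per-sample results have already been parsed into dictionaries; the function name `find_changed_verdicts` and the sample file names are made up and are not the actual worker scripts:

```python
def find_changed_verdicts(previous, current, watched=("BAYES_999", "BAYES_00")):
    """Return {sample: (old, new)} for samples that lost a watched verdict.

    previous/current map a sample file name to its Bayes rule name.
    Samples missing from the current run are ignored.
    """
    changed = {}
    for sample, old in previous.items():
        new = current.get(sample)
        if old in watched and new is not None and new != old:
            changed[sample] = (old, new)
    return changed


# Made-up example data, only to show the shape of the report.
last_night = {"spam-a.eml": "BAYES_999", "ham-b.eml": "BAYES_00", "spam-c.eml": "BAYES_80"}
tonight = {"spam-a.eml": "BAYES_95", "ham-b.eml": "BAYES_00", "spam-c.eml": "BAYES_999"}
print(find_changed_verdicts(last_night, tonight))
```

Only samples that previously sat at one of the two extremes are reported, so a sample moving up to BAYES_999 (like spam-c.eml here) does not count as a regression.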
________________________
the exclude list can be checked in isolation with a param, and anything which has reached BAYES_999 is automatically removed; it looks like below (no, the worker scripts are not running as root)
so the first test would run that with 2-, 3- and 4-word tokens and look at how many samples change to BAYES_999 while no ham samples from the large tests lose their BAYES_00
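The acceptance rule for that comparison could be sketched like this. It assumes one run per token length, each parsed into verdict dictionaries for the ignore-list spam and for the ham corpus; `summarize_run` and `acceptable_settings` are hypothetical names, not part of SA or of the existing scripts:

```python
def summarize_run(spam_verdicts, ham_verdicts):
    """Count ignore-list spam now at BAYES_999 and list ham that lost BAYES_00."""
    promoted = sum(1 for verdict in spam_verdicts.values() if verdict == "BAYES_999")
    ham_lost = sorted(name for name, verdict in ham_verdicts.items() if verdict != "BAYES_00")
    return promoted, ham_lost


def acceptable_settings(runs):
    """runs maps a token length to (spam_verdicts, ham_verdicts).

    Returns (token_length, promoted) pairs for runs in which no ham
    sample lost BAYES_00, sorted with the most promoted spam first.
    """
    accepted = []
    for n_words, (spam_verdicts, ham_verdicts) in runs.items():
        promoted, ham_lost = summarize_run(spam_verdicts, ham_verdicts)
        if not ham_lost:
            accepted.append((n_words, promoted))
    return sorted(accepted, key=lambda item: item[1], reverse=True)
```

Any token length that costs even one ham sample its BAYES_00 is rejected outright; among the rest, the one that moves the most ignore-list spam to BAYES_999 comes first.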
i can clone that machine and re-build the whole bayes database from scratch within 15 minutes from the corpus files
________________________
[root@mail-gw:~]$ corpus-stats ignored
NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-14-54-26-20d340f85ff0e415a34776f2ddac2f98.eml
1 / 639 (SPAM: 2016-01-20-14-54-26-20d340f85ff0e415a34776f2ddac2f98.eml)
NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-13-39-17-ecd6cd231935b352cd1c184224987b03.eml
2 / 639 (SPAM: 2016-01-20-13-39-17-ecd6cd231935b352cd1c184224987b03.eml)
NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-13-39-17-41624e1f3a9314bbf56fedfbc3e56e11.eml
3 / 639 (SPAM: 2016-01-20-13-39-17-41624e1f3a9314bbf56fedfbc3e56e11.eml)
NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-ada25ecf2eb04344e23d853bc59a85b2.eml
4 / 639 (SPAM: 2016-01-20-12-17-18-ada25ecf2eb04344e23d853bc59a85b2.eml)
NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-234598268235618c8167a1e9c93701c8.eml
5 / 639 (SPAM: 2016-01-20-12-17-18-234598268235618c8167a1e9c93701c8.eml)
NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-720e314e86bf81550966764d7fd8d802.eml
6 / 639 (SPAM: 2016-01-20-12-17-18-720e314e86bf81550966764d7fd8d802.eml)
NON-BAYES-999: /var/lib/spamass-milter/training/spam/2016-01-20-12-17-18-34df375a0ac059678e7d053bad31acdc.eml
7 / 639 (SPAM: 2016-01-20-12-17-18-34df375a0ac059678e7d053bad31acdc.eml)