On Thu, 21 Jan 2016, RW wrote:

On Thu, 21 Jan 2016 14:31:09 +0100
Christian Laußat wrote:

On 21.01.2016 14:17, RW wrote:
The FNs dropped from 287 to 69, which I'd call a four-fold
improvement.

The FPs rose from 0 to 1, but that mail was ham quoting a full
spam, so arguably it just did a better job in detecting the
embedded spam.

Yes, but is it really worth the resources? I mean, the database got
13 times larger for 3-word tokens, and with more words per token it
will grow exponentially.
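
To illustrate why the token database grows with multi-word tokens, here is a minimal sketch that counts distinct tokens in a toy message. It assumes the tokenizer keeps every run of 1 to N consecutive words as a token; the thread does not say exactly how the tested tokenizer builds its tokens, and the function names and sample text are hypothetical.

```python
def word_ngrams(words, n):
    """Return every run of n consecutive words as one space-joined token."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def tokens_up_to(text, max_n):
    """Collect the distinct tokens of 1..max_n words (assumed scheme, not SA's)."""
    words = text.lower().split()
    tokens = set()
    for n in range(1, max_n + 1):
        tokens.update(word_ngrams(words, n))
    return tokens


# Hypothetical sample message, only for illustration.
sample = "buy cheap pills now buy cheap watches now"

for max_n in (1, 2, 3):
    print(max_n, len(tokens_up_to(sample, max_n)))
```

In this toy case repeated single words collapse into one token, while most of the longer word runs are unique, which is the effect that inflates the database. How fast a real database grows depends on the corpus.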

But if you are training on error it only grows by a factor of 3.1
(13*69/287).  You also have to consider what happens if you simply
reduce the retention time by a factor of 3.1 - that corpus had 4 years
retention so it's unlikely that maintaining a constant size database
would have made much difference in this case. When you train from
corpus the database size is dominated by ephemeral tokens which makes
the situation look worse than it is.
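
The arithmetic above can be checked with a few lines. This is only a sketch of the reasoning as I read it: it assumes that, when training on error, the number of trained messages scales with the number of FNs, while each trained message contributes about 13 times as many tokens. The variable names are made up for illustration.

```python
# Figures quoted in this thread.
fn_single_word = 287     # false negatives with single-word tokens
fn_multi_word = 69       # false negatives with multi-word tokens
corpus_db_growth = 13    # database growth when training from corpus

# Improvement in false negatives (the "four-fold" figure).
fn_reduction = fn_single_word / fn_multi_word

# Estimated database growth when training on error only:
# fewer errors means fewer trained messages, each with ~13x the tokens.
train_on_error_growth = corpus_db_growth * fn_multi_word / fn_single_word

print(f"FN reduction factor: {fn_reduction:.2f}")
print(f"Train-on-error database growth: {train_on_error_growth:.2f}")
```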

It depends what you want. I don't care about an extra 100 MB
of disk space and a few milliseconds if it gives any measurable
improvement.

Personally I wouldn't like to see Bayes go multi-word because it would
likely end up as a poor compromise. Two-word tokenization is the
default on DSPAM, but I've not seen anyone advocate using it. I think
it's better to score in an external filter that runs in addition to
Bayes.

There was an improvement in FP and FN from two-word tokens. The marginal improvement from three-word tokens doesn't seem worth it.

I'd like to see an SA Bayes config option to select between one-word and two-word tokens.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Public Education: the bureaucratic process of replacing
  an empty mind with a closed one.                          -- Thorax
-----------------------------------------------------------------------
 2 days until John Moses Browning's 161st Birthday
