On Thu, 21 Jan 2016, RW wrote:

On Thu, 21 Jan 2016 08:53:10 -0800 (PST)
John Hardin wrote:

There was an improvement in FP and FN from two tokens. The marginal
improvement from three doesn't seem worth it.

The improvement from 2 to 3 is more substantial than from 1 to 2:

287/160 = 1.79

160/69  = 2.3

Ugh. I looked at the raw numbers rather than the ratio - sorry.

287/69 looks even better, 4.2
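
For anyone wanting to redo the arithmetic, here is a minimal sketch. It assumes 287, 160 and 69 are error counts from the same test run for 1-, 2- and 3-word tokens; the names `errors` and `improvement` are made up for illustration.

```python
# Counts quoted in the thread, keyed by words per token.
# Assumption: these are error totals measured on the same corpus.
errors = {1: 287, 2: 160, 3: 69}

def improvement(before, after):
    """Factor by which the count shrinks going from `before` to `after` words per token."""
    return errors[before] / errors[after]

# Ratios for each step, rounded to two decimals.
ratios = {(a, b): round(improvement(a, b), 2) for a, b in [(1, 2), (2, 3), (1, 3)]}

for (a, b), r in ratios.items():
    print(f"{a}-word -> {b}-word: {r}x fewer errors")
```

The two steps multiply together (1.79 * 2.32 is roughly 4.16), which is why the 1-to-3 ratio looks so much better than either step alone.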

Whether any of this is worth it depends on a lot of things. I don't
think it's even obvious whether 3-word tokenization is more resource
intensive than 2-word. Clearly, in the limit where ntokens goes to
infinity, 3-word will outperform 2-word at the same database size,
which means that it can achieve the same level of performance with a
smaller database. I've no feeling for the value of ntokens at which
that switches around.

So it should be configurable. If you change it, you monitor token
database size, scan times, and FP/FN rate, and adjust token expiry to
manage them, or switch it back to 1 if the improvement costs too much.
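
As a rough illustration of what a configurable setting would mean for token counts, here is a sketch of an n-word tokenizer. This is not SpamAssassin's actual tokenizer; `word_tokens` and `max_words` are hypothetical names, and splitting on whitespace is a simplifying assumption.

```python
def word_tokens(text, max_words=1):
    """Return all runs of 1..max_words consecutive words as tokens.

    Hypothetical helper: real tokenizers also handle headers,
    punctuation, case rules and so on.
    """
    words = text.lower().split()
    tokens = []
    for n in range(1, max_words + 1):
        # Slide a window of n words across the message.
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens

sample = "buy cheap pills now"
for n in (1, 2, 3):
    print(n, len(word_tokens(sample, max_words=n)))
```

Each extra word of context adds nearly one more token per word of input, so the database size and expiry settings are the things to watch when raising the value.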

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Maxim IV: Close air support covereth a multitude of sins.
-----------------------------------------------------------------------
 2 days until John Moses Browning's 161st Birthday
