Adam, if you'd like to try these out, I'd be very happy ;) masses/bayes-testing/README in the SA svn repository describes how we test new tokenization strategies, in order to pick the ones that actually _work_. (What really helps is quite counterintuitive at times.)

also, there's experimental code to use a multi-word tokenization as part of the OSBF/Winnow plugin, but it's stalled due to a lack of accuracy compared to the existing Bayes code. if you're curious, it lives here, or at least would if the bugzilla was working --

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5686

--j.

On Tue, May 12, 2009 at 21:22, Adam Katz <antis...@khopis.com> wrote:
> Adam Katz wrote:
>>> vi'aqra pr,ofe'ssio,nal matters very much to your s.e,x
>>> be self-satisfied - use vi'aqra s<u>per act,i've
>>> vi'aqra pr<o>fessional - never forget about your s'e.x
>>> test s p a c e d words t w i c e in a line
>>> this is an act--i've shown it 5 x, a record!
>
> Ignore the missing /^test / below ... I truncated it for wrapping...
>
>>> viaqra professional matters very much to your sex
>>> be selfsatisfied use viaqra super active
>>> viaqra professional never forget about your sex
>>> s p a c e d words t w i c e in a line spaced:spaced spaced:twice
>>> this is an active shown it 5 x a record spaced:5xa
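
The before/after lines quoted above suggest a normalization that strips punctuation from inside words and adds an extra `spaced:` token for each run of single-character words. Here is a rough Python sketch of that idea, assuming those two rules are all that is going on. The function name `deobfuscate_tokens` and the minimum run length of 3 are hypothetical, and this is not the SpamAssassin or OSBF/Winnow code.

```python
import re


def deobfuscate_tokens(line, min_run=3):
    """Return normalized tokens plus 'spaced:' tokens for spaced-out words.

    Assumptions (not taken from SpamAssassin): every character that is
    neither a word character nor whitespace is dropped, and a run of at
    least `min_run` single-character tokens counts as one spaced-out word.
    """
    # Drop punctuation so "vi'aqra" and "s<u>per" collapse into one word.
    cleaned = re.sub(r"[^\w\s]", "", line)
    tokens = cleaned.split()

    extras = []
    run = []
    # The empty sentinel flushes a run that reaches the end of the line.
    for tok in tokens + [""]:
        if len(tok) == 1:
            run.append(tok)
        else:
            if len(run) >= min_run:
                extras.append("spaced:" + "".join(run))
            run = []
    return tokens + extras


if __name__ == "__main__":
    for sample in (
        "be self-satisfied - use vi'aqra s<u>per act,i've",
        "s p a c e d words t w i c e in a line",
    ):
        print(" ".join(deobfuscate_tokens(sample)))
```

Whether extra tokens like these actually improve accuracy is something only the bayes-testing procedure mentioned above can show.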