On Mon, 30 May 2016 17:45:52 -0400 "Bill Cole" <sausers-20150...@billmail.scconsult.com> wrote:
> So you could have 'sex' and 'meds' and 'watches' tallied up into
> frequency counts that sum up natural (word) and synthetic (concept)
> occurrences, not just as incompatible types of input feature but as
> a conflation of incompatible features.

That is easy to patch by giving "concepts" a separate namespace. You
could do that by picking a character that can't appear in a normal
token and using something like concept*meds, concept*sex, etc. as
tokens.

> FWIW, I have roughly no free time for anything between work and
> family demands but if I did, I would most like to build a blind
> fixed-length tokenization Bayes classifier: just slice up a message
> into all of its n-byte sequences (so that a message of byte length x
> would have x-(n-1) different tokens) and use those as inputs instead
> of words.

I think that could be very effective with (as you said) plenty of
training.

I think there *may* be slight justification for canonicalizing text
parts into UTF-8 first; while you are losing information, it's hard to
see how ζζΊθ²ζ should be treated differently depending on the
character encoding.

Regards,

Dianne.
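
P.S. To make the namespace idea concrete, here is a minimal sketch in
Python (purely illustrative; the '*' separator and the toy word
tokenizer are my own assumptions, not anything SpamAssassin actually
does):

    import re

    CONCEPT_PREFIX = "concept*"

    def word_tokens(text):
        # Natural word tokens: letters and digits only, so '*' can
        # never appear in them and concept tokens can't collide with
        # ordinary words.
        return re.findall(r"[A-Za-z0-9]+", text.lower())

    def concept_tokens(concepts):
        # Synthetic "concept" tokens, e.g. ["meds", "sex"] ->
        # ["concept*meds", "concept*sex"].
        return [CONCEPT_PREFIX + c for c in concepts]

    def all_tokens(text, concepts):
        # Both kinds of feature feed the same frequency counts, but
        # they live in separate namespaces and can't be conflated.
        return word_tokens(text) + concept_tokens(concepts)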
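
P.P.S. And a rough sketch of the blind fixed-length tokenizer, with
the optional canonicalize-to-UTF-8 step in front (again Python, again
just an illustration; n=5, the helper names, and the example charset
are made up). A message of byte length x yields x-(n-1) tokens:

    def canonicalize(raw_bytes, declared_charset):
        # Decode using the part's declared charset and re-encode as
        # UTF-8, so the same characters always map to the same byte
        # sequences regardless of the original encoding.
        return raw_bytes.decode(declared_charset,
                                errors="replace").encode("utf-8")

    def byte_ngrams(data, n=5):
        # Every n-byte window of the message: a message of byte
        # length x gives x-(n-1) tokens (none at all if x < n).
        return [data[i:i+n] for i in range(len(data) - n + 1)]

    # Example (hypothetical charset):
    # tokens = byte_ngrams(canonicalize(body_bytes, "big5"), n=5)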