On Mon, 30 May 2016 17:45:52 -0400
"Bill Cole" <sausers-20150...@billmail.scconsult.com> wrote:

> So you could have 'sex' and 'meds' and 'watches' tallied up into
> frequency counts that sum up natural (word) and synthetic (concept)
> occurrences, not just as incompatible types of input feature but as
> a conflation of incompatible features.

That is easy to patch by giving "concepts" a separate namespace.  You
could do that by picking a character that can't appear in a normal
token and using tokens like concept*meds, concept*sex, and so on.
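A rough Python sketch of what I mean (the concept rules and names here
are made up for illustration, not how the real tokenizer works): word
tokens and synthetic concept tokens go into the same counter, but the
'*' keeps the two namespaces from ever colliding.

import re
from collections import Counter

# Hypothetical concept rules; in practice the "concepts" would come
# from whatever rule hits the scanner produced.
CONCEPT_RULES = {
    "meds": {"viagra", "cialis", "pills"},
    "sex":  {"sex", "xxx"},
}

def tokenize(text):
    words = re.findall(r"[a-z0-9]+", text.lower())
    tokens = list(words)                      # natural word tokens
    for concept, triggers in CONCEPT_RULES.items():
        for w in words:
            if w in triggers:
                # '*' cannot appear in a word token, so the synthetic
                # token can never be confused with a natural one
                tokens.append("concept*" + concept)
    return Counter(tokens)

print(tokenize("Cheap pills, viagra and more pills"))
# Counter({'concept*meds': 3, 'pills': 2, 'cheap': 1, ...})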

> FWIW, I have roughly no free time for anything between work and
> family demands but if I did, I would most like to build a blind
> fixed-length tokenization Bayes classifier: just slice up a message
> into all of its n-byte sequences (so that a message of bytelength x
> would have x-(n-1) different tokens) and use those as inputs instead
> of words.

I think that could be very effective with (as you said) plenty of
training.  There *may* also be a slight justification for
canonicalizing text parts to UTF-8 first; you lose some information,
but it's hard to see why ζ‰‹ζœΊθ‰²ζƒ… (roughly, "mobile porn") should be
treated differently depending on the character encoding.
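A quick sketch of the idea as I read it (Python; the n=4 and the
function name are just mine for illustration).  The decode/re-encode
step is the optional canonicalization I mean:

def byte_ngrams(raw, declared_charset="iso-8859-1", n=4):
    # Optional canonicalization: decode with the declared charset and
    # re-encode as UTF-8, so the same text yields the same byte tokens
    # no matter how the sender encoded it.
    data = raw.decode(declared_charset, errors="replace").encode("utf-8")
    # Every overlapping n-byte slice: a message of byte length x
    # produces x-(n-1) tokens.
    return [data[i:i + n] for i in range(len(data) - n + 1)]

body = "ζ‰‹ζœΊθ‰²ζƒ… cheap meds".encode("utf-8")
tokens = byte_ngrams(body, declared_charset="utf-8")
print(len(body), len(tokens))   # 23 20, i.e. x and x-(n-1)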

Regards,

Dianne.
