On 31.05.2016 at 02:30, Bill Cole wrote:
On 30 May 2016, at 18:25, Dianne Skoll wrote:

On Mon, 30 May 2016 17:45:52 -0400 "Bill Cole" <sausers-20150...@billmail.scconsult.com> wrote:

So you could have 'sex' and 'meds' and 'watches' tallied up into frequency counts that sum up natural (word) and synthetic (concept) occurrences, not just as incompatible types of input feature but as a conflation of incompatible features.

That is easy to patch by giving "concepts" a separate namespace. You could do that by picking a character that can't be in a normal token and using something like: concept*meds, concept*sex, etc. as tokens.

Yes, but I'd still be reluctant to have that namespace directly blended with 1-word Bayes because those "concepts" are qualitatively different: inherently much more complex in their measurement than words. Robotic semantic analysis hasn't reached the point where an unremarkable machine can decide whether a message is porn or a discussion of current political issues, and I would not hazard a guess as to which actual concept in email is more likely to be spam or ham these days. Any old mail server can of course tell whether the word 'Carolina' is present in a message, which probably distributes quite disproportionately towards ham
Was the difference having two tokens, "hot" and "sex", versus three tokens, "hot", "sex" and "hot sex", for Bayes classification?
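
To make the token counts concrete, here is a minimal Python sketch of the two ideas discussed above: adding word-pair tokens next to single-word tokens, and putting "concepts" in their own namespace with a `concept*` prefix. This is not SpamAssassin's tokenizer. The `tokenize` function, its `bigrams` and `concepts` flags, and the `CONCEPT_WORDS` lexicon are all hypothetical names invented for illustration, and the keyword lookup is only a stand-in for whatever would really detect a concept.

```python
import re

# Hypothetical concept lexicon, invented for illustration only.
CONCEPT_WORDS = {
    "meds": {"pills", "pharmacy", "prescription"},
    "watches": {"rolex", "watch", "watches"},
}


def tokenize(text, bigrams=False, concepts=False):
    """Return the list of tokens a Bayes classifier would count."""
    # Word tokens: runs of lowercase letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = list(words)
    if bigrams:
        # Each adjacent word pair becomes one extra token, e.g. "hot sex".
        tokens += [f"{a} {b}" for a, b in zip(words, words[1:])]
    if concepts:
        # '*' can never appear in a word token (see the regex above), so
        # "concept*meds" cannot collide with the ordinary word "meds".
        for name, vocab in sorted(CONCEPT_WORDS.items()):
            if vocab.intersection(words):
                tokens.append(f"concept*{name}")
    return tokens


print(tokenize("hot sex"))                # ['hot', 'sex']
print(tokenize("hot sex", bigrams=True))  # ['hot', 'sex', 'hot sex']
print(tokenize("cheap meds from our pharmacy", concepts=True))
```

In the last call the word `meds` and the synthetic `concept*meds` are counted as separate tokens, which is the separation the namespace is meant to give. Whether the two kinds of token should feed the same classifier is the open question in this thread, and the sketch does not settle it.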