On 31 May 2016, at 02:30, Bill Cole wrote:
On 30 May 2016, at 18:25, Dianne Skoll wrote:

On Mon, 30 May 2016 17:45:52 -0400
"Bill Cole" <sausers-20150...@billmail.scconsult.com> wrote:

So you could have 'sex' and 'meds' and 'watches' tallied up into
frequency counts that sum up natural (word) and synthetic (concept)
occurrences, not just as incompatible types of input feature but as
a conflation of incompatible features.

That is easy to patch by giving "concepts" a separate namespace.  You
could do that by picking a character that can't be in a normal token and
using something like:  concept*meds, concept*sex, etc. as tokens.
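
(For illustration only, a minimal Python sketch of that separate-namespace
idea, not SpamAssassin's actual tokenizer; the concept names and regexes
below are made-up stand-ins for whatever a real concept detector would emit:)

import re

# Hypothetical concept detectors: each maps a concept name to a crude
# pattern standing in for whatever semantic analysis actually produces.
CONCEPT_PATTERNS = {
    "meds": re.compile(r"\b(viagra|cialis|pharmacy)\b", re.I),
    "sex":  re.compile(r"\bhot\s+singles\b", re.I),
}

def tokenize(text):
    # Plain 1-word tokens, exactly what word-level Bayes already counts.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Concept hits get a 'concept*' prefix; '*' never appears in a word
    # token, so the two kinds of features cannot be conflated.
    for name, pattern in CONCEPT_PATTERNS.items():
        if pattern.search(text):
            tokens.append("concept*" + name)
    return tokens

print(tokenize("Hot singles and a cheap online pharmacy"))
# ['hot', 'singles', 'and', 'a', 'cheap', 'online', 'pharmacy',
#  'concept*meds', 'concept*sex']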

Yes, but I'd still be reluctant to have that namespace directly blended
with 1-word Bayes because those "concepts" are qualitatively different:
inherently much more complex in their measurement than words. Robotic
semantic analysis hasn't reached the point where an unremarkable machine
can decide whether a message is porn or a discussion of current
political issues, and I would not hazard a guess as to which actual
concept in email is more likely to be spam or ham these days. Any old
mail server can of course tell whether the word 'Carolina' is present in
a message, a word that probably distributes quite disproportionately
towards ham.

So would the difference be having two tokens, "hot" and "sex", versus three tokens, "hot", "sex", and "hot sex", for Bayes classification?
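
(To make the question concrete, a small Python sketch, not any particular
filter's code, of the same text tokenized both ways:)

import re

def unigrams(text):
    # 1-word tokens only
    return re.findall(r"[a-z0-9']+", text.lower())

def unigrams_and_bigrams(text):
    # 1-word tokens plus each adjacent pair as its own token, so the
    # co-occurrence "hot sex" gets a frequency count of its own.
    words = unigrams(text)
    return words + [a + " " + b for a, b in zip(words, words[1:])]

print(unigrams("hot sex pills"))
# ['hot', 'sex', 'pills']
print(unigrams_and_bigrams("hot sex pills"))
# ['hot', 'sex', 'pills', 'hot sex', 'sex pills']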
