Hi,
since so many have problems, I'm sharing my MySQL schemas :=)
`token` binary(5) NOT NULL,
Yes, the binary or varbinary is the key to a solution here.
Mucking with utf-8 vs latin-1 is just covering up, not solving,
the most glaring problem here, namely that a token must not be
associated with any character set, as it does not obey any
such rules, nor should it be treated case-insensitively
(as char is, which is possibly a reason for more than two
record changes as reported by Dave). Will take a closer look...
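
To illustrate the point about character sets and case, here is a minimal
sketch. The table and column names are hypothetical and not part of any
posted schema, and it assumes a MySQL version in which BINARY is a true
byte-string type and a case-insensitive collation such as
latin1_swedish_ci is available:

```sql
-- Hypothetical demo table, for illustration only.
CREATE TABLE token_demo (
  tok_char CHAR(5) CHARACTER SET latin1 COLLATE latin1_swedish_ci NOT NULL,
  tok_bin  BINARY(5) NOT NULL
);

INSERT INTO token_demo (tok_char, tok_bin) VALUES ('abcde', 'abcde');

-- Case-insensitive collation: 'ABCDE' is treated as equal to 'abcde',
-- so two different tokens can land on the same row.
SELECT COUNT(*) FROM token_demo WHERE tok_char = 'ABCDE';

-- Binary column: values are compared byte by byte,
-- so 'ABCDE' is a different token from 'abcde'.
SELECT COUNT(*) FROM token_demo WHERE tok_bin = 'ABCDE';
```

With a char column, one UPDATE keyed on a token may therefore touch
rows belonging to other tokens that differ only in case, which would fit
the extra record changes mentioned above.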
I changed the "Type=MyISAM" at the end of each CREATE statement in the
original schema and replaced it with the following from Benny's schema:
ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
It's now working, but is excruciatingly slow. Is this also just covering
the problem, or will this be a usable solution when it finally finishes?
Would it make a difference if I learned as MyISAM and then converted to
InnoDB after it finishes? I could train it using the original spam/ham,
but I fear it will be equally slow, and hand-scanning for a corpus again
is obviously a more difficult process.
Thanks,
Dave
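
For reference, converting an existing table between storage engines is
done with ALTER TABLE. A minimal sketch, using a hypothetical table name
(substitute each table from the actual schema). It only shows the
mechanics and says nothing about whether learning will be faster one way
or the other:

```sql
-- Hypothetical table name, for illustration only.
-- This rebuilds the table in the new engine, so it can take a while
-- on a large token table.
ALTER TABLE token_demo ENGINE=InnoDB;

-- Confirm which engine the table uses now.
SHOW TABLE STATUS LIKE 'token_demo';
```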