Hi,

Since so many have problems, I'll share my MySQL schemas :=)
    `token` binary(5) NOT NULL,

Yes, binary or varbinary is the key to a solution here.
Mucking with utf-8 vs latin-1 just covers up, but does not solve,
the most glaring problem here: a token must not be associated
with any character set, as it does not obey any such rules, nor
should it be compared case-insensitively (as a char column is
under a case-insensitive collation, which is possibly a reason for
more than two record changes as reported by Dave).
Will take a closer look...
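
To illustrate the case-insensitivity point, here is a minimal sketch.
The literal values are made up, and latin1_swedish_ci is assumed only
as an example of a case-insensitive collation:

```sql
-- Hypothetical values for illustration only; not taken from a real token table.

-- Character comparison under a case-insensitive collation:
-- the two strings are considered equal.
SELECT _latin1'abcde' COLLATE latin1_swedish_ci = _latin1'ABCDE' AS char_equal;   -- 1

-- Byte-wise comparison, as used for BINARY/VARBINARY values:
-- the two values differ.
SELECT CAST('abcde' AS BINARY) = CAST('ABCDE' AS BINARY) AS binary_equal;         -- 0
```

With a char key, two distinct tokens that differ only in letter case
could therefore land on the same row; with a binary key they cannot.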

I took the "Type=MyISAM" at the end of each CREATE statement in the original schema and replaced it with the following from Benny's schema:

ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

It's now working, but it is excruciatingly slow. Is this also just covering up the problem, or will this be a usable solution when it finally finishes?

Does it make a difference if I train with MyISAM and then convert to InnoDB after it finishes? I could retrain using the original spam/ham, but I fear it would be equally slow, and hand-scanning for a corpus again would obviously be a more difficult process.
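
For reference, converting after the fact would presumably be one statement per table, roughly as sketched below. The table name my_table is a placeholder, not a name from the schema, and the sketch says nothing about which route ends up faster:

```sql
-- Placeholder table name; substitute each real table from the schema.
-- Rebuilds the table in the InnoDB storage engine, keeping its rows.
ALTER TABLE my_table ENGINE=InnoDB;
```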

Thanks,
Dave
