Re: High Performance Bayes Database Configuration?

David F. Skoll Tue, 21 Jun 2011 12:18:01 -0700

On Tue, 21 Jun 2011 20:03:57 +0100
Dominic Benson <domi...@lenny.cus.org> wrote:


> To be fair to MySQL, these days it is pretty solid. There are
> potentially dangerous configuration options, but there are in
> Postgres too, and you can turn them off. Have you had a bad
> experience with a recent version?

No, not really, but MySQL is broken in so many ways I try to stay away
from it.  Many of the design flaws in http://sql-info.de/mysql/gotchas.html
remain unfixed.  For example, even in MySQL 5.0.5, 'select 1/0;'
returns NULL.  PostgreSQL more sensibly raises an exception.  And
while 5.0.5 no longer lets you insert '2003-02-31' into a DATE field,
the INSERT command does not fail.  A SELECT gives you back 0000-00-00.

Hence: I do not trust MySQL with my data.  (If an INSERT followed by a
SELECT does not give me back exactly what I inserted, then the INSERT
command *MUST FAIL* for me to trust the DB.)
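
As a rough illustration of that rule, here is a minimal Python sketch of an
insert-then-verify wrapper. It uses the standard-library sqlite3 module
purely as a stand-in so that it runs anywhere. The table `t`, the column
`d`, and the function name `insert_checked` are hypothetical; nothing like
this ships with MySQL or PostgreSQL.

```python
import sqlite3


def insert_checked(conn, value):
    """Insert value, read it back, and roll back if the stored value differs."""
    cur = conn.cursor()
    cur.execute("INSERT INTO t (d) VALUES (?)", (value,))
    rowid = cur.lastrowid
    stored = cur.execute(
        "SELECT d FROM t WHERE rowid = ?", (rowid,)
    ).fetchone()[0]
    if stored != value:
        # The database changed the data: treat the INSERT as failed.
        conn.rollback()
        raise ValueError(f"stored {stored!r}, expected {value!r}")
    conn.commit()
    return stored


# Hypothetical one-column table, in memory, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (d TEXT)")
result = insert_checked(conn, "2003-02-28")
```

Against a server that silently rewrites the value (as in the 0000-00-00
case above), the comparison would fail and the transaction would be rolled
back, which is the behaviour I would want from the INSERT itself.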

[...]

> In the absence of writes, even MyISAM won't cause locking problems;
> that said I can easily see that CDB would be faster. My question is
> why does the speed matter, rather than the overall capacity?

When you are scanning 5-10 million messages/day as some of our
installations do, speed matters.

> Surely the extra fraction of a millisecond is insignificant in the
> passage of the message.

Well, it's not "fractions of a millisecond".  For an email with a
couple of hundred tokens, it can be a couple of milliseconds.  When
1000 processes doing Bayes lookups are hitting the database all at the
same time, it can be more than just a couple of milliseconds.  And
that can be enough to require more concurrent scanning processes, more
memory, and so on.  It really does matter on busy systems.

[...]

> Very true. But with tiny datasets like these, it's all in memory
> anyway - and given the read-almost-entirely workload, SQL replication
> works rather well. Indeed, given how small and how well, it is
> reasonable to have a server-local replica just as you do with CDB.

True.  However, CDB is more suitable for simple key/value lookups than
a SQL database.  For this particular data set and workload, SQL makes
no sense.  Even TCP or UNIX-domain socket connections to a DB server
are overkill.
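
To make the "simple key/value lookup" point concrete, here is a minimal
Python sketch of an in-process, file-backed token lookup with no server
and no socket. Python's standard library has no CDB module, so dbm.dumb is
used only as a stand-in; the file name, the example tokens, the counts, and
the "spam ham" value layout are all made up for illustration and are not
SpamAssassin's actual format.

```python
import dbm.dumb
import os
import tempfile

# Hypothetical location for the token file.
path = os.path.join(tempfile.mkdtemp(), "bayes_tokens")

# Build step: write the whole data set once, the way a CDB file is
# generated offline and then distributed to each scanning host.
with dbm.dumb.open(path, "n") as db:
    db[b"viagra"] = b"120 1"    # made-up "spam_count ham_count" values
    db[b"meeting"] = b"3 95"


def lookup(db, token):
    """Return (spam_count, ham_count) for token, or None if it is unknown."""
    raw = db.get(token)
    if raw is None:
        return None
    spam, ham = raw.split()
    return int(spam), int(ham)


# Scan step: each scanner opens the file read-only and does plain
# key lookups in its own process.
reader = dbm.dumb.open(path, "r")
counts = lookup(reader, b"viagra")
```

The only operation the scanner needs is "give me the value for this key",
which is why a full SQL engine buys nothing for this workload.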

Regards,

David.
