Don't overrate Bayes. Don't focus solely on a bulletproof, highly
available, clustered or replicated database. If the Bayes database is
gone, only one check is gone! All the others are still there.

For my mail content, the real filtering power today comes from the
network checks such as URL blocklists, content checksums (Razor/DCC) and
open-relay blocklists. Focus on making these additional tests work.

For Bayes, use a central SQL database on one server that is shared by all
your MTAs, and keep it simple. Prepare a disaster recovery plan for the
database machine and for rebuilding an empty SA Bayes database. The
rebuild can be very fast. Don't back up the Bayes token data. You wrote
that you expect 500,000 messages per day (roughly 20,000 per hour). If
you use Bayes auto-learning, an empty central Bayes database is refilled
to a usable state from current messages in only a few hours. This is
probably faster than a cumbersome restore process.
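
As a rough sketch of what I mean, assuming MySQL as the SQL backend: the
host name, database name, user names and password below are placeholders I
made up, not values from your setup. The Bayes tables themselves have to be
created first from the schema file shipped in the sql/ directory of the
SpamAssassin distribution.

```
# local.cf, identical on every MTA (sketch; host, database and
# credentials are placeholders)
use_bayes                    1
bayes_store_module           Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn                DBI:mysql:sa_bayes:bayes-db.example.com:3306
bayes_sql_username           sa_user
bayes_sql_password           changeme

# Make all scanners share one token set instead of one per Unix user
# (the user name here is an assumption)
bayes_sql_override_username  mailscan

# Let an empty database refill itself from live traffic
bayes_auto_learn             1
```

By default Bayes only starts scoring once 200 ham and 200 spam messages
have been learned (bayes_min_ham_num and bayes_min_spam_num), which is why
a refill from live traffic does not take long at your volume. You can
watch the counters with "sa-learn --dump magic".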

Regards,
Alex