On Tue, Apr 20, 2010 at 1:37 PM, tsuraan <tsur...@gmail.com> wrote:
> The assets are binary files on a document tracking system.  Our
> current platform is postgres-backed; the entire system we've written
> is fairly easily distributed across multiple computers, but postgres
> isn't.  There are reliable databases that do scale out, but they tend
> to be a little on the pricey side...  Our current system works well in
> the tens to hundreds of millions of documents with hundreds of users,
> but we're hitting the billions of documents with thousands of users,
> so cassandra's scaling properties are pretty appealing there.

It seems to me you might get by with putting the actual assets into
Cassandra (possibly breaking them up into chunks, depending on how big
they are) and storing the pointers to them in Postgres along with all
the other metadata.  If it were me, I'd split each file into fixed-size
chunks, store each chunk under its SHA1 checksum, keep an ordered list
of the chunks that make up each file, and never delete a chunk.  Given
billions of documents, you may well end up with some savings from file
chunks that are identical.
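
Roughly what I have in mind, sketched in Python with plain dicts
standing in for the Cassandra chunk store and the Postgres chunk index
(the names and the 1 MiB chunk size are just placeholders):

    import hashlib

    CHUNK_SIZE = 1 << 20  # 1 MiB; the real chunk size is a tuning decision

    # Stand-ins for the two stores: a Cassandra column family keyed by
    # SHA1 (chunk_store) and a Postgres table mapping a document id to
    # its ordered chunk list (doc_index).
    chunk_store = {}
    doc_index = {}

    def store_document(doc_id, path):
        """Split a file into fixed-size chunks, store each chunk under
        its SHA1, and record the ordered list of hashes for the doc."""
        hashes = []
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                digest = hashlib.sha1(chunk).hexdigest()
                # Identical chunks hash to the same key, so each unique
                # chunk is stored only once.
                chunk_store.setdefault(digest, chunk)
                hashes.append(digest)
        doc_index[doc_id] = hashes

    def read_document(doc_id):
        """Reassemble a document by concatenating its chunks in order."""
        return b''.join(chunk_store[h] for h in doc_index[doc_id])

Since chunks are keyed by content and never deleted, two documents that
happen to share a chunk automatically share the storage for it.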

You could partition the Postgres tables and replicate the data to a
handful of read-only nodes that could handle quite a bit of the read
load.  How well that pans out as a scalability option probably depends
on your write frequency.
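
On the application side, the read/write split can be as simple as
handing out different connections, e.g. (assuming psycopg2 and made-up
connection strings):

    import itertools
    import psycopg2

    # Hypothetical connection strings: one primary for writes, a couple
    # of read-only replicas for metadata lookups.
    PRIMARY_DSN = "dbname=docs host=pg-primary"
    REPLICA_DSNS = ["dbname=docs host=pg-replica1",
                    "dbname=docs host=pg-replica2"]
    _replicas = itertools.cycle(REPLICA_DSNS)

    def write_conn():
        # All inserts/updates of document metadata go to the primary.
        return psycopg2.connect(PRIMARY_DSN)

    def read_conn():
        # Spread read-only metadata queries over the replicas round-robin.
        return psycopg2.connect(next(_replicas))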
