> It seems to me you might get by with putting the actual assets into
> cassandra (possibly breaking them up into chunks depending on how big
> they are) and storing the pointers to them in Postgres along with all
> the other metadata.  If it were me, I'd split each file into a fixed
> chunksize and store it using its SHA1 checksum, and keep an ordered
> list of chunks that make up a file, then never delete a chunk.  Given
> billions of documents you just may end up with some savings due to
> file chunks that are identical.
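
If I follow the chunking idea, it would look roughly like this (a quick
Python sketch; the 4 MB chunk size is an arbitrary placeholder and the
dict is just standing in for the chunk store):

    import hashlib

    CHUNK_SIZE = 4 * 1024 * 1024  # arbitrary; whatever suits the access pattern

    def chunk_file(path):
        """Split a file into fixed-size chunks and return an ordered
        manifest of SHA1 digests plus the chunk bytes keyed by digest."""
        manifest = []   # ordered list of chunk digests for this file
        chunks = {}     # digest -> bytes; identical chunks dedupe here
        with open(path, 'rb') as f:
            while True:
                data = f.read(CHUNK_SIZE)
                if not data:
                    break
                digest = hashlib.sha1(data).hexdigest()
                manifest.append(digest)
                chunks[digest] = data  # stand-in for a write to Cassandra
        return manifest, chunks

The ordered manifest would live in Postgres next to the rest of the
metadata, and each chunk would be written to Cassandra under its digest
and never deleted, as you suggest.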

Retrieval of documents is pretty key (people like getting their files
back), so right now we store them on disk and use our HTTP server's
static file serving to send them out.  I'm not sure what the best way
to serve files stored in Cassandra would be, but the replication it
gives you for free is interesting.  Is Cassandra a sane way to store
huge amounts (many TB) of raw data?  I saw on the limitations page
that people are storing files in Cassandra, but is it considered a
good idea?

> You could partition the postgres tables and replicate the data to a
> handful of read-only nodes that could handle quite a bit of the work.
> I suppose it depends on your write-frequency how that might pan out as
> a scalability option.

Our system is pretty write-heavy; we currently handle a bit under a
million files a day (which works out to roughly 5x that many DB
records stored), and we're aiming for a few million files per day.
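
Back of the envelope (assuming writes were spread evenly over the day,
which they aren't quite):

    SECONDS_PER_DAY = 86400.0
    files_per_sec_now    = 1e6 / SECONDS_PER_DAY      # ~12 files/sec
    records_per_sec_now  = 5e6 / SECONDS_PER_DAY      # ~58 records/sec
    # at, say, 3M files/day (my own guess at "a few million"):
    records_per_sec_goal = 3 * 5e6 / SECONDS_PER_DAY  # ~175 records/sec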

Here's a quick question that should be answerable:  If I have a CF
with SuperColumns, and one of the SuperColumns holds, as its subcolumn
names, the users allowed to see an asset, is it guaranteed to be safe
to add new subcolumns to that SuperColumn?  I noticed that each column
has its own timestamp, so it doesn't look like I actually need to
rewrite the full row (which would introduce overwriting race-condition
concerns).  It looks like I can just use batch_mutate to add the keys
I want to the permissions SuperColumn.  Is that correct, and would it
avoid races?
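
For concreteness, what I have in mind is roughly this (using the
generated Thrift bindings; 'Assets' is a made-up CF name, and the
exact module path / batch_mutate signature varies by version):

    import time
    from cassandra.ttypes import (Column, SuperColumn, ColumnOrSuperColumn,
                                  Mutation, ConsistencyLevel)

    def grant_access(client, asset_key, user_id):
        # One new subcolumn; nothing else in the row is touched.
        col = Column(name=user_id, value='', timestamp=int(time.time() * 1e6))
        mutation = Mutation(column_or_supercolumn=ColumnOrSuperColumn(
            super_column=SuperColumn(name='permissions', columns=[col])))
        # mutation_map is {row key: {CF name: [mutations]}}; 0.6 also takes
        # the keyspace as the first argument to batch_mutate.
        client.batch_mutate({asset_key: {'Assets': [mutation]}},
                            ConsistencyLevel.QUORUM)

My understanding is that only the subcolumns listed in the mutation get
written, so two clients adding different users shouldn't clobber each
other -- that's the part I want to confirm.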
