> It seems to me you might get by with putting the actual assets into
> cassandra (possibly breaking them up into chunks depending on how big
> they are) and storing the pointers to them in Postgres along with all
> the other metadata. If it were me, I'd split each file into a fixed
> chunksize and store it using its SHA1 checksum, and keep an ordered
> list of chunks that make up a file, then never delete a chunk. Given
> billions of documents you just may end up with some savings due to
> file chunks that are identical.
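Just to make sure I follow the scheme you're describing, here's a rough sketch of it (the 4 MB chunk size is arbitrary, and a plain dict stands in for the chunk store; the ordered chunk list would live in Postgres next to the rest of the metadata):

import hashlib

CHUNK_SIZE = 4 * 1024 * 1024   # made-up fixed chunk size; tune to taste

def store_file(path, chunk_store):
    """Split a file into fixed-size chunks, write each chunk under its
    SHA1, and return the ordered list of chunk hashes for the file."""
    manifest = []
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            digest = hashlib.sha1(chunk).hexdigest()
            # content-addressed and write-once: an identical chunk hashes to
            # the same key, so duplicates cost nothing and nothing is deleted
            if digest not in chunk_store:
                chunk_store[digest] = chunk
            manifest.append(digest)
    return manifest

def read_file(manifest, chunk_store):
    """Reassemble the original bytes from the ordered chunk list."""
    return b''.join(chunk_store[digest] for digest in manifest)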
The retrieval of documents is pretty key (people like getting their files), so we store them on disk and use our http server's static file serving to send them out. I'm not sure what the best way to serve files stored in cassandra would be, but the free replication is interesting. Is cassandra a sane way to store huge amounts (many TB) of raw data? I saw on the limitations page that people are using cassandra to store files, but is it actually considered a good idea?

> You could partition the postgres tables and replicate the data to a
> handful of read-only nodes that could handle quite a bit of the work.
> I suppose it depends on your write-frequency how that might pan out as
> a scalability option.

Our system is pretty write-heavy; we currently take in a bit under a million files a day (which translates to about 5x that many db records), but we're going for a few million per day.

Here's a quick question that should be answerable: if I have a CF with SuperColumns, where one of the SuperColumns has keys naming the users allowed to see an asset, is it guaranteed to be safe to add keys to that SuperColumn? I noticed that each column has its own timestamp, so it doesn't look like I actually need to rewrite the full row (which would introduce overwriting race-condition concerns). It looks like I can just use batch_mutate to add the keys I want to the permissions SuperColumn. Is that correct, and would it avoid races?
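In case it helps, here's roughly what I have in mind (a sketch against the Thrift-generated Python bindings; the module path and the exact batch_mutate signature differ between Cassandra versions, and 'Assets', 'permissions', and the consistency level are just placeholders from my setup):

import time
# Thrift-generated bindings; layout and signatures vary by Cassandra
# version, so treat this as a sketch only
from cassandra.ttypes import (Column, SuperColumn, ColumnOrSuperColumn,
                              Mutation, ConsistencyLevel)

def grant_access(client, asset_key, user_ids):
    """Add one column per user to the 'permissions' SuperColumn of a
    single asset row, without rewriting the columns already there."""
    ts = int(time.time() * 1e6)   # microsecond timestamp for this write
    cols = [Column(name=uid, value='', timestamp=ts) for uid in user_ids]
    mutation = Mutation(column_or_supercolumn=ColumnOrSuperColumn(
        super_column=SuperColumn(name='permissions', columns=cols)))
    # mutation map is {row_key: {column_family: [mutations]}};
    # 'Assets' is a placeholder CF name
    client.batch_mutate({asset_key: {'Assets': [mutation]}},
                        ConsistencyLevel.QUORUM)

That is, the mutation only names the columns being added, not the whole row, which is why I'm hoping it sidesteps the overwrite problem.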