> - basically the task is to create a distributed file system with only few 
> allowed operations
Background: http://www.datastax.com/dev/blog/cassandra-file-system-design

> - it should handle split brain conditions well (two or three DCs, it's 
> possible to get user requests while the intra-DC connections are down)
Handling entity creation in disconnected DCs can be a little tricky. If it's 
possible for multiple entities with the same "natural key" / "path" to be 
created, you can either accept last write wins or handle it at the application 
level. To handle it in the application, you need a way for the code in each DC 
to write to a different place / column, and then resolve the differences in the 
application at read time. 
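
A rough sketch of the "one column per DC, resolve at read time" idea, as a 
plain in-memory Python model rather than driver code. The names (Version, 
write, resolve, "dc1", "dc2") and the resolution policy (newest timestamp wins, 
ties broken by DC name) are assumptions made for illustration; your 
application may want to merge the versions instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    dc: str          # data centre that wrote this version (hypothetical label)
    timestamp: int   # client-supplied write time
    payload: str     # whatever the entity holds


def write(row: dict, version: Version) -> None:
    """Each DC writes only to its own column, so DCs never overwrite each other."""
    current = row.get(version.dc)
    # Within a single column, keep the newest write.
    if current is None or version.timestamp >= current.timestamp:
        row[version.dc] = version


def resolve(row: dict) -> Version:
    """Read-time resolution: pick a winner among the per-DC versions.

    Assumed policy: newest timestamp, ties broken by DC name.
    """
    return max(row.values(), key=lambda v: (v.timestamp, v.dc))


# Two DCs create the same path while the link between them is down.
row = {}
write(row, Version(dc="dc1", timestamp=100, payload="created in dc1"))
write(row, Version(dc="dc2", timestamp=105, payload="created in dc2"))

# Once the DCs can talk again, a reader sees both columns and resolves them.
winner = resolve(row)
print(winner.dc, winner.payload)
```

Both versions stay in the row, so nothing is silently lost and the policy 
can be changed later without rewriting data. 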

> All (minus the last) should be close to ACID.
Nope. 
We have Atomicity, Isolation and Durability. Not going to get Consistency in 
the ACID sense. See the first presentation here 
http://www.datastax.com/events/cassandrasummit2012/presentations

> This brought me the following schema:
Take a look at the one linked at the top. 
You probably want to disconnect the file contents from the directory listing. 
And if listing directory contents is a hot code path I would consider 
supporting that by having a directory CF. 
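
To make the split concrete, here is a small in-memory Python model, not 
Cassandra code: one structure holds file contents keyed by a file id, the 
other holds one "row" per directory. All names (file_data, dir_listing, 
put_file, list_dir, move_file) and the use of a separate file id are 
assumptions for the sketch.

```python
# file_data:   file id -> list of chunks (contents; a move never touches this)
# dir_listing: (owner, directory) -> {file name: file id}, one "row" per directory
file_data = {}
dir_listing = {}


def put_file(owner, directory, name, file_id, chunks):
    """Store the contents once and add an entry to the directory row."""
    file_data[file_id] = list(chunks)
    dir_listing.setdefault((owner, directory), {})[name] = file_id


def list_dir(owner, directory):
    """Listing reads a single directory row, not every file."""
    return sorted(dir_listing.get((owner, directory), {}))


def move_file(owner, name, src, dst):
    """A move only rewrites listing entries; the contents stay where they are."""
    file_id = dir_listing[(owner, src)][name]
    # Add the new entry first, then drop the old one. If something fails in
    # between, the file shows up in both directories rather than in neither.
    dir_listing.setdefault((owner, dst), {})[name] = file_id
    del dir_listing[(owner, src)][name]


put_file("alice", "/inbox", "a.txt", "id-1", [b"hello"])
put_file("alice", "/inbox", "b.txt", "id-2", [b"world"])
move_file("alice", "a.txt", "/inbox", "/archive")
print(list_dir("alice", "/inbox"), list_dir("alice", "/archive"))
```

In this sketch the move is two separate listing writes, so the order matters; 
the contents are never copied. 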
 
> I could add more CF(s) for example with composite columns (each directory 
> being a row), but that would destroy the benefit of the above schema for 
> moving files in one (atomic and idempotent) operation. 
If you store the directory listing separately from the file contents, moving a 
file is (essentially) a metadata operation. 
Any write is atomic in the sense that either all changes commit or none do. And 
writes other than counter increments are idempotent. 
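
A toy Python model of why a retried column write is harmless while a retried 
counter increment is not. This is only an illustration of the last-write-wins 
idea with client-supplied timestamps, not Cassandra's actual reconciliation 
code; apply_write and apply_increment are made-up names.

```python
def apply_write(row, column, value, timestamp):
    """Column write: the value with the highest timestamp wins."""
    current = row.get(column)
    if current is None or timestamp > current[1]:
        row[column] = (value, timestamp)


def apply_increment(counters, column, delta):
    """Counter increment: every delivery changes the result."""
    counters[column] = counters.get(column, 0) + delta


row = {}
apply_write(row, "dir", "/new", 200)
apply_write(row, "dir", "/new", 200)   # retry of the same write: no change
apply_write(row, "dir", "/old", 100)   # late arrival of an older write: ignored

counters = {}
apply_increment(counters, "moves", 1)
apply_increment(counters, "moves", 1)  # retry of the same increment: counted twice

print(row, counters)
```

So a client that times out can safely resend a normal write, but should think 
twice before resending an increment. 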

For what it's worth, Riak has an out of the box S3 compatible service 
http://basho.com/products/riakcs/. Not sure what the pricing is. 

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 7/01/2013, at 4:46 AM, Attila Nagy <b...@fsn.hu> wrote:

> Hi,
> 
> I'm new to Cassandra, so I wonder what would be the most efficient(*) schema 
> for specific operations on the following dataset:
> - basically the task is to create a distributed file system with only few 
> allowed operations
> - it should handle split brain conditions well (two or three DCs, it's 
> possible to get user requests while the intra-DC connections are down)
> - each data file is a blob (pretty often, but not always 8 bit text, sometimes 
> without the encoding known), ranging from a few 100 bytes to around 20-30 MB. 
> Average size is 350 kB.
> - it's well compressible (gzip -1 gives 1.63x compress ratio on plain files)
> - files are mostly immutable
> - for the few, which are not, they are append-only
> - each file has a unique file name (size in the range of 70-100 bytes) and 
> it's preserved during its lifetime (if it changes, a rewrite is OK).
> - the files have metadata attached (various, 1.2's sets and maps are a good 
> fit here, but even simple columns should do)
> - the files are organized into directories (multi level, and sometimes there 
> can be up to some millions of files in a dir, but more likely are the range 
> of 0-some hundred, thousands (up to 10k))
> - directories also have metadata (most notably an mtime, which changes when 
> directory contents are changed, that can be used to cache directory lists)
> - each directory (and the files therein) belongs to a user(name)
> - Cassandra 1.2 is fine
> 
> Given that designing schema for Cassandra begins with listing the operations, 
> here they are:
> 1. get the contents of a directory (input: directory (owner) name, 
> output:file names)
> 2. get a file (in: file (dir) name, out:file metadata and contents)
> 3. put a file (in: file (dir, owner) name, metadata, data)
> 4. append to a file, chunk size maximum is 4-8 kB (in: file (dir, owner) 
> name, data)
> 5. move a file to a different directory (in: file, dir (owner) name, target 
> dir)
> 6. remove a file (in: file (dir, owner) name)
> 7. remove all stuff for a user (input: user name), but it's "rare" (compared 
> to the above), so walking through the dirlist on the client side is OK, it's 
> not performance critical
> 
> All (minus the last) should be close to ACID.
> 
> I've tried to do the homework (taken a look at Cassandra in the 0.6-7 times, 
> so now I'm re-learning the new (1.2, CQL 3) way) and still couldn't find the 
> best way.
> 
> I thought the best would be to start with the docs, without any preliminary 
> performance testing.
> 
> This brought me the following schema:
> CREATE TABLE file (
>   name varchar PRIMARY KEY,
>   owner varchar,
>   dir varchar,
>   flags set<ascii>,
>   fstat map<ascii, int>,
>   data list<blob>
> );
> CREATE INDEX file_dir ON file (dir);
> CREATE INDEX file_owner ON file (owner);
> 
> Which gives me for the operations:
> 1. something like SELECT name(,etc) FROM file WHERE dir='dirname'; Which can 
> be LIMIT-ed. Problems: SLOOOW (and maybe despite the LIMIT, it materializes 
> in the coordinator's memory, I don't know), also, doesn't scale, because all 
> nodes must inspect their index CFs.
> 2. a simple SELECT data(,etc) FROM file WHERE name='filename'; Problems: 
> Cassandra is said not to be good at storing such amounts of data, it has to 
> read all in memory (on the coordinator and the replica node), also the client 
> will have to hold it in memory. But it seems to be acceptable to some levels, 
> because all data is needed, so fetching it once is an optimization. The 
> limiting factor here seems to be the network speed (nodes pass the data as a 
> hot potato, the slower the network, the longer it has to be kept in memory), 
> and CPU speed.
> 3. a simple INSERT INTO or UPDATE
> 4. 1.2's lists make appends easy
> 5. most important. It's a matter of UPDATE file SET dir = 'newdir' 
> WHERE name = 'filename'; This either happens, or not, there is no situation 
> where the file is in multiple directories, or nowhere. Even if a site (node) 
> doesn't get it, it still sees the file on the old location, and if there is a 
> move there, eventually everything will get into the right shape, without 
> worrying needed on the client side.
> 6. a simple DELETE
> 7. SELECT and DELETEs
> 
> For most operations, it seems fine (but your insightful recommendations are 
> welcome :), the biggest pain seems to be listing directories.
> It takes from seconds to minutes (or times out).
> I could add more CF(s) for example with composite columns (each directory 
> being a row), but that would destroy the benefit of the above schema for 
> moving files in one (atomic and idempotent) operation. 
> 
> Is there a schema, which can do best for all operations and still maintain 
> ACID-like properties?
> 
> Thanks,
> 
> *: in terms of query efficiency, closeness to ACID
