...or perhaps vice versa: how would I tweak a model to suit Cassandra?

I have in mind data that could be _almost_ shoehorned into the (S)CF structure, and I'd love to hammer this nail with something hadoopy, but I have a niggling suspicion I'm setting myself up for frustration.

I have

* a relatively small set of primary tags (dozens or hundreds per cluster)
* under each primary tag a large number (on the scale of 1E6) of arbitrary length hierarchical paths ("foo/bar/xyzzy", typically consisting of descriptive labels, usually totaling 20-40 chars)
* under each path an arbitrary number (usually a few or a few dozen, but in some systematic cases ~1000) of leaf tags (typically descriptive labels, say 4-16 chars in length)
* under each leaf tag a value (arbitrary; string, number, perhaps binary)

On the surface, it would seem that the primary tag would correspond well with Supercolumn keys, the intermediate path with ColumnFamily names, and the final key-value-pairs with Columns. Any warning bells here? (Seems like I could also use the primary tag as a Keyspace name, but I seem to recall some warnings about using excessive keyspaces.)

The gist is that each and every leaf tag, across the whole data set, receives a value every few seconds, indefinitely, and history must be preserved. In practice, all Columns in the ColumnFamily receive a value at the same time. 100k ColumnFamily updates a second would be routine. Nodes would be added whenever storage or per-node I/O became an issue.

A query is a much more rare occurrence, and would nearly always involve retrieving the full contents of a ColumnFamily over some time range (usually thousands of snapshots, not at all rarely millions).

Just by browsing online documentation I can't resolve whether this timestamping could work in conjunction with Cassandra's internal native timestamping as is. Is it possible - out of the box, or with minor coding - to retain history, and to incorporate time ranges into queries?
If so, does the distributed storage cope with the accumulating data of a single ColumnFamily flowing over to new nodes? Or: should I twist the whole thing around and incorporate my timestamp into the ColumnFamily identifier to enjoy automatic scaling? Would the sheer number of resulting identifiers become a performance issue?

Thanks for your comments;

//e
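
P.S. To make the "twist it around" option concrete, here is a rough Python sketch of what I mean by folding the timestamp into the identifier. It does not touch any Cassandra API. The hourly bucket size, the ':' separator and the function names are placeholders I made up for illustration, not anything Cassandra prescribes.

```python
from datetime import datetime, timezone

# Assumption: one bucket per hour. The right size would depend on
# update rate and typical query ranges; this value is only illustrative.
BUCKET_SECONDS = 3600


def bucketed_id(primary_tag: str, path: str, ts: datetime) -> str:
    """Build a hypothetical identifier '<tag>:<path>:<bucket start epoch>'."""
    epoch = int(ts.timestamp())
    bucket_start = epoch - epoch % BUCKET_SECONDS
    return f"{primary_tag}:{path}:{bucket_start}"


def ids_for_range(primary_tag: str, path: str,
                  start: datetime, end: datetime) -> list[str]:
    """List every bucketed identifier a query over [start, end] would read."""
    first = int(start.timestamp()) // BUCKET_SECONDS
    last = int(end.timestamp()) // BUCKET_SECONDS
    return [
        f"{primary_tag}:{path}:{bucket * BUCKET_SECONDS}"
        for bucket in range(first, last + 1)
    ]


if __name__ == "__main__":
    t0 = datetime(2010, 1, 1, 12, 34, 56, tzinfo=timezone.utc)
    t1 = datetime(2010, 1, 1, 14, 0, 0, tzinfo=timezone.utc)
    print(bucketed_id("site1", "foo/bar/xyzzy", t0))
    for identifier in ids_for_range("site1", "foo/bar/xyzzy", t0, t1):
        print(identifier)
```

With hourly buckets each path would produce 24 * 365 = 8760 identifiers per year, so 1E6 paths under one primary tag would come to roughly 8.76E9 identifiers per tag per year. That count is what my last question is about.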