Re: Recommandation on how to organize CF

aaron morton Sun, 29 May 2011 04:21:36 -0700

I often suggest people think about using something like JSON for data that 
looks relatively unchanging, or that is always worked on as a single entity, 
for a couple of reasons.


1. Cassandra does not need to know about every atomic piece of data in your 
model. Obviously there are some good application reasons to store things in 
columns, such as TTL, slice ranges, etc. Blobbing data was generally a bad 
thing to do in an RDBMS, but IMHO it's a valid option in Cassandra. 
2. For every column value you store in Cassandra you also store the column 
name, timestamp and some other bytes. This is the price you pay for a 
schema-free DB. So there can be unexpected storage (and network) bloat if you 
are storing lots of small values in lots of columns. Whether you consider 
this expensive has to do with how much you like running ALTER TABLE statements.
3. IMHO there is little difference in the code you write to detect that a 
Cassandra row or a JSON dict does not contain a column because it was created 
before the last code release. Adding attributes to your entity is still a 
code-only change, and you only need to update old data if your business 
problem requires it.
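
To make point 3 concrete, here is a minimal Python sketch of reading a JSON 
blob that was written before an attribute existed. It is only an illustration: 
no Cassandra client is involved, the "sender" and "rawtext" names come from 
this thread, and the "priority" attribute and the MESSAGE_DEFAULTS table are 
hypothetical.

```python
import json

# Hypothetical defaults for attributes added in later code releases.
# "priority" stands in for a field that old rows were written without.
MESSAGE_DEFAULTS = {"sender": None, "rawtext": "", "priority": "normal"}


def serialize_message(message):
    """Pack the whole entity into a single column value."""
    return json.dumps(message, sort_keys=True)


def deserialize_message(blob):
    """Unpack a stored blob, filling in attributes that older rows lack."""
    message = dict(MESSAGE_DEFAULTS)
    message.update(json.loads(blob))
    return message


# A row written by an older release, before "priority" existed.
old_row = serialize_message({"sender": "openvictor", "rawtext": "hello"})

message = deserialize_message(old_row)
print(message["priority"])  # normal
```

Old rows are left untouched; they only get rewritten if and when the 
application saves the entity again.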

There are also a number of reasons not to do it:

1. It does not pass your smell test. 
2. You have multiple agents updating the entity with no-look writes (writes 
that do not read the entity first).
3. You want to pull back parts of the entity, do slices, use TTL, secondary 
indexes, etc. 
4. You work cross-platform, use Brisk/Hadoop, use Hive/Pig and it's a pain for 
everyone. 
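
Reason 2 is the one that tends to bite, so here is a small Python sketch of 
the lost update it describes. The two dicts are in-memory stand-ins for a row, 
not a Cassandra client, and the "read_by" attribute is hypothetical.

```python
import json

# In-memory stand-ins for one row in each model; NOT a Cassandra client.
blob_row = {"data": json.dumps({"sender": "openvictor", "rawtext": "hi"})}
column_row = {"sender": "openvictor", "rawtext": "hi"}

# Blob model: both agents start from the same stored value...
agent_a_view = json.loads(blob_row["data"])
agent_b_view = json.loads(blob_row["data"])

# ...each changes a different attribute and writes the whole entity back.
agent_a_view["rawtext"] = "hi, edited"
agent_b_view["read_by"] = "someone"
blob_row["data"] = json.dumps(agent_a_view)
blob_row["data"] = json.dumps(agent_b_view)  # silently drops agent A's edit

# Column model: each agent writes only the column it cares about.
column_row["rawtext"] = "hi, edited"
column_row["read_by"] = "someone"

print(json.loads(blob_row["data"])["rawtext"])  # hi
print(column_row["rawtext"])                    # hi, edited
```

With separate columns both changes survive, because neither agent has to 
rewrite data it did not touch.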

I agree it's not for every situation and it probably makes sense to start 
coding without it to begin with. But I think it is worth considering in some 
cases. 

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 26 May 2011, at 02:57, openvictor Open wrote:

> Thanks Aaron,
> 
> Sorry I didn't see your message sooner.
> 
> So the CF Messages, using UTF8Type, holds information such as who has the 
> right to read, whether it is possible to reply to this list, etc. There are 
> two "kinds" of keys: those which begin with "message:uuid" and those which 
> begin with "messagelist:uuid". A column of message:uuid is, for example, 
> "sender" or "rawtext". A column of messagelist:uuid is, for example, 
> "creator" or "participants".
> 
> 
> MessagesTime (message_time) is the sorting mechanism, meaning when I request 
> against message_time I get messages or messagelists in the order they were 
> sent. There are 2 kinds of keys:
> "messagebox:someone": each column has for its Value the uuid of a list 
> inside the messagebox of someone, and for its Name the uuid of the last 
> message in the corresponding messagelist. It gives me a sorting mechanism 
> based on the last message received.
> "messagelist:uuid": each column has for its Name the UUID of a message, and 
> for its Value anything at all; the value doesn't matter.
> 
> About your suggestion, it is a very good solution, but there is one thing I 
> don't really like about serialization: it "blocks" evolution. Let's say I 
> would like to add one field to a message: I am obliged to make a tool to 
> deserialize, add the information, reserialize all the fields and insert. 
> Even if I serialize with JSON it looks like evolution (that is why I chose 
> Cassandra) is a little bit broken. If I am wrong, please tell me so. 
> However I will explore this very interesting possibility for another project 
> with "tags", which is not really subject to dramatic evolutions.
> 
> At the moment I don't really have complaints about speed, since it is not 
> really time critical (after all, who cares if the messagebox loads in 250 ms 
> instead of 200 ms). At the moment I get the messages with two batch Cassandra 
> calls, so I think this is satisfactory.
> 
> Thanks again, the json serialization looks like a very interesting 
> possibility.
> 
> Victor
> 
> 2011/5/19 aaron morton <aa...@thelastpickle.com>
> I'm a bit confused by your examples. I think you are saying...
> 
> - Standard CF called Message using the UTF8Type for column comparisons, used 
> to store the individual messages. Row key is the message UUID. Not sure what 
> the columns are.
> - Standard CF called MessageTime using TimeUUIDType for column comparisons, 
> used to store collections of messages. Row key is 
> "messagelist:<message_list_uuid>" for a message list, and 
> "messagebox:<user_name>:<mbox_name>" for message box. Not sure what the 
> columns are.
> 
> The best model is going to be the one that supports your read requests and 
> the volume of data you are expecting.
> 
> One way to go is to denormalise to support very fast read paths. You could 
> store the entire message in one column using something like JSON to serialise 
> it. Then:
> 
> - MessageIndexes standard CF to store the full messages in context, there are 
> three different types of rows:
>        * keys with <user_name> store all messages for a user, column name is 
> the message TimeUUID and value is the message structure
>        * keys with <user_name>/<mbox_name> store the messages for a single 
> message box. Columns same as above.
>        * keys with <user_name>/<mbox_name>/<mlist_name> store the messages in 
> a single message list. Columns as above.
> 
> - MessageFolders CF to store the message box and message lists, two 
> approaches:
>        1) <user_name> as key and each column is a message box, message lists 
> are stored in a single column as JSON
>        2) <user_name> row for the top level message box, column for each 
> message box. <user_name>/<message_box> for the next level,
> 
> Or if space is a concern just store the UUID of the message in the index CF 
> and add a CF to store the messages.
> 
> It is also going to depend on the management features, e.g. can you rename a 
> message box / list? Move messages around? If so the denormalised pattern 
> may not be the best as those operations will take longer.
> 
> Hope that helps.
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 19 May 2011, at 05:44, openvictor Open wrote:
> 
> > Hello all,
> >
> > I know organization is a broad topic and everybody may have an idea on how 
> > to do it, but I really want to have some advice and opinions and I think 
> > it could be interesting to discuss this matter.
> >
> > Here is my problem: I am designing a messaging system internal to a 
> > website. There are 3 big structures, which are Message, MessageList and 
> > MessageBox. A message/messagelist is identified only by a UUID; a 
> > MessageBox is identified by a name (utf8 string). A messagebox has a set 
> > of MessageLists in it and a messagelist has a set of messages in it, all 
> > of them being UUIDs.
> > Currently I have only two CFs: message and message_time. Message is a 
> > UTF8Type (cassandra 0.6.11, soon going for 0.8) and message_time is a 
> > TimeUUIDType.
> >
> > For example, if I want to request all messages in a certain messagelist I 
> > do: message_time['messagelist:uuid(messagelist)']
> > If I want information about a message I do message['message:uuid(message)']
> > If I want all messagelists for a certain messagebox (called nameofbox for 
> > user openvictor for this example) I do: 
> > message_time['messagebox:openvictor:nameofbox']
> >
> > My question to Cassandra users is: is it a good idea to regroup all those 
> > things into two CFs? Are there advantages / drawbacks to these two CFs, 
> > and for the long term should I change my organization?
> >
> > Thank you,
> > Victor
> 
>
