In response to an offline question…

There are two usage patterns for Cassandra column families: static and dynamic. 
With both approaches you store objects of a given type in a column family.

With static usage, the object type you're persisting has a single key and each 
row in the column family maps to a single object.  The value of an object's key 
is stored in the row key, and each of the object's properties is stored in a 
column whose name is the property's name and whose value is the property's 
value.  A row has as many columns as the object has non-null property values. 
This usage is very much like traditional relational database usage.
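To make the static layout concrete, here is a minimal sketch in plain Java (no Cassandra client; the row key, property names, and values are all made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StaticRowSketch {
    public static void main(String[] args) {
        // Hypothetical object: a user identified by the key "jsmith".
        // Row key = the object's key; one column per non-null property.
        String rowKey = "jsmith";
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("fullName", "Jane Smith");   // column name = property name
        columns.put("email", "jane@example.com");
        columns.put("city", "Austin");           // column value = property value

        // One row maps to exactly one object, much like a relational table row.
        System.out.println(rowKey + " -> " + columns);
    }
}
```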

With dynamic usage, the object type to be persisted has two keys (I'll get to 
composite keys in a bit).  With this approach the value of an object's primary 
key is stored as the row key, and the entire object is stored in a single 
column whose name is the value of the object's secondary key and whose value is 
the entire object (serialized into a ByteBuffer).  This can result in many 
objects being persisted in a single row: all of those objects share the same 
primary key, and there are as many columns as there are objects with that 
primary key.  An example of this approach is a time series column family in 
which each row holds weather readings for a different city and each column in a 
row holds all of the weather observations for that city at a certain time.  The 
timestamp is used as the column name, and an object holding all the 
observations is serialized and stored in the corresponding column value.
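Sketched the same way (plain Java again, with a trivial stand-in for real object serialization; the city names, timestamps, and readings are invented):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

public class DynamicRowSketch {
    public static void main(String[] args) {
        // Row key = primary key (city); column name = secondary key (timestamp);
        // column value = the whole observations object, serialized.
        Map<String, TreeMap<Long, ByteBuffer>> cf = new TreeMap<>();

        TreeMap<Long, ByteBuffer> austinRow = new TreeMap<>();
        austinRow.put(1329490800000L, serialize("temp=18C,humidity=40%"));
        austinRow.put(1329494400000L, serialize("temp=21C,humidity=35%"));
        cf.put("Austin", austinRow);

        // Many objects live in one row, and columns stay sorted by timestamp,
        // which is what makes time-slice reads cheap.
        System.out.println(cf.get("Austin").size() + " observations for Austin");
    }

    // Stand-in for real serialization (Thrift, Avro, Java serialization, ...).
    static ByteBuffer serialize(String observations) {
        return ByteBuffer.wrap(observations.getBytes(StandardCharsets.UTF_8));
    }
}
```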

Cassandra is a powerful database in general, but where it really excels 
performance-wise is reading and writing time series data stored in a dynamic 
column family.

There are variations on the above patterns.  For example, you can use composite 
types to define a row key or column name that is made up of the values of 
multiple keys.
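A composite column name can be modeled as an ordered tuple of components, compared component by component. This sketch imitates that behavior in plain Java with a hypothetical (timestamp, stationId) name; it is not Cassandra's CompositeType, just an illustration of the ordering:

```java
import java.util.TreeMap;

public class CompositeNameSketch {
    // Hypothetical composite column name: (timestamp, stationId).
    // Comparison is component by component, like a composite comparator.
    record CompositeName(long timestamp, String stationId)
            implements Comparable<CompositeName> {
        public int compareTo(CompositeName o) {
            int c = Long.compare(timestamp, o.timestamp);
            return c != 0 ? c : stationId.compareTo(o.stationId);
        }
    }

    public static void main(String[] args) {
        TreeMap<CompositeName, String> row = new TreeMap<>();
        row.put(new CompositeName(1000L, "stn-2"), "reading A");
        row.put(new CompositeName(1000L, "stn-1"), "reading B");
        row.put(new CompositeName(2000L, "stn-1"), "reading C");
        // Columns sort first by timestamp, then by station id.
        System.out.println(row.firstKey());
    }
}
```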

I recently gave a presentation on Cassandra patterns to the Austin Cassandra 
Meetup.  You can find my charts in the meetup archives or posted on my LinkedIn 
page below… or contact me offline.

To bring this back to the original question: asking for the ability to apply a 
Java method to selected rows makes sense for static column families, but I 
think the more general need is to be able to apply a Java method to selected 
persisted objects in a column family, regardless of static or dynamic usage.  
While I'm on my soapbox, I think this requirement applies to Pig support as 
well.
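The shape of that requested capability might look something like the sketch below. To be clear, nothing like this exists in Cassandra's API; the interface and names are entirely hypothetical:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ObjectMapSketch {
    // Hypothetical: apply a Java method to each selected persisted object,
    // regardless of whether the objects came from a static or dynamic CF.
    static <T, R> List<R> applyToSelected(List<T> selectedObjects,
                                          Function<T, R> method) {
        return selectedObjects.stream().map(method).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // In the real feature the objects would be deserialized from columns;
        // here they are just strings.
        List<String> cities = List.of("Austin", "Dallas");
        System.out.println(applyToSelected(cities, String::toUpperCase));
    }
}
```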

thx

Chris Gerken

chrisger...@mindspring.com
512.587.5261
http://www.linkedin.com/in/chgerken



On Feb 17, 2012, at 10:07 AM, Chris Gerken wrote:

> Don,
> 
> That's a good idea, but you have to be careful not to preclude the use of 
> dynamic column families (e.g. CFs with time-series-like schemas), which is 
> what Cassandra is best at.  The right approach is to build your own 
> "ORM"/persistence layer (or generate one with some tools) that can hide the 
> API differences between static and dynamic CFs.  Once you're there, Hadoop 
> and Pig both come very close to what you're asking for.
> 
> In other words, you should be asking for a means to apply a Java method to 
> selected objects (not rows) that are persisted in a Cassandra column family.
> 
> thx
> 
> - Chris
> 
> Chris Gerken
> 
> chrisger...@mindspring.com
> 512.587.5261
> http://www.linkedin.com/in/chgerken
> 
> 
> 
> On Feb 17, 2012, at 9:35 AM, Don Smith wrote:
> 
>> Are there plans to build some sort of map-reduce framework into Cassandra 
>> and CQL?  It seems that users should be able to apply a Java method to 
>> selected rows in parallel on the distributed Cassandra JVMs.  I believe 
>> Solandra uses such an integration.
>> 
>> Don
>> ________________________________________
>> From: Alessio Cecchi [ales...@skye.it]
>> Sent: Friday, February 17, 2012 4:42 AM
>> To: user@cassandra.apache.org
>> Subject: General questions about Cassandra
>> 
>> Hi,
>> 
>> we have developed software that stores logs from mail servers in MySQL,
>> but for huge environments we are developing a version that stores this
>> data in HBase. Once a day the raw logs are first normalized, so the output
>> looks like this:
>> 
>> username, date of login, IP address, protocol
>> username, date of login, IP address, protocol
>> username, date of login, IP address, protocol
>> [...]
>> 
>> and then inserted into the database.
>> 
>> As I was saying, for huge installations (from 1 to 10 million logins
>> per day, kept for 12 months) we are working with HBase, but I would also
>> like to consider Cassandra.
>> 
>> The advantage of HBase is MapReduce, which makes searching the logs very
>> fast by running the "query" concurrently on multiple hosts.
>> 
>> Queries will be launched from a web interface (there will be few requests
>> per day) and the search keys are user and time range.
>> 
>> But Cassandra seems less complex to manage and simpler to run, so I want
>> to evaluate it instead of HBase.
>> 
>> My question is: can Cassandra also split a "query" over the cluster like
>> MapReduce? Reading online, Cassandra seems fast at inserting data but
>> slower than HBase at "querying". Is it really so?
>> 
>> We would rather not install Hadoop on top of Cassandra.
>> 
>> Any suggestion is welcome :-)
>> 
>> --
>> Alessio Cecchi is:
>> @ ILS ->  http://www.linux.it/~alessice/
>> on LinkedIn ->  http://www.linkedin.com/in/alessice
>> GNU/Linux systems support ->  http://www.cecchi.biz/
>> @ PLUG ->  former President, now senator for life, http://www.prato.linux.it
>> @ LOLUG ->  Member, http://www.lolug.net
>> 
> 
