organizing data flow in and out of my cluster

Alexandru Dan Sicoe Fri, 09 Dec 2011 07:12:56 -0800

Hi,

I am thinking of strategies to deploy my application that uses a 3 node
Cassandra cluster.


Quick recap: I have several client applications that feed about 2
million different variables (each representing a different monitoring
value/channel) into Cassandra. The system receives updates for each of
these monitoring values at different rates. For each new update, the
timestamp and value are recorded in a Cassandra name-value pair. The
schema is built using one CF for data and 4 other CFs for metadata. The
data CF uses rows as 4 hour time bins. The system can currently sustain
the insertion load. Now I'm looking into retrieval performance for random
queries and organizing the flow of data in and out of the cluster.
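
To make the time binning concrete, here is a minimal sketch of how a
timestamp could map to its 4 hour bin. The row key layout (channel id plus
bin start) and the names bin_start and row_key are assumptions made up for
illustration only; the actual key format is not spelled out in this mail.

```python
BIN_SECONDS = 4 * 60 * 60  # one row covers a 4 hour time bin


def bin_start(ts_seconds: float) -> int:
    """Return the epoch second at which the bin containing ts_seconds starts."""
    return int(ts_seconds // BIN_SECONDS) * BIN_SECONDS


def row_key(channel_id: str, ts_seconds: float) -> str:
    """Hypothetical row key: '<channel id>:<bin start in epoch seconds>'."""
    return f"{channel_id}:{bin_start(ts_seconds)}"


if __name__ == "__main__":
    # An update 1.5 s into the second bin lands in the row starting at 14400.
    print(row_key("ch42", 14401.5))
```

Every update for the same channel within the same 4 hours ends up in the
same row, with the timestamp as the name of the name-value pair.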

The main concern at the moment is about organizing the flow of data in and
out of the cluster. Why do I need to organize the data out? Well, my
requirement is to keep all the data coming into the system at the highest
granularity for long term (several years). The 3 node cluster I mentioned
is the online cluster which is supposed to be able to absorb the input load
for a relatively short period of time, a few weeks. After this period the
data has to be shipped out of the cluster to a mass storage facility and
the cluster needs to be emptied to make room for more data. The online
cluster will also serve reads while it takes in data.
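
As a rough illustration of the retention window in terms of 4 hour bins,
the sketch below lists the bins that lie entirely before the cutoff and
would therefore be due for export. The function name and its parameters are
hypothetical, and the retention period is whatever "a few weeks" turns out
to be; this is not an existing tool.

```python
BIN_SECONDS = 4 * 60 * 60  # 4 hour time bins


def bins_due_for_export(oldest_ts, now_ts, retention_seconds):
    """Return start times (epoch seconds) of bins that end at or before the cutoff.

    oldest_ts is the timestamp of the oldest data still in the cluster.
    A bin is only listed once it is complete and older than the retention
    period, so a bin that may still receive writes is never listed.
    """
    cutoff = now_ts - retention_seconds
    start = int(oldest_ts // BIN_SECONDS) * BIN_SECONDS
    due = []
    while start + BIN_SECONDS <= cutoff:
        due.append(start)
        start += BIN_SECONDS
    return due


if __name__ == "__main__":
    three_weeks = 21 * 24 * 60 * 60
    one_day = 24 * 60 * 60
    # Data since t=0, now = 3 weeks + 1 day: exactly one day of bins is due.
    print(len(bins_due_for_export(0, three_weeks + one_day, three_weeks)))
```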

One solution would be to stop the system every few weeks, export the
data, truncate the CFs, and then start taking data again. In a few weeks
a lot of data will have accumulated - hundreds of GBytes - which makes
the two operations lengthy and error prone. The problem is that the system
cannot afford downtime. So I am looking for solutions that keep the online
system taking data and serving reads without being affected too much by
exporting data and truncating.

As DataStax splits the cluster into an online and an offline part, I am
thinking of having 2 nodes in one data center (DC_X) and the 3rd node in
the other data center (DC_Y). The clients will be writing to all 3 nodes.
Using a replication factor of 2 ensures that replicas of the data on the
nodes in DC_X will always be sent to the node in DC_Y. That means that the
cluster will be unbalanced, but that's fine because the node in DC_Y will
contain all the data in the system. From time to time I can export the
data from this node - which means that its performance will go down a lot.
Will the system be able to sustain the exporting of all data from the node
in DC_Y from time to time? After I finish exporting I will want to empty
the data in the cluster. How about truncating the CFs? Can I truncate the
CFs while the 3 nodes are in operation? Will this affect performance a
lot? - I know it's probably dependent on data size....how to go about this?
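
To show what I mean by unbalanced, here is a back-of-the-envelope model.
It assumes one replica of every row is placed in each data center and that
rows are spread evenly over the nodes within a data center; the function
name is made up, and it models nothing else about Cassandra.

```python
def fraction_per_node(nodes_per_dc, replicas_per_dc):
    """Fraction of all rows stored on each node of a data center.

    Assumes an even spread of rows over the nodes inside one data center.
    """
    return {
        dc: replicas_per_dc[dc] / node_count
        for dc, node_count in nodes_per_dc.items()
    }


if __name__ == "__main__":
    layout = fraction_per_node(
        nodes_per_dc={"DC_X": 2, "DC_Y": 1},
        replicas_per_dc={"DC_X": 1, "DC_Y": 1},  # replication factor 2 in total
    )
    print(layout)  # each DC_X node holds half the rows, the DC_Y node holds all
```

Under these assumptions the single node in DC_Y stores twice as much as
either node in DC_X, which is why exporting from it worries me.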

Cheers,
Alex
