Re: Cassandra Database Modeling

aaron morton Mon, 11 Apr 2011 15:44:45 -0700

The tricky part here is the level of flexibility you want for the querying. In 
general you will want to denormalise to support the read queries.

If your queries are not interactive you may be able to use Hadoop / Pig / Hive 
(e.g. http://www.datastax.com/products/brisk), in which case you can probably 
have a simpler data model where you spend less effort supporting the queries. 
But it sounds like you need interactive queries as part of the experiment.

You could store the data per pair in a standard CF (let's call it the pair CF) 
as follows:

- key: experiment_id.time_interval
- column name: pair_id
- column value: distance, angle, other data packed together as JSON or some 
other format

This would support a basic record of what happened: for each time interval you 
can get the list of all pairs and read their data.
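
As a rough sketch of the pair CF layout above (plain Python, no Cassandra 
client involved; the helper names and the JSON field names are my own 
assumptions, not part of any API):

```python
import json

def pair_row_key(experiment_id, time_interval):
    # Row key layout described above: experiment_id.time_interval
    return f"{experiment_id}.{time_interval}"

def pair_column(pair_id, distance, angle):
    # Column name is the pair id; the value packs the measurements as JSON.
    # The field names "distance" and "angle" are illustrative assumptions.
    name = str(pair_id)
    value = json.dumps({"distance": distance, "angle": angle}, sort_keys=True)
    return name, value

key = pair_row_key(2, 5)
name, value = pair_column(123456789, 500, 42)
```

Reading a whole row for key "2.5" then gives every pair for that time 
interval, with the packed value decoded client side.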

To support your spatial queries you could use two standard CFs as 
follows:

distance CF:
- key: experiment_id.time_interval
- column name: zero_padded_distance.pair_id
- column value: empty or the angle 

angle CF:
- key: experiment_id.time_interval
- column name: zero_padded_angle.pair_id
- column value: empty or the distance

(two pairs can have the same distance and/or angle in same time slice)

Here we are using the column name as a compound value, and I am assuming they 
can be byte ordered. So for distance the column name looks something like 
000500.123456789. You would then use the Byte comparator (or similar) for the 
columns.
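
A minimal sketch of building such compound names, assuming the measurements 
are non-negative integers that fit a fixed width (the widths of 6 and 9 digits 
are taken from the example above; the function name is hypothetical):

```python
def compound_name(measure, pair_id, measure_width=6, pair_width=9):
    # Zero-pad both parts so that byte-wise (lexicographic) order matches
    # numeric order. Values that overflow the width would break the ordering,
    # so reject them instead of silently writing a badly sorted column.
    if not (0 <= measure < 10 ** measure_width):
        raise ValueError("measure does not fit the padded width")
    if not (0 <= pair_id < 10 ** pair_width):
        raise ValueError("pair_id does not fit the padded width")
    return f"{measure:0{measure_width}d}.{pair_id:0{pair_width}d}".encode("ascii")

names = [compound_name(d, p) for d, p in [(500, 123456789), (99, 7), (100, 3), (99, 2)]]
ordered = sorted(names)  # byte order, as a byte comparator would see it
```

Sorting the encoded names byte-wise puts distance 99 before 100 before 500, 
with ties on distance broken by pair_id. Without the padding, "99" would sort 
after "100".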

To find all of the particles for experiment 2 at t5 where distance is < 100 you 
would use a get_slice (see http://wiki.apache.org/cassandra/API or your higher 
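level client docs) against the key "2.5" with a SliceRange start at 
"000000.000000000" and finish at "000099.999999999" (slice bounds are 
inclusive, so a finish of "000100.999999999" would also return pairs at a 
distance of exactly 100). Once you have this list of 
columns you can either filter client side for the angle or issue another query 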
for the particles inside the angle range. Then join the two results client side 
using the pair_id returned in the column names. 
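
The slice-then-join idea can be sketched in plain Python, with sorted lists 
standing in for the columns of one row in each CF. The get_slice function here 
is a local stand-in written for this example, not the Thrift call, and the 
sample data is made up. It assumes integer distances and angles and inclusive 
slice bounds:

```python
from bisect import bisect_left, bisect_right

def get_slice(sorted_names, start, finish):
    # In-memory stand-in for a column slice on one row: returns every
    # column name between start and finish, inclusive at both ends.
    lo = bisect_left(sorted_names, start)
    hi = bisect_right(sorted_names, finish)
    return sorted_names[lo:hi]

def pair_ids(names):
    # The pair_id is the part of the compound name after the first dot.
    return {name.split(".", 1)[1] for name in names}

# Made-up columns for row key "2.5" in the distance CF and the angle CF.
distance_row = sorted(["000050.000000001", "000099.000000002",
                       "000100.000000003", "000250.000000004"])
angle_row = sorted(["000010.000000002", "000045.000000001",
                    "000090.000000004", "000120.000000003"])

# distance < 100: finish just below 100 so a distance of exactly 100 is excluded.
near = get_slice(distance_row, "000000.000000000", "000099.999999999")
# angle > 30: start at 31, finish at the largest possible name.
wide = get_slice(angle_row, "000031.000000000", "999999.999999999")

# Client-side join on the pair_id embedded in the column names.
matches = pair_ids(near) & pair_ids(wide)
```

With this sample data only pair 000000001 is both closer than 100 and at an 
angle above 30. Filtering client side instead would mean storing the angle as 
the column value in the distance CF and skipping the second slice.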

By using the same key for all 3 CFs all the data for a time slice will be 
stored on the same nodes. You can potentially spread this around by using 
slightly different keys so they may hash to different areas of the cluster, 
e.g. experiment_id.time_interval."distance"
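
To illustrate why a key suffix moves the rows around: with a hash-based 
partitioner such as the RandomPartitioner, placement is driven by an MD5 hash 
of the row key. The token function below is a simplified stand-in for that, 
and the ".pair" and ".angle" suffixes are hypothetical names chosen to match 
the "distance" example:

```python
import hashlib

def token(row_key):
    # Simplified stand-in for an MD5-based partitioner: the row key's MD5
    # digest, read as an integer, decides where the row lands on the ring.
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)

base = "2.5"  # experiment_id.time_interval
keys = [base + ".pair", base + ".distance", base + ".angle"]
tokens = {k: token(k) for k in keys}
```

Each suffixed key gets its own token, so the three rows for a time slice are 
no longer tied to the same replicas. Whether they actually end up on different 
nodes depends on the ring layout.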

Data volume is not a concern, and it's not possible to talk about performance 
until you have an idea of the workload and required throughput. But writes are 
fast and I think your reads would be fast as well, as the row data for distance 
and angle will not change so caches will be useful.

Hope that helps. 
Aaron

On 12 Apr 2011, at 03:01, Shalom wrote:

> I would like to save statistics on 10,000,000 (ten millions) pairs of
> particles, how they relate to one another in any given space in time.
> 
> So suppose that within a total experiment time of T1..T1000 (assume that T1
> is when the experiment starts, and T1000 is the time when the experiment
> ends) I would like, per each pair of particles, to measure the relationship
> between every Tn -- T(n+1) interval:
> 
> T1..T2 (this is the first interval)
> 
> T2..T3
> 
> T3..T4
> 
> ......
> 
> ......
> 
> T9,999,999..T10,000,000 (this is the last interval)
> 
> For each such a particle pair (there are 10,000,000 pairs) I would like to
> save some figures (such as distance, angel etc) on each interval of [
> Tn..T(n+1) ]
> 
> Once saved, the query I will be using to retrieve this data is as follows:
> "give me all particle pairs on time interval [ Tn..T(n+1) ] where the
> distance between the two particles is smaller than X and the angle between
> the two particles is greater than Y". Meaning, the query will always take
> place for all particle pairs on a certain interval of time.
> 
> How would you model this in Cassandra, so that the writes/reads are
> optimized? given the database size involved, can you recommend on a suitable
> solution? (I have been recommended to both MongoDB / Cassandra).
> 
> I should mention that the data does change often -- we run many such
> experiments (different particle sets / thousands of experiments) and would
> need a very decent performance of reads/writes.
> 
> Is Cassandra suitable for this time of work?
> 
> 
> --
> View this message in context: 
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-Database-Modeling-tp6261778p6261778.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
> Nabble.com.
