Re: Cassandra Database Modeling

aaron morton Tue, 12 Apr 2011 15:15:00 -0700

Yes, by interactive I mean real-time queries. Hadoop-based techniques are for 
queries that are not time critical, but they do offer greater analytical 
capabilities.


particle_pairs:
1) Yes and no and sort of. Under the hood the get_slice api call will be used 
by your client library to pull back chunks of (ordered) columns. Most client 
libraries abstract away the chunking for you. 

2) If you are using a packed structure like JSON then no, Cassandra will have 
no idea what you've put in the columns other than bytes. It really depends on 
how much data you have per pair, but generally it's easier to pull back more 
data than try to get exactly what you need. The downside is that you have to 
rewrite the whole packed value whenever any part of it changes.

3) No, you would need to update all the data for the pair. I was assuming most 
of the data was written once, and that your simulation had something like a 
stop-the-world phase between time slices where state was dumped and then read 
to start the next interval. You could either read it first, or we can come up 
with something else.
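
To make points 2 and 3 concrete, here is a minimal sketch of the 
read-modify-write cycle for a packed value. It uses a plain Python dict as a 
hypothetical stand-in for one row of the pair CF, not a real client call. The 
pair ids are the ones from your question; the field values are made up for 
illustration.

```python
import json

# Hypothetical stand-in for one row of the pair CF:
# pair_id (column name) -> packed JSON bytes (column value).
row = {
    "2389": json.dumps({"distance": 12.5, "angle": 30.0}).encode("utf-8"),
    "93434": json.dumps({"distance": 80.0, "angle": 45.0}).encode("utf-8"),
}


def update_distance(row, pair_id, new_distance):
    """Change one field of a packed value by rewriting the whole value."""
    data = json.loads(row[pair_id].decode("utf-8"))   # read and unpack
    data["distance"] = new_distance                    # modify one field
    row[pair_id] = json.dumps(data).encode("utf-8")    # write everything back
    return data


# Update two pairs with new distance values; the angle is carried along
# unchanged only because we read it first.
for pair_id, distance in {"2389": 7.25, "93434": 91.0}.items():
    update_distance(row, pair_id, distance)
```

With a real client the read and the write would each be a call to the 
cluster, which is why it matters whether the data is mostly written once.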

distance_cf
1) The query would return a list of columns, each of which has a name and a 
value (as well as a timestamp and TTL).
2) It depends on the client library. If you are using Python, go for 
https://github.com/pycassa/pycassa It will return objects.
3) Returning millions of columns is going to be slow; it would also be slow 
using an RDBMS. Creating millions of objects in Python is going to be slow. You 
would need to have a better idea of what queries you will actually want to run 
to see if it's *too* slow. If it is, one approach is to store the particles at 
the same distance in the same column, so you need to read fewer columns. Again, 
it depends on how your sim works. 
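
As a rough illustration of what the client ends up doing, the sketch below 
builds the compound column names, pages through a range of them in chunks, and 
parses the results back into numbers. It is pure Python over a sorted list, so 
get_slice and iter_slice here are hypothetical stand-ins and not the pycassa 
or Thrift API. The sample distances and pair ids are made up, and distances 
are assumed to be non-negative integers.

```python
from bisect import bisect_left, bisect_right


def column_name(distance, pair_id):
    """Build 'zero_padded_distance.pair_id', e.g. 000500.123456789."""
    return "%06d.%09d" % (distance, pair_id)


def parse_column_name(name):
    """Split a compound column name back into (distance, pair_id)."""
    distance, pair_id = name.split(".")
    return int(distance), int(pair_id)


def get_slice(sorted_names, start, finish, count):
    """Stand-in for a slice call: up to `count` names in [start, finish]."""
    lo = bisect_left(sorted_names, start)
    hi = bisect_right(sorted_names, finish)
    return sorted_names[lo:hi][:count]


def iter_slice(sorted_names, start, finish, chunk=2):
    """Page through the range one chunk at a time."""
    while True:
        page = get_slice(sorted_names, start, finish, chunk)
        for name in page:
            yield name
        if len(page) < chunk:
            return
        # Restart just after the last name seen; appending "\x00" gives the
        # smallest string that sorts after it.
        start = page[-1] + "\x00"


# Columns of one row, kept in sorted (byte) order as the comparator would.
names = sorted(
    column_name(d, p)
    for d, p in [(500, 123456789), (20, 1), (100, 7), (99, 42), (101, 3)]
)

# Same bounds as in the earlier message. The end of the range is inclusive,
# so a pair at distance exactly 100 is returned as well.
matches = [
    parse_column_name(n)
    for n in iter_slice(names, "000000.000000000", "000100.999999999")
]
```

The chunk size is tiny here only to show the paging. The parsing step is the 
per-column work that adds up when a slice returns millions of columns.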
  
Time complexity depends on the number of columns read. Finding a row will not 
be O(1), as it may have to read from several files. The cost of writes is more 
constant than the cost of reads. But remember, you can have a lot of IO and CPU 
power in your cluster.

Best advice is to jump in and see if the data model works for you at a small, 
single-node scale. Most performance issues can be solved. 

Aaron

On 12 Apr 2011, at 15:34, csharpplusproject wrote:

> Hi Aaron,
> 
> Yes, of course it helps, I am starting to get a flavor of Cassandra -- thank 
> you very much!
> 
> First of all, by 'interactive' queries, are you referring to 'real-time' 
> queries? (meaning, where experiments data is 'streaming', data needs to be 
> stored and following that, the query needs to be run in real time)?
> 
> Looking at the design of the particle pairs:
> 
> - key: expriement_id.time_interval 
> - column name: pair_id 
> - column value: distance, angle, other data packed together as JSON or some 
> other format
> 
> A couple of questions:
> 
> (1) Will a query such as pairID[ expriement_id.time_interval ] basically 
> return an array of all pairIDs for the experiment, where each item is a 
> 'packed' JSON?
> (2) Would it be possible, rather than returning the whole JSON object per 
> every pairID, to get (say) only the distance?
> (3) Would it be possible to easily update certain 'pairIDs' with new values 
> (for example, update pairIDs = {2389, 93434} with new distance values)? 
> 
> Looking at the design of the distance CF (for example):
> 
> this is VERY INTERESTING. basically you are suggesting a design that will 
> save the actual distance between each pair of particles, and will allow 
> queries where we can find all pairIDs (for an experiment, on time_interval) 
> that meet a certain distance criteria. VERY, VERY INTERESTING!
> 
> A couple of questions:
> 
> (1) Will a query such as distanceCF[ expriement_id.time_interval ] basically 
> return an array of all 'zero_padded_distance.pair_id' elements for the 
> experiment?
> (2) In such a case, I will get (presumably) a python list where every item is 
> a string (and I will need to process it)?
> (3) Given the fact that we're doing a slice on millions of columns (?), any 
> idea how fast such an operation would be?
> 
> 
> Just to make sure I understand, is it true that in both situations, the query 
> complexity is basically O(1) since it's simply a HASH?
> 
> 
> Thank you for all of your help!
> 
> Shalom.
> 
> -----Original Message-----
> From: aaron morton <aa...@thelastpickle.com>
> Reply-to: user@cassandra.apache.org
> To: user@cassandra.apache.org
> Subject: Re: Cassandra Database Modeling
> Date: Tue, 12 Apr 2011 10:43:42 +1200
> 
> The tricky part here is the level of flexibility you want for the querying. 
> In general you will want to denormalise to support the read queries.   
> 
> If your queries are not interactive you may be able to use Hadoop / Pig / 
> Hive e.g. http://www.datastax.com/products/brisk In which case you can 
> probably have a simpler data model where you spend less effort supporting the 
> queries. But it sounds like you need interactive queries as part of the 
> experiment. 
> 
> You could store the data per pair in a standard CF (lets call it the pair cf) 
> as follows: 
> 
> - key: expriement_id.time_interval 
> - column name: pair_id 
> - column value: distance, angle, other data packed together as JSON or some 
> other format 
> 
> This would support a basic record of what happened, for each time interval 
> you can get the list of all pairs and read their data.  
> 
> To support your spatial queries you could use two standard CFs as follows: 
> 
> distance CF: 
> - key: experiment_id.time_interval 
> - column name: zero_padded_distance.pair_id 
> - column value: empty or the angle 
> 
> angle CF: 
> - key: experiment_id.time_interval 
> - column name: zero_padded_angle.pair_id 
> - column value: empty or the distance 
> 
> (two pairs can have the same distance and/or angle in same time slice) 
> 
> Here we are using the column name as a compound value, and I am assuming the 
> names can be byte ordered. So for distance the column name looks something 
> like 000500.123456789. You would then use the Byte comparator (or similar) 
> for the columns. 
> 
> To find all of the particles for experiment 2 at t5 where distance is < 100 
> you would use a get_slice (see http://wiki.apache.org/cassandra/API or your 
> higher level client docs) against the key "2.5" with a SliceRange start at 
> "000000.000000000" and finish at "000100.999999999". Once you have this list 
> of columns you can either filter client side for the angle or issue another 
> query for the particles inside the angle range. Then join the two results 
> client side using the pair_id returned in the column names.  
> 
> By using the same key for all 3 CF's all the data for a time slice will be 
> stored on the same nodes. You can potentially spread this around by using 
> slightly different keys so they may hash to different areas of the cluster. 
> e.g. expriement_id.time_interval."distance" 
> 
> Data volume is not a concern, and it's not possible to talk about performance 
> until you have an idea of the workload and required throughput. But writes 
> are fast and I think your reads would be fast as well, as the row data for 
> distance and angle will not change so caches will be useful. 
> 
> Hope that helps.  Aaron 
> 
> On 12 Apr 2011, at 03:01, Shalom wrote: 
>> I would like to save statistics on 10,000,000 (ten millions) pairs of
>> particles, how they relate to one another in any given space in time.
>> 
>> So suppose that within a total experiment time of T1..T1000 (assume that T1
>> is when the experiment starts, and T1000 is the time when the experiment
>> ends) I would like, per each pair of particles, to measure the relationship
>> between every Tn -- T(n+1) interval:
>> 
>> T1..T2 (this is the first interval)
>> 
>> T2..T3
>> 
>> T3..T4
>> 
>> ......
>> 
>> ......
>> 
>> T9,999,999..T10,000,000 (this is the last interval)
>> 
>> For each such a particle pair (there are 10,000,000 pairs) I would like to
>> save some figures (such as distance, angle etc) on each interval of [
>> Tn..T(n+1) ]
>> 
>> Once saved, the query I will be using to retrieve this data is as follows:
>> "give me all particle pairs on time interval [ Tn..T(n+1) ] where the
>> distance between the two particles is smaller than X and the angle between
>> the two particles is greater than Y". Meaning, the query will always take
>> place for all particle pairs on a certain interval of time.
>> 
>> How would you model this in Cassandra, so that the writes/reads are
>> optimized? given the database size involved, can you recommend on a suitable
>> solution? (I have been recommended to both MongoDB / Cassandra).
>> 
>> I should mention that the data does change often -- we run many such
>> experiments (different particle sets / thousands of experiments) and would
>> need a very decent performance of reads/writes.
>> 
>> Is Cassandra suitable for this type of work?
>> 
>> 
>> --
>> View this message in context: 
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-Database-Modeling-tp6261778p6261778.html
>> Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
>> Nabble.com.
>> 
> 
> 
>
