> > Is there a limit to the size that can be stored in one 'cell' (by 'cell'
> > I mean the intersection between a *key* and a *data column*)? Is there a
> > limit to the size of data of one *key*? One *data column*?
http://wiki.apache.org/cassandra/CassandraLimitations

The data in Cassandra is partitioned by the row key; therefore, if you want
to put all pairs into the same row, you should consider the disk size.

> > Thanks in advance for any help / guidance.
>
> -----Original Message-----
> From: aaron morton <aa...@thelastpickle.com>
> Reply-to: user@cassandra.apache.org
> To: user@cassandra.apache.org
> Subject: Re: Cassandra Database Modeling
> Date: Wed, 13 Apr 2011 10:14:21 +1200
>
> Yes, interactive == real-time queries. Hadoop-based techniques are for
> non-time-critical queries, but they do have greater analytical
> capabilities.
>
> particle_pairs:
> 1) Yes and no, and sort of. Under the hood the get_slice API call will be
> used by your client library to pull back chunks of (ordered) columns.
> Most client libraries abstract away the chunking for you.
>
> 2) If you are using a packed structure like JSON then no, Cassandra will
> have no idea what you've put in the columns other than bytes. It really
> depends on how much data you have per pair, but generally it's easier to
> pull back more data than to try to get exactly what you need. The
> downside is that you have to update all the data.
>
> 3) No, you would need to update all the data for the pair. I was assuming
> most of the data was written once, and that your simulation had something
> like a stop-the-world phase between time slices where state was dumped
> and then read to start the next interval. You could either read it first,
> or we can come up with something else.
>
> distance_cf:
> 1) The query would return a list of columns, which have a name and a
> value (as well as a timestamp and TTL).
> 2) Depends on the client library; if using Python, go for
> https://github.com/pycassa/pycassa. It will return objects.
> 3) Returning millions of columns is going to be slow; it would also be
> slow using an RDBMS.
> Creating millions of objects in Python is going to be slow. You would
> need a better idea of which queries you will actually want to run to see
> if it's *too* slow. If it is, one approach is to store the particles at
> the same distance in the same column, so you need to read fewer columns.
> Again, it depends on how your sim works. Time complexity depends on the
> number of columns read. Finding a row will not be O(1), as it may have to
> read from several files. Writes are more constant than reads. But
> remember, you can have a lot of IO and CPU power in your cluster.
>
> Best advice is to jump in and see if the data model works for you at a
> small single-node scale; most performance issues can be solved.
>
> Aaron
>
> On 12 Apr 2011, at 15:34, csharpplusproject wrote:
>
> Hi Aaron,
>
> Yes, of course it helps, I am starting to get a flavor of *Cassandra* --
> thank you very much!
>
> First of all, by 'interactive' queries, are you referring to 'real-time'
> queries? (Meaning, where experiment data is 'streaming', data needs to be
> stored, and following that, the query needs to be run in real time?)
>
> *Looking at the design of the particle pairs:*
>
> - key: experiment_id.time_interval
> - column name: pair_id
> - column value: distance, angle, other data packed together as JSON or
>   some other format
>
> *A couple of questions:*
>
> (1) Will a query such as *pairID[ experiment_id.time_interval ]*
> basically return an array of all pairIDs for the experiment, where each
> item is a 'packed' JSON?
> (2) Would it be possible, rather than returning the whole JSON object for
> every pairID, to get (say) only the distance?
> (3) Would it be possible to easily update certain 'pairIDs' with new
> values (for example, update pairIDs = {2389, 93434} with new *distance*
> values)?
>
> *Looking at the design of the distance CF (for example):*
>
> this is VERY INTERESTING.
> Basically, you are suggesting a design that will save the actual distance
> between each pair of particles, and will allow queries where we can find
> all pairIDs (for an experiment, on a time_interval) that meet a certain
> distance criterion. VERY, VERY INTERESTING!
>
> *A couple of questions:*
>
> (1) Will a query such as *distanceCF[ experiment_id.time_interval ]*
> basically return an array of all '*zero_padded_distance.pair_id*'
> elements for the experiment?
> (2) In such a case, will I get (presumably) a Python list where every
> item is a string (which I will need to process)?
> (3) Given the fact that we're doing a slice on millions of columns (?),
> any idea how fast such an operation would be?
>
> Just to make sure I understand, is it true that in both situations the
> query complexity is basically O(1), since it's simply a HASH?
>
> Thank you for all of your help!
>
> Shalom.
>
> -----Original Message-----
> From: aaron morton <aa...@thelastpickle.com>
> Reply-to: user@cassandra.apache.org
> To: user@cassandra.apache.org
> Subject: Re: Cassandra Database Modeling
> Date: Tue, 12 Apr 2011 10:43:42 +1200
>
> The tricky part here is the level of flexibility you want for the
> querying. In general you will want to denormalise to support the read
> queries.
>
> If your queries are not interactive you may be able to use Hadoop / Pig /
> Hive, e.g. http://www.datastax.com/products/brisk In which case you can
> probably have a simpler data model where you spend less effort supporting
> the queries. But it sounds like you need interactive queries as part of
> the experiment.
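[Editor's note: the zero-padded compound column names discussed in this
thread (e.g. zero_padded_distance.pair_id) work because the string sort
order of the names matches the numeric order of the distances. A minimal
Python sketch of such encode/decode helpers; the function names and the
field widths (6 digits for distance, 9 for pair_id) are illustrative
assumptions, not part of any client library:]

```python
# Widths are assumptions for this sketch; pick them to fit your value ranges.
DIST_WIDTH = 6
PAIR_WIDTH = 9

def make_column_name(distance, pair_id):
    """Zero-pad both parts so lexicographic order == numeric order."""
    return str(distance).zfill(DIST_WIDTH) + "." + str(pair_id).zfill(PAIR_WIDTH)

def parse_column_name(name):
    """Recover (distance, pair_id) from a compound column name."""
    dist, pair = name.split(".")
    return int(dist), int(pair)
```

[Sorting the encoded names then sorts by distance, which is what the
slice-range query described below depends on.]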
> You could store the data per pair in a standard CF (let's call it the
> pair CF) as follows:
>
> - key: experiment_id.time_interval
> - column name: pair_id
> - column value: distance, angle, other data packed together as JSON or
>   some other format
>
> This would support a basic record of what happened; for each time
> interval you can get the list of all pairs and read their data.
>
> To support your spatial queries you could use two standard CFs as
> follows:
>
> distance CF:
> - key: experiment_id.time_interval
> - column name: zero_padded_distance.pair_id
> - column value: empty or the angle
>
> angle CF:
> - key: experiment_id.time_interval
> - column name: zero_padded_angle.pair_id
> - column value: empty or the distance
>
> (Two pairs can have the same distance and/or angle in the same time
> slice.)
>
> Here we are using the column name as a compound value, and I am assuming
> they can be byte ordered. So for distance the column name looks something
> like 000500.123456789. You would then use the Byte comparator (or
> similar) for the columns.
>
> To find all of the particles for experiment 2 at t5 where distance is
> < 100, you would use a get_slice (see http://wiki.apache.org/cassandra/API
> or your higher-level client docs) against the key "2.5" with a SliceRange
> start at "000000.000000000" and finish at "000100.999999999". Once you
> have this list of columns you can either filter client side for the angle
> or issue another query for the particles inside the angle range, then
> join the two results client side using the pair_id returned in the column
> names.
>
> By using the same key for all 3 CFs, all the data for a time slice will
> be stored on the same nodes. You can potentially spread this around by
> using slightly different keys so they may hash to different areas of the
> cluster, e.g.
> experiment_id.time_interval."distance"
>
> Data volume is not a concern, and it's not possible to talk about
> performance until you have an idea of the workload and required
> throughput. But writes are fast, and I think your reads would be fast as
> well, as the row data for distance and angle will not change, so caches
> will be useful.
>
> Hope that helps.
> Aaron
>
> On 12 Apr 2011, at 03:01, Shalom wrote:
>
> I would like to save statistics on 10,000,000 (ten million) pairs of
> particles: how they relate to one another in any given space in time.
>
> So suppose that within a total experiment time of T1..T1000 (assume that
> T1 is when the experiment starts, and T1000 is the time when the
> experiment ends) I would like, per each pair of particles, to measure the
> relationship over every Tn..T(n+1) interval:
>
> T1..T2 (this is the first interval)
> T2..T3
> T3..T4
> ......
> ......
> T9,999,999..T10,000,000 (this is the last interval)
>
> For each such particle pair (there are 10,000,000 pairs) I would like to
> save some figures (such as distance, angle, etc.) on each interval of
> [ Tn..T(n+1) ].
>
> Once saved, the query I will be using to retrieve this data is as
> follows: "give me all particle pairs on time interval [ Tn..T(n+1) ]
> where the distance between the two particles is smaller than X and the
> angle between the two particles is greater than Y". Meaning, the query
> will always take place over all particle pairs on a certain interval of
> time.
>
> How would you model this in Cassandra, so that the writes/reads are
> optimized? Given the database size involved, can you recommend a suitable
> solution? (I have been recommended both MongoDB and Cassandra.)
>
> I should mention that the data does change often -- we run many such
> experiments (different particle sets / thousands of experiments) and
> would need very decent read/write performance.
>
> Is Cassandra suitable for this type of work?
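[Editor's note: Aaron's slice-range example above can be sketched in plain
Python. The in-memory dict below stands in for one row of the distance CF;
a real client such as pycassa would issue the equivalent get_slice against
the cluster. The row contents, helper names, and zero-pad widths are
assumptions for illustration:]

```python
# One row of the distance CF, keyed "experiment_id.time_interval" (here
# "2.5" = experiment 2, interval t5). Column names are
# zero_padded_distance.pair_id; values carry the angle, as suggested above.
row = {
    "000042.000000001": 30.0,
    "000099.000000002": 75.0,
    "000100.000000003": 10.0,
    "000250.000000004": 45.0,
}

def slice_columns(row, start, finish):
    """Mimic get_slice with a SliceRange: return (name, value) pairs whose
    names fall in [start, finish], in byte (string) order."""
    return [(name, row[name]) for name in sorted(row)
            if start <= name <= finish]

# The range from the thread. Note that a finish of "000100.999999999" also
# admits pairs at distance exactly 100; a strict "< 100" needs either a
# finish of "000099.999999999" or a client-side check.
cols = slice_columns(row, "000000.000000000", "000100.999999999")

# Client-side filter on the angle, then recover pair_ids from the names.
matches = [int(name.split(".")[1]) for name, angle in cols if angle > 60]
```

[The same pattern, run against the angle CF instead, gives the second
candidate list; joining the two on pair_id client side answers the combined
distance-and-angle query.]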
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-Database-Modeling-tp6261778p6261778.html
> Sent from the cassandra-u...@incubator.apache.org mailing list archive at
> Nabble.com.