If you are collecting time series data, and assuming the flying turtles we live 
on that swim through time do not stop, you will want to partition your data. 
(background http://www.slideshare.net/mattdennis/cassandra-data-modeling)

Let's say it makes sense for you to partition by month (that may not be the case, but 
it's easy for now), so your partition keys will look like "201109". Also, I'm not 
sure about the first requirement (columns storing 500KB of data), so I'll 
just talk about the URLs. 

CF: domain_partitions - used to find which partitions a domain has data in
key = <domain> 
column name = <partition_key>
column value = EMPTY

CF: url_time_series - stores the URLs for a domain in a partition
key = <domain> '+' <partition_key>
column name = time uuid
column value = url

CF: url_payload - stores additional URL data
key = <domain> '+' <partition_key> '+' <time_uuid>
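To make the key layout concrete, here's a small Python sketch of how the partition key and the composed row keys could be built. The helper names and the `'+'` separator follow the CF definitions above; everything else (date handling, the specific functions) is just illustration, not a library API:

```python
from datetime import datetime, timezone
import uuid

def partition_key(when):
    """Monthly partition key, e.g. '201109'."""
    return when.strftime("%Y%m")

def url_time_series_key(domain, when):
    """Row key for the url_time_series CF: '<domain>+<partition_key>'."""
    return "%s+%s" % (domain, partition_key(when))

def url_payload_key(domain, when, time_uuid):
    """Row key for the url_payload CF: '<domain>+<partition_key>+<time_uuid>'."""
    return "%s+%s+%s" % (domain, partition_key(when), time_uuid)

hit_time = datetime(2011, 9, 29, tzinfo=timezone.utc)
tu = uuid.uuid1()  # a TimeUUID for this hit
print(partition_key(hit_time))                       # 201109
print(url_time_series_key("example.com", hit_time))  # example.com+201109
```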

Requests:

* store a new hit
        - work out the current partition
        - batch mutate to update domain_partitions, url_time_series and, if 
needed, url_payload  
        - also store the hit under a special "ALL" domain so you can query 
across all domains
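The write path above can be sketched like this. The dicts are in-memory stand-ins for the three CFs (with a real client such as pycassa this would be one batch mutation); the function name and structures are hypothetical:

```python
import uuid

# In-memory stand-ins for the three column families.
domain_partitions = {}   # domain -> {partition_key: ""}
url_time_series = {}     # "<domain>+<partition>" -> {time_uuid: url}
url_payload = {}         # "<domain>+<partition>+<time_uuid>" -> payload

ALL = "ALL"  # special domain aggregating every hit

def store_hit(domain, url, partition, payload=None):
    time_uuid = uuid.uuid1()
    # Write under both the real domain and the special ALL domain.
    for d in (domain, ALL):
        domain_partitions.setdefault(d, {})[partition] = ""
        row = "%s+%s" % (d, partition)
        url_time_series.setdefault(row, {})[time_uuid] = url
        if payload is not None:
            url_payload["%s+%s" % (row, time_uuid)] = payload
    return time_uuid

store_hit("example.com", "example.com/a", "201109", {"body": "..."})
```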

* get oldest / newest URL for a domain (same thing for a range)
        - get the oldest / newest column from the domain_partitions CF to find 
the partition
        - get the oldest / newest column from the url_time_series CF using that 
partition

* get the oldest / newest for ALL domains
        - do the same as above but use the all domain
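The two-step read can be sketched the same way. Again the dicts are toy stand-ins; with a real client these would be two get_slice calls with count=1, reversed for "newest" and forward for "oldest":

```python
import uuid

# Toy in-memory stand-ins for the CFs, pre-populated with two partitions.
domain_partitions = {"example.com": {"201108": "", "201109": ""}}
url_time_series = {
    "example.com+201108": {uuid.uuid1(): "example.com/old"},
    "example.com+201109": {uuid.uuid1(): "example.com/new"},
}

def newest_url(domain):
    # Step 1: newest partition the domain has data in (last column by name).
    partition = max(domain_partitions[domain])
    # Step 2: newest column in that partition, ordered by TimeUUID timestamp.
    row = url_time_series["%s+%s" % (domain, partition)]
    return row[max(row, key=lambda u: u.time)]

print(newest_url("example.com"))  # example.com/new
```

"Oldest" is the same with `min` instead of `max`, and the ALL-domains query is identical with the special "ALL" row key.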

Notes:
- I split the payload out because I was not sure when you just wanted the URL 
and when you wanted all the other data. 
- You should look at using composite types 
http://www.slideshare.net/edanuff/indexing-in-cassandra
- I've probably missed things

Hope that helps, good luck. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 29/09/2011, at 11:13 PM, Thamizh wrote:

> If the retrieval of URLs is based on TimeUUID, then Model C with 
> ByteOrderedPartitioner and a long-typed TimeUUID row key could be the right 
> choice; it lets you apply range queries based on TimeUUID.
> 
> Regards,
> Thamizhannal P
> From: M Vieira <mvfreelan...@gmail.com>
> To: user@cassandra.apache.org
> Sent: Thursday, 29 September 2011 2:54 PM
> Subject: Cassandra data modeling
> 
> 
> I'm trying to get my head around Cassandra data modeling, but I can't quite 
> see what would be the best approach to the problem I have.
> The supposed scenario: 
> You have around 100 domains, and each domain has from a few hundred to millions 
> of possible URLs (think of different combinations of GET args:  
> example.org?a=one&b=two is different from example.org?b=two&a=one)
> 
> 
> The application requirements
> - two columns storing an average of 500kb each and four (maybe six) columns 
> storing 1kb each
> - retrieve single oldest/newest URL of any single domain
> - retrieve a range of oldest/newest URLs of any single domain
> - retrieve single oldest/newest URL over all
> - retrieve a range of oldest/newest URLs over all
> - entries will be edited at least once a day (heavy read+write)
> 
> Having considered the following:
> http://wiki.apache.org/cassandra/CassandraLimitations
> http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage
> http://wiki.apache.org/cassandra/MemtableThresholds#Memtable_Thresholds
> https://issues.apache.org/jira/browse/CASSANDRA-16
> 
> 
> 
> Which of the models below would you go for, and why?
> Any input would be appreciated
> 
> 
> Model A
> Hundreds of rows (domain names as row keys) 
> holding hundreds of thousands of columns (pages within that domain)
> and each column then hold a few other columns (5 columns in this case)
> Biggest row: "example.net" ~350Gb
> Secondary index: column holding URL
> {
>    "example.com": {
>        "example.com/a": ["1", "2", "3", "4", "5"],
>        "example.com/b": ["1", "2", "3", "4", "5"],
>        "example.com/c": ["1", "2", "3", "4", "5"],
>    },
>    "example.net": {
>        "example.net/a": ["1", "2", "3", "4", "5"],
>        "example.net/b": ["1", "2", "3", "4", "5"],
>        "example.net/c": ["1", "2", "3", "4", "5"],
>    },
>    "example.org": {
>        "example.org/a": ["1", "2", "3", "4", "5"],
>        "example.org/b": ["1", "2", "3", "4", "5"],
>        "example.org/c": ["1", "2", "3", "4", "5"],
>    }
> }
> 
> 
> Model B
> Millions of rows (URLs as row keys) each holding a few other columns (6 
> columns in this case).
> Biggest row: any ~1004Kb
> Secondary index: column holding the domain name
> {
>    "example.com/a": ["1", "2", "3", "4", "5", "example.com"],
>    "example.com/b": ["1", "2", "3", "4", "5", "example.com"],
>    "example.com/c": ["1", "2", "3", "4", "5", "example.com"],
>    "example.net/a": ["1", "2", "3", "4", "5", "example.net"],
>    "example.net/b": ["1", "2", "3", "4", "5", "example.net"],
>    "example.net/c": ["1", "2", "3", "4", "5", "example.net"],
>    "example.org/a": ["1", "2", "3", "4", "5", "example.org"],
>    "example.org/b": ["1", "2", "3", "4", "5", "example.org"],
>    "example.org/c": ["1", "2", "3", "4", "5", "example.org"],
> }
> 
> 
> Model C
> Millions of rows (TimeUUID as row keys) each holding a few other columns (7 
> columns in this case).
> Biggest row: any ~1004Kb
> Secondary index: column holding the domain name & column holding URL
> {
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/a"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/b"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/c"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/a"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/b"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/c"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/a"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/b"],
>    "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/c"],
> }
> 
> //END
