Re: Cassandra data modeling

Thamizh Thu, 29 Sep 2011 03:14:12 -0700

If URLs are retrieved by TimeUUID, then Model C with ByteOrderedPartitioner 
and the row key stored as the long value of the TimeUUID can be the right 
choice, since it lets you run range queries based on TimeUUID.
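
One reading of "row key as long type of TimeUUID" is to key each row on the 
60-bit timestamp carried inside the version-1 UUID, packed as an 8-byte 
big-endian long. The sketch below is plain Python with no Cassandra client; 
the helper names timeuuid_from_timestamp and row_key are hypothetical. It 
shows why this matters under a byte-ordered partitioner: the raw 16 UUID bytes 
start with the low timestamp bits, so they do not sort chronologically, while 
the packed long does.

```python
import struct
import uuid


def timeuuid_from_timestamp(ts_100ns, node=0x123456789ABC, clock_seq=0):
    """Build a version-1 UUID from a 60-bit timestamp (hypothetical helper)."""
    time_low = ts_100ns & 0xFFFFFFFF
    time_mid = (ts_100ns >> 32) & 0xFFFF
    time_hi_version = ((ts_100ns >> 48) & 0x0FFF) | 0x1000  # version 1
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


def row_key(u):
    """Pack the UUID timestamp as a big-endian long: byte order == time order."""
    return struct.pack(">Q", u.time)


earlier = timeuuid_from_timestamp(0x0FFFFFFFF)
later = timeuuid_from_timestamp(0x100000000)

# Raw UUID bytes put the later entry first; the packed long keeps time order.
raw_order_is_chronological = earlier.bytes < later.bytes
key_order_is_chronological = row_key(earlier) < row_key(later)
```

Keys built this way collide when two entries share the same 100 ns tick, so a 
real key would probably need a tie-breaker appended.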


Regards,
Thamizhannal P


________________________________
From: M Vieira <mvfreelan...@gmail.com>
To: user@cassandra.apache.org
Sent: Thursday, 29 September 2011 2:54 PM
Subject: Cassandra data modeling



I'm trying to get my head around Cassandra data modeling, but I can't quite see 
what the best approach to my problem would be.
The supposed scenario: 
You have around 100 domains, and each domain has from a few hundred to millions 
of possible URLs (think of different combinations of GET args: 
example.org?a=one&b=two is different from example.org?b=two&a=one).


The application requirements
- two columns storing an average of 500 KB each and four (maybe six) columns 
storing 1 KB each
- retrieve single oldest/newest URL of any single domain
- retrieve a range of oldest/newest URLs of any single domain
- retrieve single oldest/newest URL across all domains
- retrieve a range of oldest/newest URLs across all domains
- entries will be edited at least once a day (heavy read+write)
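
As a rough check on the sizes quoted for the models further down, the first 
requirement can be turned into a per-URL payload estimate. This is only a 
sketch: the function names are hypothetical, column names and Cassandra's own 
storage overhead are ignored, and 1 GB is assumed to be 1,000,000 KB.

```python
LARGE_COLUMNS, LARGE_KB = 2, 500   # two columns averaging 500 KB each
SMALL_COLUMNS, SMALL_KB = 4, 1     # four (maybe six) columns of 1 KB each


def payload_kb(small_columns=SMALL_COLUMNS):
    """Approximate payload for one URL, in KB."""
    return LARGE_COLUMNS * LARGE_KB + small_columns * SMALL_KB


def urls_per_row(row_size_gb, small_columns=SMALL_COLUMNS):
    """How many URL payloads fit in a single row of the given size."""
    row_size_kb = row_size_gb * 1_000_000  # assumption: decimal units
    return row_size_kb // payload_kb(small_columns)


per_url = payload_kb()             # 2 * 500 + 4 * 1
per_url_six = payload_kb(6)        # 2 * 500 + 6 * 1
urls_in_350gb_row = urls_per_row(350)
```

With four small columns this gives 1004 KB per URL, and a 350 GB row 
corresponds to a few hundred thousand URLs in one domain.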

Having considered the following:
http://wiki.apache.org/cassandra/CassandraLimitations
http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage
http://wiki.apache.org/cassandra/MemtableThresholds#Memtable_Thresholds
https://issues.apache.org/jira/browse/CASSANDRA-16



Which of the models below would you go for, and why?
Any input would be appreciated


Model A
Hundreds of rows (domain names as row keys) 
holding hundreds of thousands of columns (pages within that domain)
and each column then holds a few other columns (5 columns in this case)
Biggest row: "example.net" ~350 GB
Secondary index: column holding URL
{
   "example.com": {
       "example.com/a": ["1", "2", "3", "4", "5"],
       "example.com/b": ["1", "2", "3", "4", "5"],
       "example.com/c": ["1", "2", "3", "4", "5"],
   },
   "example.net": {
       "example.net/a": ["1", "2", "3", "4", "5"],
       "example.net/b": ["1", "2", "3", "4", "5"],
       "example.net/c": ["1", "2", "3", "4", "5"],
   },
   "example.org": {
       "example.org/a": ["1", "2", "3", "4", "5"],
       "example.org/b": ["1", "2", "3", "4", "5"],
       "example.org/c": ["1", "2", "3", "4", "5"],
   }
}


Model B
Millions of rows (URLs as row keys) each holding a few other columns (6 columns 
in this case).
Biggest row: any ~1004 KB
Secondary index: column holding the domain name
{
   "example.com/a": ["1", "2", "3", "4", "5", "example.com"],
   "example.com/b": ["1", "2", "3", "4", "5", "example.com"],
   "example.com/c": ["1", "2", "3", "4", "5", "example.com"],
   "example.net/a": ["1", "2", "3", "4", "5", "example.net"],
   "example.net/b": ["1", "2", "3", "4", "5", "example.net"],
   "example.net/c": ["1", "2", "3", "4", "5", "example.net"],
   "example.org/a": ["1", "2", "3", "4", "5", "example.org"],
   "example.org/b": ["1", "2", "3", "4", "5", "example.org"],
   "example.org/c": ["1", "2", "3", "4", "5", "example.org"],
}
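
To make the role of the secondary index in Model B concrete, here is a small 
in-memory sketch in Python. It is not Cassandra code: the dict layout, the 
"domain" column name and build_domain_index are assumptions for illustration. 
It shows the lookup the index has to answer, namely "which row keys (URLs) 
belong to this domain?".

```python
from collections import defaultdict

# Model B: one row per URL, with the domain stored as an ordinary column.
rows = {
    "example.com/a": {"domain": "example.com"},
    "example.com/b": {"domain": "example.com"},
    "example.net/a": {"domain": "example.net"},
    "example.net/b": {"domain": "example.net"},
    "example.org/a": {"domain": "example.org"},
}


def build_domain_index(rows):
    """Map each domain value to the row keys (URLs) that carry it."""
    index = defaultdict(list)
    for url, columns in rows.items():
        index[columns["domain"]].append(url)
    return index


domain_index = build_domain_index(rows)
net_urls = sorted(domain_index["example.net"])
```

Note that the index groups URLs by domain but says nothing about age, so the 
oldest/newest requirements would still need a time-ordered key or column.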


Model C
Millions of rows (TimeUUID as row keys) each holding a few other columns (7 
columns in this case).
Biggest row: any ~1004 KB
Secondary index: column holding the domain name & column holding URL
{
   "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/a"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/b"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/c"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/a"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/b"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/c"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/a"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/b"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/c"],
}
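
The oldest/newest requirements map naturally onto Model C if the rows are kept 
in time order. The following Python sketch emulates that with a sorted list; 
integer timestamps stand in for TimeUUIDs, the sample rows are made up, and 
the function names are hypothetical. The domain filter plays the part of the 
secondary index on the domain column.

```python
import bisect

# (timestamp, domain, url), kept sorted by timestamp like time-ordered row keys.
rows = [
    (1001, "example.com", "example.com/a"),
    (1002, "example.net", "example.net/a"),
    (1003, "example.com", "example.com/b"),
    (1004, "example.org", "example.org/a"),
    (1005, "example.net", "example.net/b"),
]


def oldest(rows, n, domain=None):
    """First n URLs in time order, optionally restricted to one domain."""
    matches = [r for r in rows if domain is None or r[1] == domain]
    return [url for _, _, url in matches[:n]]


def newest(rows, n, domain=None):
    """Last n URLs in time order (newest first), optionally for one domain."""
    matches = [r for r in reversed(rows) if domain is None or r[1] == domain]
    return [url for _, _, url in matches[:n]]


def time_range(rows, start, end):
    """URLs whose timestamp lies in [start, end], found by binary search."""
    keys = [r[0] for r in rows]
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_right(keys, end)
    return [url for _, _, url in rows[lo:hi]]


newest_overall = newest(rows, 1)
newest_com = newest(rows, 2, domain="example.com")
oldest_overall = oldest(rows, 1)
middle = time_range(rows, 1002, 1004)
```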

//END
