I'm trying to get my head around Cassandra data modeling, but I can't quite
see what would be the best approach to the problem I have.
The scenario:
You have around 100 domains, and each domain has from a few hundred to millions
of possible URLs (think of different combinations of GET args:
example.org?a=one&b=two is different from example.org?b=two&a=one).


The application requirements:
- two columns storing an average of 500 KB each, and four (maybe six) columns
storing 1 KB each
- retrieve the single oldest/newest URL for any given domain
- retrieve a range of the oldest/newest URLs for any given domain
- retrieve the single oldest/newest URL across all domains
- retrieve a range of the oldest/newest URLs across all domains
- entries will be edited at least once a day (heavy read + write)
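
All of those oldest/newest reads boil down to time ordering. One common way to
get that in Cassandra is TimeUUIDs (which is what Model C below uses as row
keys): a version-1 UUID embeds its creation timestamp, so a TimeUUIDType
comparator can sort values chronologically. A quick stdlib-only sketch of that
property, just to make the ordering concrete:

import time
import uuid

# Generate a few version-1 (time-based) UUIDs a few milliseconds apart.
ids = []
for _ in range(3):
    ids.append(uuid.uuid1())
    time.sleep(0.01)

# uuid1() embeds a 60-bit timestamp (100 ns ticks since 1582-10-15), so
# sorting on that field yields chronological order, which is the ordering
# a TimeUUIDType comparator applies to column names in Cassandra.
oldest_first = sorted(ids, key=lambda u: u.time)
print("oldest:", oldest_first[0])
print("newest:", oldest_first[-1])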

I have considered the following:
http://wiki.apache.org/cassandra/CassandraLimitations
http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage
http://wiki.apache.org/cassandra/MemtableThresholds#Memtable_Thresholds
https://issues.apache.org/jira/browse/CASSANDRA-16


Which of the models below would you go for, and why?
Any input would be appreciated.


Model A
Hundreds of rows (domain names as row keys),
each holding hundreds of thousands of columns (the pages within that domain),
and each of those columns in turn holding a few sub-columns (5 in this case).
Biggest row: "example.net", ~350 GB
Secondary index: column holding the URL
{
   "example.com": {
       "example.com/a": ["1", "2", "3", "4", "5"],
       "example.com/b": ["1", "2", "3", "4", "5"],
       "example.com/c": ["1", "2", "3", "4", "5"],
   },
   "example.net": {
       "example.net/a": ["1", "2", "3", "4", "5"],
       "example.net/b": ["1", "2", "3", "4", "5"],
       "example.net/c": ["1", "2", "3", "4", "5"],
   },
   "example.org": {
       "example.org/a": ["1", "2", "3", "4", "5"],
       "example.org/b": ["1", "2", "3", "4", "5"],
       "example.org/c": ["1", "2", "3", "4", "5"],
   }
}
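
This is roughly how I picture Model A in pycassa: a super column family where
the row key is the domain, the super column name is the full URL, and the
sub-columns are the five values. Keyspace and column family names here are
made up:

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('Crawler', ['localhost:9160'])  # hypothetical keyspace
domains = ColumnFamily(pool, 'Domains')               # defined as a super CF

# One write per page: row key = domain, super column = full URL,
# sub-columns = the two ~500 KB blobs plus the small metadata columns.
domains.insert('example.com', {
    'example.com/a': {'1': '<~500 KB blob>', '2': '<~500 KB blob>',
                      '3': '1 KB', '4': '1 KB', '5': '1 KB'},
})

# Read back a slice of pages for one domain; within the row the super
# columns are ordered by the comparator (here the URL), not by age.
pages = domains.get('example.com', column_count=100)

The part that worries me is the ~350 GB "example.net" row: a single row is
never split across nodes, which is part of why I linked the limitations pages
above.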


Model B
Millions of rows (URLs as row keys), each holding a few other columns (6 in
this case).
Biggest row: any, ~1004 KB
Secondary index: column holding the domain name
{
   "example.com/a": ["1", "2", "3", "4", "5", "example.com"],
   "example.com/b": ["1", "2", "3", "4", "5", "example.com"],
   "example.com/c": ["1", "2", "3", "4", "5", "example.com"],
   "example.net/a": ["1", "2", "3", "4", "5", "example.net"],
   "example.net/b": ["1", "2", "3", "4", "5", "example.net"],
   "example.net/c": ["1", "2", "3", "4", "5", "example.net"],
   "example.org/a": ["1", "2", "3", "4", "5", "example.org"],
   "example.org/b": ["1", "2", "3", "4", "5", "example.org"],
   "example.org/c": ["1", "2", "3", "4", "5", "example.org"],
}
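
Model B would be a standard column family with one row per URL and the domain
stored as an ordinary column so it can carry the secondary index. Another rough
pycassa sketch with made-up names; the domain lookup assumes Cassandra's
built-in secondary indexes (0.7+) rather than a hand-rolled index column
family:

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.index import create_index_clause, create_index_expression

pool = ConnectionPool('Crawler', ['localhost:9160'])  # hypothetical keyspace
pages = ColumnFamily(pool, 'Pages')                   # one row per URL

# Row key = full URL; the domain is just another column so it can be indexed.
pages.insert('example.com/a', {'1': '<~500 KB blob>', '2': '<~500 KB blob>',
                               '3': '1 KB', '4': '1 KB', '5': '1 KB',
                               'domain': 'example.com'})

# A single URL is a plain key lookup.
row = pages.get('example.com/a')

# All URLs of one domain go through the index on the 'domain' column.
clause = create_index_clause([create_index_expression('domain', 'example.com')])
for url, columns in pages.get_indexed_slices(clause):
    pass  # each result is one page of example.com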


Model C
Millions of rows (TimeUUIDs as row keys), each holding a few other columns (7
in this case).
Biggest row: any, ~1004 KB
Secondary index: columns holding the domain name and the URL
{
   "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/a"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/b"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/c"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/a"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/b"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/c"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/a"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/b"],
   "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/c"],
}
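
And Model C: one row per entry, keyed by a TimeUUID, with both the domain and
the URL as indexed columns. Same made-up names; I'm sending the UUID as its raw
16 bytes just to keep the sketch simple:

import uuid

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('Crawler', ['localhost:9160'])  # hypothetical keyspace
versions = ColumnFamily(pool, 'Versions')             # one row per entry

# Row key = version-1 (time) UUID; domain and URL become ordinary columns,
# each of which would need its own secondary index for the lookups.
row_key = uuid.uuid1().bytes
versions.insert(row_key, {'1': '<~500 KB blob>', '2': '<~500 KB blob>',
                          '3': '1 KB', '4': '1 KB', '5': '1 KB',
                          'domain': 'example.com',
                          'url': 'example.com/a'})

One thing I do realise: with the default RandomPartitioner the rows are stored
in token (hash) order, so TimeUUID row keys on their own don't give me a
chronological range scan for the oldest/newest reads.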

//END
