If URLs are retrieved by TimeUUID, then Model C with ByteOrderedPartitioner and the row key stored as the long form of the TimeUUID can be the right choice, since it lets you run range queries based on TimeUUID.
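To illustrate why the long form matters, here is a minimal Python sketch, not Cassandra client code. It assumes "long type" means the 60-bit TimeUUID timestamp packed as an 8-byte big-endian integer. The helpers `timeuuid_from_timestamp`, `long_row_key` and `range_slice` are hypothetical names used only for this example. A byte-ordered partitioner compares raw key bytes, and a version-1 UUID stores the low 32 bits of its timestamp first, so raw TimeUUID bytes do not sort chronologically while the big-endian long does.

```python
import bisect
import struct
import uuid


def timeuuid_from_timestamp(ts, node=0x123456789ABC, clock_seq=0):
    """Build a version-1 UUID from a 60-bit timestamp (hypothetical helper)."""
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000  # set version 1
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


def long_row_key(u):
    """Encode the UUID timestamp as 8 big-endian bytes (assumed key format)."""
    return struct.pack(">Q", u.time)


def range_slice(sorted_keys, start, end):
    """Return keys in [start, end]; stands in for a scan over byte-ordered rows."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]


# Two timestamps one tick apart, chosen so the low 32 bits wrap around.
early = timeuuid_from_timestamp((1 << 32) - 1)
late = timeuuid_from_timestamp(1 << 32)

# Raw UUID bytes start with time_low, so byte order disagrees with time order.
raw_order = sorted([early, late], key=lambda u: u.bytes)

# The big-endian long sorts byte-wise in chronological order.
long_order = sorted([early, late], key=long_row_key)

keys = sorted(long_row_key(u) for u in (early, late))
only_early = range_slice(keys, long_row_key(early), long_row_key(early))
```

Note that two TimeUUIDs generated on different nodes can share a timestamp, so a key built from the timestamp alone may collide; a real key would need a tie-breaker.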
Regards,
Thamizhannal P

________________________________
From: M Vieira <mvfreelan...@gmail.com>
To: user@cassandra.apache.org
Sent: Thursday, 29 September 2011 2:54 PM
Subject: Cassandra data modeling

I'm trying to get my head around Cassandra data modeling, but I can't quite see what would be the best approach to the problem I have.

The supposed scenario:
You have around 100 domains; each domain has from a few hundred to millions of possible URLs (think of different combinations of GET args: example.org?a=one&b=two is different from example.org?b=two&a=one).

The application requirements:
- two columns storing an average of 500kb each and four (maybe six) columns storing 1kb each
- retrieve single oldest/newest URL of any single domain
- retrieve a range of oldest/newest URLs of any single domain
- retrieve single oldest/newest URL over all
- retrieve a range of oldest/newest URLs over all
- entries will be edited at least once a day (heavy read+write)

Having considered the following:
http://wiki.apache.org/cassandra/CassandraLimitations
http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage
http://wiki.apache.org/cassandra/MemtableThresholds#Memtable_Thresholds
https://issues.apache.org/jira/browse/CASSANDRA-16

Which of the models below would you go for, and why?
Any input would be appreciated.

Model A
Hundreds of rows (domain names as row keys) holding hundreds of thousands of columns (pages within that domain), and each column then holds a few other columns (5 columns in this case).
Biggest row: "example.net" ~350Gb
Secondary index: column holding URL
{
    "example.com": {
        "example.com/a": ["1", "2", "3", "4", "5"],
        "example.com/b": ["1", "2", "3", "4", "5"],
        "example.com/c": ["1", "2", "3", "4", "5"],
    },
    "example.net": {
        "example.net/a": ["1", "2", "3", "4", "5"],
        "example.net/b": ["1", "2", "3", "4", "5"],
        "example.net/c": ["1", "2", "3", "4", "5"],
    },
    "example.org": {
        "example.org/a": ["1", "2", "3", "4", "5"],
        "example.org/b": ["1", "2", "3", "4", "5"],
        "example.org/c": ["1", "2", "3", "4", "5"],
    }
}

Model B
Millions of rows (URLs as row keys) each holding a few other columns (6 columns in this case).
Biggest row: any ~1004Kb
Secondary index: column holding the domain name
{
    "example.com/a": ["1", "2", "3", "4", "5", "example.com"],
    "example.com/b": ["1", "2", "3", "4", "5", "example.com"],
    "example.com/c": ["1", "2", "3", "4", "5", "example.com"],
    "example.net/a": ["1", "2", "3", "4", "5", "example.net"],
    "example.net/b": ["1", "2", "3", "4", "5", "example.net"],
    "example.net/c": ["1", "2", "3", "4", "5", "example.net"],
    "example.org/a": ["1", "2", "3", "4", "5", "example.org"],
    "example.org/b": ["1", "2", "3", "4", "5", "example.org"],
    "example.org/c": ["1", "2", "3", "4", "5", "example.org"],
}

Model C
Millions of rows (TimeUUID as row keys) each holding a few other columns (7 columns in this case).
Biggest row: any ~1004Kb
Secondary index: column holding the domain name & column holding URL
{
    "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/a"],
    "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/b"],
    "TimeUUID": ["1", "2", "3", "4", "5", "example.com", "example.com/c"],
    "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/a"],
    "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/b"],
    "TimeUUID": ["1", "2", "3", "4", "5", "example.net", "example.net/c"],
    "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/a"],
    "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/b"],
    "TimeUUID": ["1", "2", "3", "4", "5", "example.org", "example.org/c"],
}
//END
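
As a rough sanity check on the row sizes quoted in the models, the arithmetic can be sketched in Python. This assumes "Kb" and "Gb" mean kilobytes and gigabytes in binary units, uses the four-column case rather than six, and ignores column names, timestamps and other storage overhead, so the result is only an order-of-magnitude estimate.

```python
KB_PER_GB = 1024 * 1024  # assumption: binary units

# Per-URL payload from the stated requirements.
large_columns_kb = 2 * 500   # two columns averaging 500kb each
small_columns_kb = 4 * 1     # four columns of 1kb each
payload_per_url_kb = large_columns_kb + small_columns_kb  # 1004, as in Models B and C

# Model A: how many URLs of that size fit in the ~350Gb "example.net" row.
biggest_row_kb = 350 * KB_PER_GB
urls_in_biggest_row = biggest_row_kb // payload_per_url_kb

print(payload_per_url_kb)
print(urls_in_biggest_row)
```

Under these assumptions the biggest Model A row corresponds to a few hundred thousand URLs, which matches the "hundreds of thousands of columns" described for that model.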