Hi Aaron,

Thanks for the quick answer. I'll build a prototype to benchmark each approach next week.
Here are more questions based on your reply:

a) "These queries are not easily supported on standard Cassandra"

select * from book where price < 992 order by price descending limit 30;

From my understanding, this is a typical timeline (time series) query, which is well supported by Cassandra. I sketched the model I have in mind in the first p.s. below.

b) "You do not need a different CF for each custom secondary index. Try putting the name of the index in the row key."

I couldn't understand this. Can you help build a demo with the CF structure and some sample data? My guess at what you mean is sketched in the second p.s. below.

Thanks,
Charlie | DBA developer
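p.s. To make question a) concrete, here is the timeline model I have in mind, so we are sure we are talking about the same thing. All CF names and sample data below are made up by me for illustration; prices are stored as zero-padded strings so they sort correctly as UTF8:

create column family book_by_price
    with comparator = UTF8Type
    and key_validation_class = UTF8Type;

Sample row (column name = zero-padded price + ":" + book_id, column value = the book data):

'prices': ('0000789:book_17' -> book_json), ('0000992:book_03' -> book_json), ...

Then "where price < 992 order by price descending limit 30" becomes a reversed column slice on the 'prices' row: start at '0000992', reversed = true, count = 30. That is the pattern I understood from the DataStax time series post cited at the bottom of this thread.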
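p.p.s. For question b), here is my guess at what you mean, so you can correct me: a single index CF holding all 20 indexes, with the index name in the row key, instead of 20 separate index CFs. Again, all names and sample data are mine:

create column family book_index
    with comparator = UTF8Type
    and key_validation_class = UTF8Type;

Sample rows (row key = index name; column name = indexed value + ":" + book_id, column value left empty):

'price' : ('0000978:book_11' -> ''), ('0000992:book_03' -> ''), ...
'isbn'  : ('ISBN_01:book_03' -> ''), ('ISBN_02:book_11' -> ''), ...
'col_n1': ('0000789:book_17' -> ''), ...

If that is right, each query is a reversed column slice on one row (e.g. row 'price', start '0000992', count 30), followed by a multiget of the 30 base book rows by book_id. And when a book's price changes, we would delete the old column and insert a new one in the 'price' row - one delete plus one insert per changed index - which is exactly what I plan to benchmark in the prototype.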
On Sun, Apr 8, 2012 at 2:30 PM, aaron morton <aa...@thelastpickle.com> wrote:
> We need to query data by each column, do pagination as below,
>
> select * from book where isbn < "XYZ" order by ISBN descending limit 30;
> select * from book where price < 992 order by price descending limit 30;
> select * from book where col_n1 < 789 order by col_n1 descending limit 30;
> select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30;
> ...
> select * from book where col_nm < 978 order by col_nm descending limit 30;
>
> These queries are not easily supported on standard Cassandra. If you need
> this level of query complexity consider DataStax Enterprise, Solr, or an
> RDBMS.
>
> If we choose the Materialized Views approach, we have to update all
> 20 Materialized View column families for each base row update.
> Will the Cassandra write performance be acceptable?
>
> Yes, depending on the size of the cluster and the machine spec.
>
> It's often a good idea to design CFs to match the workloads. If you have
> some data that changes faster than the rest, consider splitting it into
> different CFs.
>
> Should we just normalize the data, create a base book table with book_id
> as the primary key, and then build 20 index column families, using the
> wide-row column-slicing approach, with the index column value as the
> column name and book_id as the value?
>
> You do not need a different CF for each custom secondary index. Try putting
> the name of the index in the row key.
>
> What will you recommend?
>
> Take another look at the queries you *need* to support. Then build a small
> proof of concept to see if Cassandra will work for you.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 6/04/2012, at 6:46 AM, Data Craftsman wrote:
>
> Howdy,
>
> Can I ask a data model question here?
>
> We have a book table with 20 columns and 300 million rows; the average
> row size is 1500 bytes.
>
> create table book(
>     book_id,
>     isbn,
>     price,
>     author,
>     title,
>     ...
>     col_n1,
>     col_n2,
>     ...
>     col_nm
> );
>
> Data usage:
>
> We need to query data by each column, do pagination as below,
>
> select * from book where isbn < "XYZ" order by ISBN descending limit 30;
> select * from book where price < 992 order by price descending limit 30;
> select * from book where col_n1 < 789 order by col_n1 descending limit 30;
> select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30;
> ...
> select * from book where col_nm < 978 order by col_nm descending limit 30;
>
> Write: 100 million updates a day.
> Read:  16 million queries a day (about 200 queries per second); one query
> returns 30 rows.
>
> ***
> Materialized Views approach
>
> {"ISBN_01",book_object1},{"ISBN_02",book_object2},...,{"ISBN_N",book_objectN}
> ...
> We will end up with 20 timelines.
>
> ***
> Index approach - create a 2nd Column Family as the index
>
> 'ISBN_01': 'book_id_a01','book_id_a02',...,'book_id_aN'
> 'ISBN_02': 'book_id_b01','book_id_b02',...,'book_id_bN'
> ...
> 'ISBN_0m': 'book_id_m01','book_id_m02',...,'book_id_mN'
>
> This way, we will create 20 index Column Families.
>
> ---
>
> If we choose the Materialized Views approach, we have to update all
> 20 Materialized View column families for each base row update.
> Will the Cassandra write performance be acceptable?
>
> Redis recommends building an index for the query on each column; that
> is your 1st strategy - create a 2nd index CF:
> http://redis.io/topics/data-types-intro
> (see the section [Pushing IDs instead of the actual data in Redis lists])
>
> Should we just normalize the data, create a base book table with book_id
> as the primary key, and then build 20 index column families, using the
> wide-row column-slicing approach, with the index column value as the
> column name and book_id as the value?
> This way, we only need to update the few affected column families whose
> column values changed, not all 20 Materialized View CFs.
>
> Another option would be using Redis to store the master book data, and
> Cassandra Column Families to maintain the 2nd index.
>
> What will you recommend?
>
> Charlie (@mujiang) 一个 木匠
> =======
> Data Architect Developer
> http://mujiang.blogspot.com
>
> p.s.
>
> Gist from the DataStax dev blog
> (http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra):
> "
> If the same event is tracked in multiple timelines,
> it's okay to denormalize and store all of the event data in each of
> those timelines.
> One of the main principles that Cassandra was built on is that disk
> space is a very cheap resource;
> minimizing disk seeks at the cost of higher space consumption is a
> good tradeoff.
> Unless the data for each event is *very large*, I always prefer this
> strategy over the index strategy.
> "
>
> Will a 1500-byte row size be large or small for Cassandra, from your
> understanding?

--
Thanks,
Charlie (@mujiang) 一个 木匠
=======
Data Architect Developer
http://mujiang.blogspot.com