> a) "These queries are not easily supported on standard Cassandra" > select * from book where price < 992 order by price descending limit 30; > > This is a typical (time series data)timeline query well supported by > Cassandra, from my understanding. Queries that use a secondary index (on price) must include an equality operator.
> > b) "You do not need a different CF for each custom secondary index. > Try putting the name of the index in the row key. " > > I couldn't understand it. Can you help to build an demo with CF > structure and some sample data? You can have one CF that contains multiple secondary indexes. key: col_1:value_1 col_name: entity_id_1 key: col_2:value_2 col_name: entity_id_1 Cheers ----------------- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 11/04/2012, at 7:24 AM, Data Craftsman wrote: > Hi Aaron, > > Thanks for the quick answer, I'll build a prototype to benchmark each > approach next week. > > Here are more questions based on your reply: > > a) "These queries are not easily supported on standard Cassandra" > select * from book where price < 992 order by price descending limit 30; > > This is a typical (time series data)timeline query well supported by > Cassandra, from my understanding. > > b) "You do not need a different CF for each custom secondary index. > Try putting the name of the index in the row key. " > > I couldn't understand it. Can you help to build an demo with CF > structure and some sample data? > > Thanks, > Charlie | DBA developer > > > > On Sun, Apr 8, 2012 at 2:30 PM, aaron morton <aa...@thelastpickle.com> wrote: >> We need to query data by each column, do pagination as below, >> >> select * from book where isbn < "XYZ" order by ISBN descending limit 30; >> select * from book where price < 992 order by price descending limit 30; >> select * from book where col_n1 < 789 order by col_n1 descending limit 30; >> select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30; >> ... >> select * from book where col_nm < 978 order by col_nm descending limit 30; >> >> These queries are not easily supported on standard Cassandra. If you need >> this level of query complexity consider Data Stax Enterprise, Solr, or a >> RDBMS. >> >> If we choose Materialized Views approach, we have to update all >> 20 Materialized View column family(s), for each base row update. >> Will the Cassandra write performance acceptable? >> >> Yes, depending on the size of the cluster and the machine spec. >> >> It's often a good idea to design CF's to match the workloads. If you have >> some data that changes faster than other, consider splitting them into >> different CFs. >> >> Should we just normalize the data, create base book table with book_id >> as primary key, and then >> build 20 index column family(s), use wide row column slicing approach, >> with index column data value as column name and book_id as value? >> >> You do not need a different CF for each custom secondary index. Try putting >> the name of the index in the row key. >> >> What will you recommend? >> >> Take another look at the queries you *need* to support. Then build a small >> proof of concept to see if Cassandra will work for you. >> >> Hope that helps. >> >> ----------------- >> Aaron Morton >> Freelance Developer >> @aaronmorton >> http://www.thelastpickle.com >> >> On 6/04/2012, at 6:46 AM, Data Craftsman wrote: >> >> Howdy, >> >> Can I ask a data model question here? >> >> We have a book table with 20 columns, 300 million rows, average row >> size is 1500 bytes. >> >> create table book( >> book_id, >> isbn, >> price, >> author, >> titile, >> ... >> col_n1, >> col_n2, >> ... 
>>   col_nm
>> );
>>
>> Data usage:
>>
>> We need to query data by each column, do pagination as below,
>>
>> select * from book where isbn < "XYZ" order by ISBN descending limit 30;
>> select * from book where price < 992 order by price descending limit 30;
>> select * from book where col_n1 < 789 order by col_n1 descending limit 30;
>> select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30;
>> ...
>> select * from book where col_nm < 978 order by col_nm descending limit 30;
>>
>> Write: 100 million updates a day.
>> Read: 16 million queries a day. 200 queries per second, one query
>> returns 30 rows.
>>
>> ***
>> Materialized Views approach
>>
>> {"ISBN_01",book_object1},{"ISBN_02",book_object2},...,{"ISBN_N",book_objectN}
>> ...
>> We will end up with 20 timelines.
>>
>>
>> ***
>> Index approach - create 2nd Column Family as Index
>>
>> 'ISBN_01': 'book_id_a01','book_id_a02',...,'book_id_aN'
>> 'ISBN_02': 'book_id_b01','book_id_b02',...,'book_id_bN'
>> ...
>> 'ISBN_0m': 'book_id_m01','book_id_m02',...,'book_id_mN'
>>
>> This way, we will create 20 index Column Family(s).
>>
>> ---
>>
>> If we choose the Materialized Views approach, we have to update all
>> 20 Materialized View column family(s) for each base row update.
>> Will the Cassandra write performance be acceptable?
>>
>> Redis recommends building an index for the query on each column; that
>> is your 1st strategy - create a 2nd index CF:
>> http://redis.io/topics/data-types-intro
>> (see the section [ Pushing IDs instead of the actual data in Redis lists ])
>>
>> Should we just normalize the data, create a base book table with book_id
>> as primary key, and then
>> build 20 index column family(s), using the wide row column slicing approach,
>> with the index column data value as the column name and book_id as the value?
>> This way, we only need to update the affected column family whose
>> column value changed, not all 20 Materialized Views CF(s).
>>
>> Another option would be using Redis to store the master book data, and using
>> a Cassandra Column Family to maintain the 2nd index.
>>
>> What will you recommend?
>>
>> Charlie (@mujiang) 一个 木匠
>> =======
>> Data Architect Developer
>> http://mujiang.blogspot.com
>>
>>
>> p.s.
>>
>> Gist from the datastax dev blog (
>> http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra )
>> "
>> If the same event is tracked in multiple timelines,
>> it's okay to denormalize and store all of the event data in each of
>> those timelines.
>> One of the main principles that Cassandra was built on is that disk
>> space is a very cheap resource;
>> minimizing disk seeks at the cost of higher space consumption is a
>> good tradeoff.
>> Unless the data for each event is ^very large^, I always prefer this
>> strategy over the index strategy.
>> "
>>
>> Will a 1500 byte row size be large or small for Cassandra, from your
>> understanding?
>>
>>
>
>
>
> --
> --
> Thanks,
>
> Charlie (@mujiang) 一个 木匠
> =======
> Data Architect Developer
> http://mujiang.blogspot.com
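For reference, a rough sketch of the "one CF with the index name in the row key" layout described in answer (b) above, written here in later CQL 3 terms; the table name, text types, and sample values are illustrative assumptions, not something from the thread:

-- One column family holds every custom index; the row key carries both
-- the indexed column's name and its value, and each column under that
-- key is the id of a matching book.
create table book_index (
    index_key text,   -- '<column_name>:<value>', e.g. 'price:992' or 'isbn:XYZ'
    book_id   text,   -- one entry per book with that value
    primary key (index_key, book_id)
);

-- Sample data: book_123 is indexed under its price and its isbn in the
-- same column family; book_456 shares the price entry.
insert into book_index (index_key, book_id) values ('price:992', 'book_123');
insert into book_index (index_key, book_id) values ('isbn:XYZ',  'book_123');
insert into book_index (index_key, book_id) values ('price:992', 'book_456');

-- Look up every book priced at 992 with one wide-row read:
select book_id from book_index where index_key = 'price:992';

Each distinct (column, value) pair becomes one wide row, so adding another custom index means adding new row keys rather than a new CF.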