Hi Aaron,

Thanks for the quick answer. I'll build a prototype to benchmark each
approach next week.

Here are more questions based on your reply:

a) "These queries are not easily supported on standard Cassandra"
select * from book where price  < 992   order by price descending limit 30;

From my understanding, this is a typical time-series (timeline) query
that Cassandra supports well.
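For reference, here is how I currently picture that query mapping onto a wide row: columns named by a (price, book_id) composite, kept sorted by Cassandra, so "price < 992 ... limit 30" becomes a reversed column slice. A minimal Python stand-in for that column layout (the names and bucketing are my own assumptions, not real Cassandra API):

```python
# Model one wide row whose column names are (price, book_id) composites.
# Cassandra stores columns sorted by name, so a range query turns into a
# column slice; this only simulates the layout, not Cassandra itself.
import bisect

# Composite column names, kept sorted as Cassandra would keep them.
columns = sorted((price, "book_%04d" % i)
                 for i, price in enumerate([100, 250, 991, 992, 1200, 50, 700]))

def slice_reversed(cols, upper, limit):
    """Up to `limit` columns with price < upper, highest price first."""
    # bisect_left on (upper,) finds the first column with price >= upper,
    # since a 1-tuple sorts before any (upper, book_id) composite.
    end = bisect.bisect_left(cols, (upper,))
    return list(reversed(cols[:end]))[:limit]

result = slice_reversed(columns, 992, 30)
print(result)  # prices strictly below 992, descending
```

(A real table would also need some row-key bucketing so one row doesn't hold all 300M books, but that's a separate question.)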

b) "You do not need a different CF for each custom secondary index.
Try putting the name of the index in the row key. "

I couldn't quite follow this. Could you help build a demo with the CF
structure and some sample data?
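In the meantime, here is my guess at what you mean, so you can correct me: a single index CF where the row key is the index name (e.g. 'price', 'isbn'), and the columns are (indexed_value, book_id) composites sorted by value. A rough Python model of that guess (dict of sorted lists; the function names are mine):

```python
# Guess at "put the name of the index in the row key": one index CF whose
# rows are keyed by index name, with (value, book_id) composite columns.
# Plain dict-of-sorted-lists stand-in; not real Cassandra code.
from collections import defaultdict
import bisect

index_cf = defaultdict(list)  # row key (index name) -> sorted (value, book_id)

def index_put(index_name, value, book_id):
    # Insert keeping the row's columns sorted, as Cassandra would.
    bisect.insort(index_cf[index_name], (value, book_id))

def index_query_lt(index_name, upper, limit):
    """Columns with value < upper, descending, up to `limit`."""
    row = index_cf[index_name]
    end = bisect.bisect_left(row, (upper,))  # first column with value >= upper
    return list(reversed(row[:end]))[:limit]

# All 20 "indexes" would live in this one CF, one row each:
index_put("price", 992, "book_1")
index_put("price", 450, "book_2")
index_put("isbn", "XYA", "book_1")
index_put("isbn", "ABC", "book_2")

print(index_query_lt("price", 992, 30))   # [(450, 'book_2')]
print(index_query_lt("isbn", "XYZ", 30))  # both books, descending by ISBN
```

If this is the layout you meant, my worry would be that each index row gets very wide (one column per book); is that where bucketing the row key comes in?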

Thanks,
Charlie | DBA developer



On Sun, Apr 8, 2012 at 2:30 PM, aaron morton <aa...@thelastpickle.com> wrote:
> We need to query data by each column, do pagination as below,
>
> select * from book where isbn   < "XYZ" order by ISBN   descending limit 30;
> select * from book where price  < 992   order by price  descending limit 30;
> select * from book where col_n1 < 789   order by col_n1 descending limit 30;
> select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30;
> ...
> select * from book where col_nm < 978 order by col_nm descending limit 30;
>
> These queries are not easily supported on standard Cassandra. If you need
> this level of query complexity consider Data Stax Enterprise, Solr, or a
> RDBMS.
>
> If we choose Materialized Views approach, we have to update all
> 20 Materialized View column family(s), for each base row update.
> Will the Cassandra write performance be acceptable?
>
> Yes, depending on the size of the cluster and the machine spec.
>
> It's often a good idea to design CF's to match the workloads. If you have
> some data that changes faster than other, consider splitting them into
> different CFs.
>
> Should we just normalize the data, create base book table with book_id
> as primary key, and then
> build 20 index column family(s), use wide row column slicing approach,
> with index column data value as column name and book_id as value?
>
> You do not need a different CF for each custom secondary index. Try putting
> the name of the index in the row key.
>
> What will you recommend?
>
> Take another look at the queries you *need* to support. Then build a small
> proof of concept to see if Cassandra will work for you.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 6/04/2012, at 6:46 AM, Data Craftsman wrote:
>
> Howdy,
>
> Can I ask a data model question here?
>
> We have a book table with 20 columns, 300 million rows, average row
> size is 1500 bytes.
>
> create table book(
> book_id,
> isbn,
> price,
> author,
> title,
> ...
> col_n1,
> col_n2,
> ...
> col_nm
> );
>
> Data usage:
>
> We need to query data by each column, do pagination as below,
>
> select * from book where isbn   < "XYZ" order by ISBN   descending limit 30;
> select * from book where price  < 992   order by price  descending limit 30;
> select * from book where col_n1 < 789   order by col_n1 descending limit 30;
> select * from book where col_n2 < "MUJ" order by col_n2 descending limit 30;
> ...
> select * from book where col_nm < 978 order by col_nm descending limit 30;
>
> Write: 100 million updates a day.
> Read : 16  million queries a day. 200 queries per second, one query
> returns 30 rows.
>
> ***
> Materialized Views approach
>
> {"ISBN_01",book_object1},{"ISBN_02",book_object2},...,{"ISBN_N",book_objectN}
> ...
> We will end up with 20 timelines.
>
>
> ***
> Index approach - create 2nd Column Family as Index
>
> 'ISBN_01': 'book_id_a01','book_id_a02',...,'book_id_aN'
> 'ISBN_02': 'book_id_b01','book_id_b02',...,'book_id_bN'
> ...
> 'ISBN_0m': 'book_id_m01','book_id_m02',...,'book_id_mN'
>
> This way, we will create 20 index Column Family(s).
>
> ---
>
> If we choose Materialized Views approach, we have to update all
> 20 Materialized View column family(s), for each base row update.
> Will the Cassandra write performance be acceptable?
>
> Redis recommends building an index for the query on each column, that
> is your 1st strategy - create 2nd index CF:
> http://redis.io/topics/data-types-intro
> (see section [ Pushing IDs instead of the actual data in Redis lists ])
>
> Should we just normalize the data, create base book table with book_id
> as primary key, and then
> build 20 index column family(s), use wide row column slicing approach,
> with index column data value as column name and book_id as value?
> This way, we only need to update the affected index column families
> whose column values changed, not all 20 Materialized View CF(s).
>
> Another option would be using Redis to store master book data, using
> Cassandra Column Family to maintain 2nd index.
>
> What will you recommend?
>
> Charlie (@mujiang) 一个 木匠
> =======
> Data Architect Developer
> http://mujiang.blogspot.com
>
>
> p.s.
>
> Gist from datastax dev blog (
> http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra )
> "
> If the same event is tracked in multiple timelines,
> it's okay to denormalize and store all of the event data in each of
> those timelines.
> One of the main principles that Cassandra was built on is that disk
> space is very cheap resource;
> minimizing disk seeks at the cost of higher space consumption is a
> good tradeoff.
> Unless the data for each event is ^very large^, I always prefer this
> strategy over the index strategy.
> "
>
> Will 1500 bytes row size be large or small for Cassandra from your
> understanding?
>
>



-- 
Thanks,

Charlie (@mujiang) 一个 木匠
=======
Data Architect Developer
http://mujiang.blogspot.com
