Hi Apoorva, Do you always query on studentID only or do you need to query on both studentID and subjectID?
Also, I think using the latest driver (2.x) can make querying large number of rows efficient. http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0 On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav <apoorva.gau...@myntra.com>wrote: > Hello Shrikar, > > Yes primary key is (studentID, subjectID). I had dropped the test table, > recreating and populating it post which will share the cfhistogram. In such > case is there any practical limit on the rows I should fetch, for e.g. > should I do > select * form marks_table where studentID = ? limit 500; > instead of doing > select * form marks_table where studentID = ?; > > > On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak <shrika...@gmail.com>wrote: > >> Hi Apoorva, >> >> I assume this is the table with studentId and subjectId as primary keys >> and not other like like marks in that. >> >> create table marks_table(studentId int, subjectId int, marks int, PRIMARY >> KEY(studentId,subjectId)); >> >> Also could you give the cfhistogram stats? >> >> nodetool cfhistograms <your keyspace> marks_table; >> >> >> >> Thanks, >> Shrikar >> >> >> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav < >> apoorva.gau...@myntra.com> wrote: >> >>> Hello All, >>> >>> We've a schema which can be modeled as (studentID, subjectID, marks) >>> where combination of studentID and subjectID is unique. Number of studentID >>> can go up to 100 million and for each studentID we can have up to 10k >>> subjectIDs. >>> >>> We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We >>> are using a four node cluster, each having 24 cores and 32GB memory. I'm >>> sure that the machines are not underperformant as on same test bed we've >>> consistently received <5ms response times for ~1b documents when queried >>> via primary key. >>> >>> I've tried three approaches, all of which result in significant >>> deterioration (>500 ms response time) in read query performance once number >>> of subjectIDs goes past ~100 for a studentID. Approaches are :- >>> >>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int, >>> int>) and query by subjectID >>> >>> 2. model as (studentID int, subjectID int, marks int, PRIMARY >>> KEY(studentID, subjectID) and query as select * from marks_table where >>> studentID = ? >>> >>> 3. model as (studentID int, subjectID int, marks int, PRIMARY >>> KEY(studentID, subjectID) and query as select * from marks_table where >>> studentID = ? and subjectID in (?, ?, ?....?) number of subjectIDs in >>> query being ~1K. >>> >>> What can be the bottlenecks. Is it better if we model as (studentID int, >>> subjct_marks_json text) and query by studentID. >>> >>> -- >>> Thanks & Regards, >>> Apoorva >>> >> >> > > > -- > Thanks & Regards, > Apoorva > -- Sourabh Agrawal Bangalore +91 9945657973