Hi Apoorva, I assume this is the table with studentId and subjectId as primary keys and not other like like marks in that.
create table marks_table(studentId int, subjectId int, marks int, PRIMARY KEY(studentId,subjectId)); Also could you give the cfhistogram stats? nodetool cfhistograms <your keyspace> marks_table; Thanks, Shrikar On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <apoorva.gau...@myntra.com>wrote: > Hello All, > > We've a schema which can be modeled as (studentID, subjectID, marks) where > combination of studentID and subjectID is unique. Number of studentID can > go up to 100 million and for each studentID we can have up to 10k > subjectIDs. > > We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We are > using a four node cluster, each having 24 cores and 32GB memory. I'm sure > that the machines are not underperformant as on same test bed we've > consistently received <5ms response times for ~1b documents when queried > via primary key. > > I've tried three approaches, all of which result in significant > deterioration (>500 ms response time) in read query performance once number > of subjectIDs goes past ~100 for a studentID. Approaches are :- > > 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int, int>) > and query by subjectID > > 2. model as (studentID int, subjectID int, marks int, PRIMARY > KEY(studentID, subjectID) and query as select * from marks_table where > studentID = ? > > 3. model as (studentID int, subjectID int, marks int, PRIMARY > KEY(studentID, subjectID) and query as select * from marks_table where > studentID = ? and subjectID in (?, ?, ?....?) number of subjectIDs in > query being ~1K. > > What can be the bottlenecks. Is it better if we model as (studentID int, > subjct_marks_json text) and query by studentID. > > -- > Thanks & Regards, > Apoorva >