Hello All,

We have a schema that can be modeled as (studentID, subjectID, marks), where the combination of studentID and subjectID is unique. The number of studentIDs can go up to 100 million, and for each studentID we can have up to 10K subjectIDs.
We are using Apache Cassandra 2.0.4 and DataStax Java driver 1.0.4 on a four-node cluster, each node having 24 cores and 32 GB of memory. I am confident the machines themselves are not underperforming: on the same test bed we have consistently seen <5 ms response times for ~1 billion documents when queried by primary key.

I have tried three approaches, all of which show significant deterioration in read query performance (>500 ms response times) once the number of subjectIDs for a studentID goes past ~100. The approaches are:

1. Model as (studentID int PRIMARY KEY, subjectID_marks_map map<int, int>) and query by subjectID.
2. Model as (studentID int, subjectID int, marks int, PRIMARY KEY (studentID, subjectID)) and query as: select * from marks_table where studentID = ?
3. Model as (studentID int, subjectID int, marks int, PRIMARY KEY (studentID, subjectID)) and query as: select * from marks_table where studentID = ? and subjectID in (?, ?, ?, ..., ?), with ~1K subjectIDs in the query.

What could the bottlenecks be? Would it be better to model the table as (studentID int, subject_marks_json text) and query by studentID?

--
Thanks & Regards,
Apoorva
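For reference, here is a sketch in CQL of the wide-row model from approaches 2 and 3; the keyspace setup is omitted and the table name marks_table is taken from the queries above, but the exact DDL is my reconstruction, not the original schema:

```sql
-- Wide-row model: one partition per studentID, with one clustering
-- row per subjectID inside that partition.
CREATE TABLE marks_table (
    studentID int,
    subjectID int,
    marks     int,
    PRIMARY KEY (studentID, subjectID)
);

-- Approach 2: read the entire partition for one student.
SELECT * FROM marks_table WHERE studentID = ?;

-- Approach 3: read a subset of rows from the partition
-- (in practice ~1K subjectIDs are bound into the IN list).
SELECT * FROM marks_table WHERE studentID = ? AND subjectID IN (?, ?, ?);
```

With this layout, a query by studentID scans one partition, so partition size and the number of clustering rows read per request are the main variables to examine when latency grows with the subjectID count.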