Hello All,

We've a schema which can be modeled as (studentID, subjectID, marks) where
combination of studentID and subjectID is unique. Number of studentID can
go up to 100 million and for each studentID we can have up to  10k
subjectIDs.

We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We are
using a four node cluster, each having 24 cores and 32GB memory. I'm sure
that the machines are not underperformant as on same test bed we've
consistently received <5ms response times for ~1b documents when queried
via primary key.

I've tried three approaches, all of which result in significant
deterioration (>500 ms response time) in read query performance once number
of subjectIDs goes past ~100 for a studentID. Approaches are :-

1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int, int>)
and query by subjectID

2. model as (studentID int, subjectID int, marks int, PRIMARY
KEY(studentID, subjectID) and query as select * from marks_table where
studentID = ?

3. model as (studentID int, subjectID int, marks int, PRIMARY
KEY(studentID, subjectID) and query as select * from marks_table where
studentID = ? and subjectID in (?, ?, ?....?)  number of subjectIDs in
query being ~1K.

What can be the bottlenecks. Is it better if we model as (studentID int,
subjct_marks_json text) and query by studentID.

-- 
Thanks & Regards,
Apoorva

Reply via email to