Re: Read performance in map data type

Sourabh Agrawal Sat, 29 Mar 2014 03:01:31 -0700

Hi Apoorva,

Do you always query on studentID only or do you need to query on both
studentID and subjectID?


Also, I think using the latest driver (2.x) can make querying large number
of rows efficient.
http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0




On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav
<apoorva.gau...@myntra.com>wrote:

> Hello Shrikar,
>
> Yes primary key is (studentID, subjectID). I had dropped the test table,
> recreating and populating it post which will share the cfhistogram. In such
> case is there any practical limit on the rows I should fetch, for e.g.
> should I do
>        select * form marks_table where studentID = ? limit 500;
> instead of doing
>        select * form marks_table where studentID = ?;
>
>
> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak <shrika...@gmail.com>wrote:
>
>> Hi Apoorva,
>>
>> I assume this is the table with studentId and subjectId  as primary keys
>> and not other like like marks in that.
>>
>> create table marks_table(studentId int, subjectId int, marks int, PRIMARY
>> KEY(studentId,subjectId));
>>
>> Also could you give the cfhistogram stats?
>>
>> nodetool cfhistograms <your keyspace> marks_table;
>>
>>
>>
>> Thanks,
>> Shrikar
>>
>>
>> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <
>> apoorva.gau...@myntra.com> wrote:
>>
>>> Hello All,
>>>
>>> We've a schema which can be modeled as (studentID, subjectID, marks)
>>> where combination of studentID and subjectID is unique. Number of studentID
>>> can go up to 100 million and for each studentID we can have up to  10k
>>> subjectIDs.
>>>
>>> We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We
>>> are using a four node cluster, each having 24 cores and 32GB memory. I'm
>>> sure that the machines are not underperformant as on same test bed we've
>>> consistently received <5ms response times for ~1b documents when queried
>>> via primary key.
>>>
>>> I've tried three approaches, all of which result in significant
>>> deterioration (>500 ms response time) in read query performance once number
>>> of subjectIDs goes past ~100 for a studentID. Approaches are :-
>>>
>>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int,
>>> int>) and query by subjectID
>>>
>>> 2. model as (studentID int, subjectID int, marks int, PRIMARY
>>> KEY(studentID, subjectID) and query as select * from marks_table where
>>> studentID = ?
>>>
>>> 3. model as (studentID int, subjectID int, marks int, PRIMARY
>>> KEY(studentID, subjectID) and query as select * from marks_table where
>>> studentID = ? and subjectID in (?, ?, ?....?)  number of subjectIDs in
>>> query being ~1K.
>>>
>>> What can be the bottlenecks. Is it better if we model as (studentID int,
>>> subjct_marks_json text) and query by studentID.
>>>
>>> --
>>> Thanks & Regards,
>>> Apoorva
>>>
>>
>>
>
>
> --
> Thanks & Regards,
> Apoorva
>



-- 
Sourabh Agrawal
Bangalore
+91 9945657973

Re: Read performance in map data type

Reply via email to