Re: Read performance in map data type

Sourabh Agrawal Sat, 29 Mar 2014 23:36:07 -0700

Hi,

I don't think there is a problem with the driver.


Regarding the schema, you may want to choose between wide rows and skinny
rows.
http://stackoverflow.com/questions/19039123/cassandra-wide-vs-skinny-rows-for-large-columns
http://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html

When you do "studentID int, subjectID int, marks int, PRIMARY
KEY(studentID, subjectID)"  you are partitioning data by studentID (wide
row pattern). So, there will be one row for each studentID. So, all reads
for a studentID will go to a single node.
http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_compound_keys_c.html

If you do "studentID int, subjectID int, marks int, PRIMARY KEY((studentID,
subjectID))" you are partitioning data by a composite value of both
columns. So read requests will be distributed. But now you can not query
with only studentID.
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/create_table_r.html#reference_ds_v3f_vfk_xj__compositPart

*You should not use a map because it has various restrictions* :
http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddlWhenCollections.html



On Sat, Mar 29, 2014 at 5:13 PM, Apoorva Gaurav
<apoorva.gau...@myntra.com>wrote:

> Hello Sourabh,
>
> I'd prefer to do query like select * from marks_table where studentID = ?
> and subjectID in (?, ?, ?....?) but if its costly then can happily delegate
> the responsibility to the application layer.
>
> Haven't tried 2.x java driver for this specific issue but tried it once
> earlier and found the performance slower than 1.x; isn't so?
>
>
> On Sat, Mar 29, 2014 at 3:30 PM, Sourabh Agrawal 
> <iitr.sour...@gmail.com>wrote:
>
>> Hi Apoorva,
>>
>> Do you always query on studentID only or do you need to query on both
>> studentID and subjectID?
>>
>> Also, I think using the latest driver (2.x) can make querying large
>> number of rows efficient.
>> http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
>>
>>
>>
>>
>> On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav <
>> apoorva.gau...@myntra.com> wrote:
>>
>>> Hello Shrikar,
>>>
>>> Yes primary key is (studentID, subjectID). I had dropped the test table,
>>> recreating and populating it post which will share the cfhistogram. In such
>>> case is there any practical limit on the rows I should fetch, for e.g.
>>> should I do
>>>        select * form marks_table where studentID = ? limit 500;
>>> instead of doing
>>>        select * form marks_table where studentID = ?;
>>>
>>>
>>> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak <shrika...@gmail.com>wrote:
>>>
>>>> Hi Apoorva,
>>>>
>>>> I assume this is the table with studentId and subjectId  as primary
>>>> keys and not other like like marks in that.
>>>>
>>>> create table marks_table(studentId int, subjectId int, marks int,
>>>> PRIMARY KEY(studentId,subjectId));
>>>>
>>>> Also could you give the cfhistogram stats?
>>>>
>>>> nodetool cfhistograms <your keyspace> marks_table;
>>>>
>>>>
>>>>
>>>> Thanks,
>>>> Shrikar
>>>>
>>>>
>>>> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <
>>>> apoorva.gau...@myntra.com> wrote:
>>>>
>>>>> Hello All,
>>>>>
>>>>> We've a schema which can be modeled as (studentID, subjectID, marks)
>>>>> where combination of studentID and subjectID is unique. Number of 
>>>>> studentID
>>>>> can go up to 100 million and for each studentID we can have up to  10k
>>>>> subjectIDs.
>>>>>
>>>>> We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We
>>>>> are using a four node cluster, each having 24 cores and 32GB memory. I'm
>>>>> sure that the machines are not underperformant as on same test bed we've
>>>>> consistently received <5ms response times for ~1b documents when queried
>>>>> via primary key.
>>>>>
>>>>> I've tried three approaches, all of which result in significant
>>>>> deterioration (>500 ms response time) in read query performance once 
>>>>> number
>>>>> of subjectIDs goes past ~100 for a studentID. Approaches are :-
>>>>>
>>>>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int,
>>>>> int>) and query by subjectID
>>>>>
>>>>> 2. model as (studentID int, subjectID int, marks int, PRIMARY
>>>>> KEY(studentID, subjectID) and query as select * from marks_table where
>>>>> studentID = ?
>>>>>
>>>>> 3. model as (studentID int, subjectID int, marks int, PRIMARY
>>>>> KEY(studentID, subjectID) and query as select * from marks_table where
>>>>> studentID = ? and subjectID in (?, ?, ?....?)  number of subjectIDs in
>>>>> query being ~1K.
>>>>>
>>>>> What can be the bottlenecks. Is it better if we model as (studentID
>>>>> int, subjct_marks_json text) and query by studentID.
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Apoorva
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Apoorva
>>>
>>
>>
>>
>> --
>> Sourabh Agrawal
>> Bangalore
>> +91 9945657973
>>
>
>
>
> --
> Thanks & Regards,
> Apoorva
>



-- 
Sourabh Agrawal
Bangalore
+91 9945657973

Re: Read performance in map data type

Reply via email to