Re: Read performance in map data type

2014-04-03 Thread Shrikar archak
Hi Apoorva,
As per the cfhistograms output, there are some rows which have more than 75k
columns, and around 150k reads hit 2 SSTables.

Are you sure that you are seeing more than 500ms latency? The cfhistograms
output shows that the worst read latency was around 51ms, which looks
reasonable with many reads hitting 2 SSTables.

Thanks,
Shrikar


On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav
wrote:

> Hello Shrikar,
>
> We are still facing read latency issue, here is the histogram
> http://pastebin.com/yEvMuHYh
>
>
> On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav wrote:
>
>> Hello Shrikar,
>>
>> Yes, the primary key is (studentID, subjectID). I had dropped the test table;
>> I am recreating and populating it, after which I will share the cfhistograms
>> output. In that case, is there any practical limit on the rows I should
>> fetch? For example, should I do
>>   select * from marks_table where studentID = ? limit 500;
>> instead of doing
>>   select * from marks_table where studentID = ?;
>>
>>
>> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak wrote:
>>
>>> Hi Apoorva,
>>>
>>> I assume this is the table with studentId and subjectId as the primary key,
>>> and that other columns such as marks are not part of it.
>>>
>>> create table marks_table(studentId int, subjectId int, marks int,
>>> PRIMARY KEY(studentId,subjectId));
>>>
>>> Also could you give the cfhistogram stats?
>>>
>>> nodetool cfhistograms <keyspace> marks_table;
>>>
>>>
>>>
>>> Thanks,
>>> Shrikar
>>>
>>>
>>> On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <
>>> apoorva.gau...@myntra.com> wrote:
>>>
>>>> Hello All,
>>>>
>>>> We have a schema which can be modeled as (studentID, subjectID, marks),
>>>> where the combination of studentID and subjectID is unique. The number of
>>>> studentIDs can go up to 100 million, and for each studentID we can have up
>>>> to 10k subjectIDs.
>>>>
>>>> We are using Apache Cassandra 2.0.4 and the DataStax Java driver 1.0.4 on
>>>> a four-node cluster, each node having 24 cores and 32GB of memory. I'm
>>>> sure that the machines are not underperforming, as on the same test bed
>>>> we've consistently received <5ms response times for ~1b documents when
>>>> queried via primary key.
>>>>
>>>> I've tried three approaches, all of which result in significant
>>>> deterioration (>500 ms response time) in read query performance once the
>>>> number of subjectIDs goes past ~100 for a studentID. The approaches are:
>>>>
>>>> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int,
>>>> int>) and query by subjectID
>>>>
>>>> 2. model as (studentID int, subjectID int, marks int, PRIMARY
>>>> KEY(studentID, subjectID)) and query as select * from marks_table where
>>>> studentID = ?
>>>>
>>>> 3. model as (studentID int, subjectID int, marks int, PRIMARY
>>>> KEY(studentID, subjectID)) and query as select * from marks_table where
>>>> studentID = ? and subjectID in (?, ?, ...), the number of subjectIDs in
>>>> the query being ~1K.
>>>>
>>>> What could the bottlenecks be? Would it be better to model this as
>>>> (studentID int, subjct_marks_json text) and query by studentID?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Apoorva

>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Apoorva
>>
>
>
>
> --
> Thanks & Regards,
> Apoorva
>


Re: Read performance in map data type

2014-04-03 Thread Apoorva Gaurav
On the client side we are seeing a latency of ~350ms. We are using the
DataStax driver 2.0.0 with the fetch size set to 500, and these latencies
occur while reading rows having ~200 columns.
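
As a rough illustration of how the fetch size relates to the row counts mentioned in this thread (a sketch, not a measurement): with automatic paging, a result set comes back in pages of at most "fetch size" rows, so larger partitions need more round trips. The helper name pages_needed is hypothetical, and the sketch assumes one round trip per page.

```python
import math


def pages_needed(row_count: int, fetch_size: int) -> int:
    """Number of result pages needed to read row_count rows."""
    if fetch_size <= 0:
        raise ValueError("fetch_size must be positive")
    return math.ceil(row_count / fetch_size)


FETCH_SIZE = 500  # the fetch size quoted in this thread

# Row counts taken from the thread: ~200 per row in the slow reads,
# up to 10k subjectIDs per studentID, and the 75k-column rows in cfhistograms.
for rows in (200, 10_000, 75_000):
    print(rows, "rows ->", pages_needed(rows, FETCH_SIZE), "page(s)")
```

Under these assumptions a ~200-column row fits in a single page, so paging alone would not explain the latency seen for those reads.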


On Thu, Apr 3, 2014 at 12:45 PM, Shrikar archak  wrote:

> Hi Apoorva,
> As per the cfhistogram there are some rows which have more than 75k
> columns and around 150k reads hit 2 SStables.
>
> Are you sure that you are seeing more than 500ms latency?  The cfhistogram
> shows the worst read performance was around 51ms
> which looks reasonable with many reads hitting 2 sstables.
>
> Thanks,
> Shrikar


-- 
Thanks & Regards,
Apoorva


Re: Read performance in map data type

2014-04-03 Thread Shrikar archak
What about the client-side socket limits, the Cassandra client's maximum
connections per host, and the read consistency level?

~Shrikar


On Thu, Apr 3, 2014 at 12:20 AM, Apoorva Gaurav
wrote:

> At the client side we are getting a latency of ~350ms, we are using
> datastax driver 2.0.0 and have kept the fetch size as 500. And these are
> coming while reading rows having ~200 columns.


Re: Read performance in map data type

2014-04-03 Thread Apoorva Gaurav
client side socket limit : 64K
client side maximum connection per host : 8
read consistency level : Quorum


On Thu, Apr 3, 2014 at 12:59 PM, Shrikar archak  wrote:

> How about the client side socket limits? Cassandra client side maximum
> connection per host and read consistency level?
>
> ~Shrikar


-- 
Thanks & Regards,
Apoorva


EC2 zones, snitches and vnodes

2014-04-03 Thread Alain RODRIGUEZ
Hi,

We are using Cassandra 1.2.11 on AWS EC2 services.

I read that we can use different availability zones (AZs) to be more "crash
tolerant". Basically, this means using RF=3 and placing servers in different
zones, like node1-zoneA node2-zoneB node3-zoneC node4-zoneA... As replicas are
placed on the next servers in the ring, we can arrange to have one replica of
every range in each zone, so all the data is present in the 3 zones (A, B and
C).
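
The alternating placement described above can be checked with a small sketch. It assumes single-token nodes and a placement rule that simply takes the next RF nodes clockwise on the ring; the six-node ring, the node names and the function names are hypothetical.

```python
def replicas(ring, start, rf):
    """Return the rf nodes found clockwise from position `start` on the ring."""
    return [ring[(start + i) % len(ring)] for i in range(rf)]


def zones_per_range(ring, rf):
    """For every ring position, the set of zones holding a replica."""
    return [{zone for _, zone in replicas(ring, start, rf)}
            for start in range(len(ring))]


# Hypothetical ring in token order: (node name, availability zone).
RING = [("node1", "A"), ("node2", "B"), ("node3", "C"),
        ("node4", "A"), ("node5", "B"), ("node6", "C")]

for start, zones in enumerate(zones_per_range(RING, rf=3)):
    print(RING[start][0], "->", sorted(zones))
```

With zones alternating A, B, C around the ring, every window of three consecutive nodes covers all three zones. With vnodes the token order is randomized, so this simple next-node rule no longer guarantees that on its own.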

Is there a way to achieve this exact same result using vnodes? Would it be
possible, using Ec2Snitch (or Ec2MultiRegionSnitch) and vnodes, to make
sure replicas are spread among the available AZs automatically?

I know that a second DC would help in recovering from a zone outage, but the
cost is not the same.


Row_key from sstable2json to actual value of the key

2014-04-03 Thread ng
sstable2json tomcat-t5-ic-1-Data.db -e
gives me

0021
001f
0020


How do I convert these hex values to the actual value of the column, so that
I can run the query below?

select * from tomcat.t5 where c1='converted value';

Thanks in advance for the help.


Upgrading Cassandra

2014-04-03 Thread Alain RODRIGUEZ
Hi

As we are using Cassandra 1.2.11 and want to move to 2.1 as soon as it is
released and considered stable enough, we will have to perform a few
migrations:

1.2.11   --> 1.2.last (16 currently)
1.2.last -->  2.0.last
2.0.last -->  2.1.last

The point is that we have been through a lot of migrations since we started
(C* 0.7 --> 0.8 --> 1.0 --> 1.1 --> 1.2), and not one of them went completely
well. Although we never lost any data, which is very important, we never
upgraded Cassandra smoothly either. We are supposed to be able to do it
through a rolling restart, but we never could: we always ran into errors and
ended up restarting the whole cluster.

I suppose we are not the only ones in this situation, since I remember a
Spotify conference talk, "How not to use Cassandra", where they described the
same kind of problems.

My question is quite simple:

Since we use AWS EC2 instances, is it possible to upgrade Cassandra through
a new DC, as is recommended when switching to vnodes?

(
http://www.datastax.com/documentation/cassandra/1.2/cassandra/configuration/configVnodesProduction_t.html
)

I think that, if it is possible, we would do the first minor upgrade
through a rolling restart (those are generally smooth enough) and the 2
major upgrades through "DC migrations".

Alain


Re: Row_key from sstable2json to actual value of the key

2014-04-03 Thread Colin Blower
Hey ng,

You can have CQL and Cassandra do the conversion if you would like. If
your table uses int type keys:
> select * from tomcat.t5 where c1 = blobAsInt(0x0021);

The relevant section of the CQL3 docs is here:
http://cassandra.apache.org/doc/cql3/CQL.html#blobFun

You can use blobAs... for any type. I hope this helps.
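
If you would rather decode the keys outside of Cassandra, something like the sketch below might do. It assumes the hex strings are the raw key bytes and that the key is either a big-endian signed integer or UTF-8 text; the function names are hypothetical, and the key 666f6f is a made-up example, not a value from the sstable2json output.

```python
def hex_key_to_int(hex_key: str) -> int:
    """Decode a hex-encoded key as a big-endian signed integer."""
    return int.from_bytes(bytes.fromhex(hex_key), byteorder="big", signed=True)


def hex_key_to_text(hex_key: str) -> str:
    """Decode a hex-encoded key as UTF-8 text."""
    return bytes.fromhex(hex_key).decode("utf-8")


# The three keys printed by sstable2json in the question.
for key in ("0021", "001f", "0020"):
    print(key, "->", hex_key_to_int(key))  # 33, 31 and 32

# Hypothetical text key, for tables whose key column is text.
print(hex_key_to_text("666f6f"))  # foo
```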


On 04/03/2014 08:50 AM, ng wrote:
> sstable2json tomcat-t5-ic-1-Data.db -e
> gives me
>  
> 0021
> 001f
> 0020
>  
>  
> How do I convert this (hex) to actual value of column so I can do below
>  
> select * from tomcat.t5 where c1='converted value';
>  
> Thanks in advance for the help.
>


-- 
*Colin Blower*
/Software Engineer/
Barracuda Networks Inc.
+1 408-342-5576 (o)



Re: Upgrading Cassandra

2014-04-03 Thread Robert Coli
  On Thu, Apr 3, 2014 at 8:56 AM, Alain RODRIGUEZ wrote:

> Since we use AWS EC2 instances, Is it possible to upgrade Cassandra
> through a new DC as it is recommended while switching to vnodes ?
>

No, because bootstrapping (and rebuilding/repairing/etc.) on a
split-major-version cluster is not supported.

=Rob


Re: EC2 zones, snitches and vnodes

2014-04-03 Thread Robert Coli
On Thu, Apr 3, 2014 at 8:38 AM, Alain RODRIGUEZ  wrote:

> I read that we can use different A-Z to be more "crash tolerant".
> Basically, using a RF=3 and placing servers into different zones like
> node1-zoneA node2-zoneB node3-zoneC node4-zoneA... As replicas are placed
> to the next server in the ring we can manage to have a replica in each zone
> and for each replica. So all the data is present in the 3 zones (A, B and
> C).
>
> Is there a way to achieve this exact same result using vnodes ? Would it
> be possible, using EC2Snitch (or EC2MultiRegionSnitch) and vnodes, to make
> sure to spread replicas among available A-Z automatically ?
>

https://issues.apache.org/jira/browse/CASSANDRA-4658
https://issues.apache.org/jira/browse/CASSANDRA-4123

And especially...

https://issues.apache.org/jira/browse/CASSANDRA-3810

tl;dr - status quo NTS is probably fine and does more or less what you need
it to do, caveated by 3810. But in theory far future versions of cassandra
might have slightly improved functionality.

=Rob


Re: Read performance in map data type

2014-04-03 Thread Robert Coli
On Thu, Apr 3, 2014 at 12:20 AM, Apoorva Gaurav
wrote:

> At the client side we are getting a latency of ~350ms, we are using
> datastax driver 2.0.0 and have kept the fetch size as 500. And these are
> coming while reading rows having ~200 columns.
>

And you're sure that the 300ms between what cassandra reports and what your
app reports are not just network/serialization time?

What do you believe the latency "should" be?

=Rob


Using C* and CAS to coordinate workers

2014-04-03 Thread Jan Algermissen
Hi,

maybe someone knows a nice solution to the following problem:

I have N worker processes that are intentionally masterless and do not know 
about each other - they are stateless and independent instances of a given 
service system.

These workers need to poll an event feed, say about every 10 seconds and 
persist a state after processing the polled events so the next worker knows 
where to continue processing events.

I would like to use C*’s CAS feature to coordinate the workers and protect the 
shared state (a row or cell in  a C* key space, too).

Has anybody done something similar and can suggest a ‘clever’ data model design 
and interaction?
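
To make the question concrete, here is a minimal in-memory sketch of the kind of interaction I have in mind. It is only an illustration: CheckpointCell stands in for a single C* row updated with a conditional write, all names are hypothetical, and unlike the description above the worker claims a batch with CAS before processing it rather than after.

```python
import threading


class CheckpointCell:
    """In-memory stand-in for one shared row updated with compare-and-set."""

    def __init__(self, offset=0):
        self._offset = offset
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._offset

    def compare_and_set(self, expected, new):
        """Set the offset to `new` only if it still equals `expected`."""
        with self._lock:
            if self._offset != expected:
                return False
            self._offset = new
            return True


def poll_once(cell, feed, batch_size, processed):
    """One polling round; returns True if this worker advanced the checkpoint."""
    start = cell.read()
    batch = feed[start:start + batch_size]
    if not batch:
        return False
    # Claim the batch first; a worker that loses the race skips this round.
    if not cell.compare_and_set(start, start + len(batch)):
        return False
    processed.extend(batch)
    return True


def run_workers(feed, worker_count=4, batch_size=7):
    """Run independent workers until the whole feed has been claimed."""
    cell = CheckpointCell()
    processed = []

    def worker():
        while cell.read() < len(feed):
            poll_once(cell, feed, batch_size, processed)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return cell.read(), processed


final_offset, handled = run_workers(list(range(100)))
print(final_offset, len(handled))
```

Claiming first means an event is never handled twice, but a worker that crashes after the claim loses its batch. Processing first and persisting afterwards avoids that loss, at the price of requiring idempotent event handling.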



Jan

Re: Read performance in map data type

2014-04-03 Thread Apoorva Gaurav
On Fri, Apr 4, 2014 at 3:32 AM, Robert Coli  wrote:

> On Thu, Apr 3, 2014 at 12:20 AM, Apoorva Gaurav wrote:
>
>> At the client side we are getting a latency of ~350ms, we are using
>> datastax driver 2.0.0 and have kept the fetch size as 500. And these are
>> coming while reading rows having ~200 columns.
>>
>
> And you're sure that the 300ms between what cassandra reports and what
> your app reports are not just network/serialization time?
>
We are using the DataStax 2.0.0 driver. This latency is for the execute
call; the prepared statement is already cached in the app layer before we
call execute.

>
> What do you believe the latency "should" be?
>
If we store the same data as JSON using the text data type, i.e. (studentID
int, subjectMarksJson text), we get a latency of ~10ms from the same client,
even for bigger rows. I understand that JSON is not the preferred storage
format for Cassandra and that we will lose much of the flexibility which a
proper tabular approach provides, but such a huge jump in read latency is a
killer. I'm pastebin-ing the histogram for the JSON storage as well:
http://pastebin.com/RiW6hMb2

>
> =Rob
>
>



-- 
Thanks & Regards,
Apoorva