Re: Read performance in map data type
Hi Apoorva,

As per the cfhistogram, there are some rows with more than 75k columns, and around 150k reads hit 2 SSTables.

Are you sure that you are seeing more than 500ms latency? The cfhistogram shows the worst read latency was around 51ms, which looks reasonable given that many reads hit 2 SSTables.

Thanks,
Shrikar

On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav wrote:
> Hello Shrikar,
>
> We are still facing the read latency issue; here is the histogram:
> http://pastebin.com/yEvMuHYh

On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav wrote:
> Hello Shrikar,
>
> Yes, the primary key is (studentID, subjectID). I had dropped the test table;
> I am recreating and populating it, after which I will share the cfhistogram.
> In that case, is there any practical limit on the number of rows I should
> fetch? For example, should I do
>     select * from marks_table where studentID = ? limit 500;
> instead of
>     select * from marks_table where studentID = ?;

On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak wrote:
> Hi Apoorva,
>
> I assume this is the table with studentId and subjectId as the primary key,
> and no other columns (such as marks) in the key:
>
>     create table marks_table(studentId int, subjectId int, marks int,
>         PRIMARY KEY(studentId, subjectId));
>
> Also, could you share the cfhistogram stats?
>
>     nodetool cfhistograms marks_table;
>
> Thanks,
> Shrikar

On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav <apoorva.gau...@myntra.com> wrote:
> Hello All,
>
> We have a schema that can be modeled as (studentID, subjectID, marks), where
> the combination of studentID and subjectID is unique. The number of studentIDs
> can go up to 100 million, and each studentID can have up to 10k subjectIDs.
>
> We are using Apache Cassandra 2.0.4 and DataStax Java driver 1.0.4, on a
> four-node cluster where each node has 24 cores and 32GB of memory. I'm sure
> the machines are not underpowered, as on the same test bed we've consistently
> seen <5ms response times for ~1B documents queried via primary key.
>
> I've tried three approaches, all of which show a significant deterioration
> (>500ms response time) in read query performance once the number of subjectIDs
> for a studentID goes past ~100. The approaches are:
>
> 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int, int>)
>    and query by subjectID
>
> 2. model as (studentID int, subjectID int, marks int,
>    PRIMARY KEY(studentID, subjectID)) and query as
>    select * from marks_table where studentID = ?
>
> 3. model as (studentID int, subjectID int, marks int,
>    PRIMARY KEY(studentID, subjectID)) and query as
>    select * from marks_table where studentID = ? and subjectID in (?, ?, ...)
>    with the number of subjectIDs in the query being ~1K.
>
> What could the bottlenecks be? Is it better if we model as
> (studentID int, subjct_marks_json text) and query by studentID?
>
> --
> Thanks & Regards,
> Apoorva
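For concreteness, here is a minimal sketch of approach 2 and of the bounded query asked about above, written against the DataStax Java driver. The contact point and keyspace name are illustrative assumptions rather than values confirmed in this thread; the LIMIT of 500 mirrors the question quoted earlier.

    import com.datastax.driver.core.*;

    public class MarksTableSketch {
        public static void main(String[] args) {
            // Assumed contact point and keyspace; adjust for the real cluster.
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("school");   // hypothetical keyspace

            // Approach 2: one partition per student, one clustering row per subject.
            session.execute("CREATE TABLE IF NOT EXISTS marks_table (" +
                    "studentID int, subjectID int, marks int, " +
                    "PRIMARY KEY (studentID, subjectID))");

            // Unbounded read: the whole partition (up to ~10k subjects per student).
            PreparedStatement all = session.prepare(
                    "SELECT * FROM marks_table WHERE studentID = ?");

            // Bounded read, as asked above: cap the number of rows returned.
            PreparedStatement firstPage = session.prepare(
                    "SELECT * FROM marks_table WHERE studentID = ? LIMIT 500");

            for (Row row : session.execute(firstPage.bind(42))) {
                System.out.printf("subject=%d marks=%d%n",
                        row.getInt("subjectID"), row.getInt("marks"));
            }
            cluster.close();
        }
    }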
Re: Read performance in map data type
At the client side we are getting a latency of ~350ms. We are using DataStax driver 2.0.0 and have kept the fetch size at 500. These latencies occur while reading rows having ~200 columns.

On Thu, Apr 3, 2014 at 12:45 PM, Shrikar archak wrote:
> Are you sure that you are seeing more than 500ms latency? The cfhistogram
> shows the worst read latency was around 51ms, which looks reasonable given
> that many reads hit 2 SSTables.

--
Thanks & Regards,
Apoorva
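In case it helps pin down where the ~350ms goes, here is a hedged sketch of how a 500-row fetch size is typically applied with the 2.0 Java driver, and of timing the full iteration rather than only execute(). The contact point, keyspace, and table name are assumptions carried over from the earlier sketch.

    import com.datastax.driver.core.*;

    public class FetchSizeSketch {
        public static void main(String[] args) {
            // Global default: automatic paging pulls 500 rows per page.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")                        // assumed contact point
                    .withQueryOptions(new QueryOptions().setFetchSize(500))
                    .build();
            Session session = cluster.connect("school");                 // hypothetical keyspace

            PreparedStatement ps = session.prepare(
                    "SELECT * FROM marks_table WHERE studentID = ?");
            BoundStatement bound = ps.bind(42);
            bound.setFetchSize(500);   // per-statement override of the global default

            long start = System.nanoTime();
            int rows = 0;
            // Iterating past a page boundary triggers further synchronous fetches,
            // so time the whole iteration, not only execute(), when measuring latency.
            for (Row row : session.execute(bound)) {
                rows++;
            }
            System.out.printf("%d rows in %.1f ms%n", rows, (System.nanoTime() - start) / 1e6);
            cluster.close();
        }
    }

With ~200 rows per partition and a fetch size of 500, everything should come back in a single page, so paging itself is unlikely to explain the gap.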
Re: Read performance in map data type
How about the client-side socket limits, the client-side maximum connections per host, and the read consistency level?

~Shrikar

On Thu, Apr 3, 2014 at 12:20 AM, Apoorva Gaurav wrote:
> At the client side we are getting a latency of ~350ms. We are using DataStax
> driver 2.0.0 and have kept the fetch size at 500. These latencies occur while
> reading rows having ~200 columns.
Re: Read performance in map data type
client side socket limit: 64K
client side maximum connections per host: 8
read consistency level: QUORUM

On Thu, Apr 3, 2014 at 12:59 PM, Shrikar archak wrote:
> How about the client-side socket limits, the client-side maximum connections
> per host, and the read consistency level?

--
Thanks & Regards,
Apoorva
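For reference, a rough sketch of where those two driver-side settings (8 connections per host, QUORUM reads) live in the 2.0 Java driver; the contact point is an assumption.

    import com.datastax.driver.core.*;

    public class DriverSettingsSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")   // assumed contact point
                    // Default consistency level for requests issued through this cluster.
                    .withQueryOptions(new QueryOptions()
                            .setConsistencyLevel(ConsistencyLevel.QUORUM)
                            .setFetchSize(500))
                    .build();

            // Connection pool: at most 8 connections per host in the local datacenter.
            PoolingOptions pooling = cluster.getConfiguration().getPoolingOptions();
            pooling.setMaxConnectionsPerHost(HostDistance.LOCAL, 8);

            Session session = cluster.connect();
            System.out.println("Connected to " + cluster.getMetadata().getClusterName());
            cluster.close();
        }
    }

Note that with QUORUM reads and RF=3, every request has to wait for two replicas, so client-observed latency will generally sit above the single-node numbers that cfhistograms reports.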
EC2 zones, snitches and vnodes
Hi,

We are using Cassandra 1.2.11 on AWS EC2. I read that we can use different availability zones to be more "crash tolerant": basically, using RF=3 and placing servers in different zones, like node1-zoneA, node2-zoneB, node3-zoneC, node4-zoneA, and so on. As replicas are placed on the next servers in the ring, we can manage to have one replica of each row in each zone, so all the data is present in the 3 zones (A, B and C).

Is there a way to achieve this exact same result using vnodes? Would it be possible, using EC2Snitch (or EC2MultiRegionSnitch) and vnodes, to make sure replicas are spread among the available availability zones automatically?

I know that a second DC would help recovering from a zone outage, but the cost is not the same.
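If it is useful, here is a sketch of the keyspace definition this usually comes down to: with Ec2Snitch the EC2 region becomes the datacenter name and the availability zone becomes the rack, and NetworkTopologyStrategy tries to place replicas on distinct racks. The keyspace name and region are assumptions, and this is a sketch rather than a guarantee of perfect zone balance.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class Ec2KeyspaceSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // With Ec2Snitch, the datacenter name is the EC2 region (e.g. "us-east")
            // and the rack is the availability zone (e.g. "1a", "1b", "1c").
            // NetworkTopologyStrategy places replicas on distinct racks when possible,
            // which is what spreads the 3 copies across zones A, B and C.
            session.execute(
                "CREATE KEYSPACE my_ks WITH replication = " +            // hypothetical name
                "{'class': 'NetworkTopologyStrategy', 'us-east': 3}");   // assumed region

            cluster.close();
        }
    }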
Row_key from sstable2json to actual value of the key
sstable2json tomcat-t5-ic-1-Data.db -e

gives me:

0021
001f
0020

How do I convert these (hex) values to the actual value of the column, so I can do the following?

select * from tomcat.t5 where c1='converted value';

Thanks in advance for the help.
Upgrading Cassandra
Hi,

As we are using Cassandra 1.2.11 and we will want to move to 2.1 as soon as it is released and considered stable enough, we will have to make a few migrations:

1.2.11 --> 1.2.last (1.2.16 currently)
1.2.last --> 2.0.last
2.0.last --> 2.1.last

The point is, we have run into a lot of migrations since we started (C* 0.7 --> 0.8 --> 1.0 --> 1.1 --> 1.2). Not even one of our migrations went totally well. Even if we never lost any data, which is a very important thing, we never upgraded Cassandra smoothly either. We are supposed to be able to do it through a rolling restart; we never could. We always ran into errors and ended up restarting the whole cluster. I suppose we are not the only ones in this situation, since I remember a Spotify conference, "How not to use Cassandra", where they described the same kind of problems.

My question is quite simple: since we use AWS EC2 instances, is it possible to upgrade Cassandra through a new DC, as is recommended while switching to vnodes? (http://www.datastax.com/documentation/cassandra/1.2/cassandra/configuration/configVnodesProduction_t.html)

I think that if it is possible, we would do the first minor upgrade through a rolling restart (those are generally smooth enough) and the 2 major upgrades through "DC migrations".

Alain
Re: Row_key from sstable2json to actual value of the key
Hey ng,

You can use CQL and have Cassandra do the conversion if you would like. If your table uses int-typed keys:

select * from tomcat.t5 where c1 = blobAsInt(0x0021);

The relevant section of the CQL3 docs is here:
http://cassandra.apache.org/doc/cql3/CQL.html#blobFun

You can use the blobAs... functions for any type. I hope this helps.

On 04/03/2014 08:50 AM, ng wrote:
> sstable2json tomcat-t5-ic-1-Data.db -e gives me 0021, 001f, 0020. How do I
> convert these (hex) values to the actual value of the column?

--
Colin Blower
Software Engineer
Barracuda Networks Inc.
Re: Upgrading Cassandra
On Thu, Apr 3, 2014 at 8:56 AM, Alain RODRIGUEZ wrote:
> Since we use AWS EC2 instances, is it possible to upgrade Cassandra through
> a new DC, as is recommended while switching to vnodes?

No, because bootstrapping (and rebuilding/repairing/etc.) on a split-major-version cluster is not supported.

=Rob
Re: EC2 zones, snitches and vnodes
On Thu, Apr 3, 2014 at 8:38 AM, Alain RODRIGUEZ wrote:
> Is there a way to achieve this exact same result using vnodes? Would it be
> possible, using EC2Snitch (or EC2MultiRegionSnitch) and vnodes, to make sure
> replicas are spread among the available availability zones automatically?

https://issues.apache.org/jira/browse/CASSANDRA-4658
https://issues.apache.org/jira/browse/CASSANDRA-4123

And especially...

https://issues.apache.org/jira/browse/CASSANDRA-3810

tl;dr - status quo NTS is probably fine and does more or less what you need it to do, caveated by 3810. But in theory, far-future versions of Cassandra might have slightly improved functionality.

=Rob
Re: Read performance in map data type
On Thu, Apr 3, 2014 at 12:20 AM, Apoorva Gaurav wrote:
> At the client side we are getting a latency of ~350ms. We are using DataStax
> driver 2.0.0 and have kept the fetch size at 500. These latencies occur while
> reading rows having ~200 columns.

And are you sure that the 300ms gap between what Cassandra reports and what your app reports is not just network/serialization time?

What do you believe the latency "should" be?

=Rob
Using C* and CAS to coordinate workers
Hi,

Maybe someone knows a nice solution to the following problem: I have N worker processes that are intentionally masterless and do not know about each other; they are stateless and independent instances of a given service system. These workers need to poll an event feed, say about every 10 seconds, and persist a state after processing the polled events so the next worker knows where to continue processing.

I would like to use C*'s CAS feature to coordinate the workers and protect the shared state (which would itself be a row or cell in a C* keyspace).

Has anybody done something similar and can suggest a 'clever' data model design and interaction?

Jan
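Not a definitive answer, but one common sketch under stated assumptions: a single row per feed holds the last processed position plus a lease column, and each worker tries to take the lease with a conditional (CAS) update before polling. The keyspace, table, column, and worker names below are all hypothetical.

    import com.datastax.driver.core.*;

    public class CasLeaseSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("coordination");   // hypothetical keyspace

            // One row per feed: who holds the lease and where processing left off.
            session.execute("CREATE TABLE IF NOT EXISTS feed_state (" +
                    "feed text PRIMARY KEY, owner text, position bigint)");
            session.execute("INSERT INTO feed_state (feed, owner, position) " +
                    "VALUES ('events', null, 0) IF NOT EXISTS");

            String me = "worker-" + java.util.UUID.randomUUID();

            // Try to acquire the lease with compare-and-set: succeeds only if it is free.
            Row grab = session.execute(
                    "UPDATE feed_state SET owner = '" + me + "' " +
                    "WHERE feed = 'events' IF owner = null").one();
            if (grab.getBool("[applied]")) {
                long position = session.execute(
                        "SELECT position FROM feed_state WHERE feed = 'events'")
                        .one().getLong("position");
                // ... poll the event feed from `position` and process events ...
                long newPosition = position + 100;   // placeholder for real progress

                // Persist progress and release the lease, again guarded by CAS so a
                // worker that lost its lease cannot overwrite someone else's state.
                session.execute(
                        "UPDATE feed_state SET position = " + newPosition + ", owner = null " +
                        "WHERE feed = 'events' IF owner = '" + me + "'");
            }
            cluster.close();
        }
    }

In practice you would also want a timestamp or TTL on the lease so a crashed worker does not hold it forever, and prepared statements instead of string concatenation.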
Re: Read performance in map data type
On Fri, Apr 4, 2014 at 3:32 AM, Robert Coli wrote:
> And are you sure that the 300ms gap between what Cassandra reports and what
> your app reports is not just network/serialization time?

We are using the DataStax 2.0.0 driver, and this latency is for the execute command. We already have the prepared statement cached in the app layer before calling execute.

> What do you believe the latency "should" be?

If we store the same data as JSON using the text data type, i.e. (studentID int, subjectMarksJson text), we get a latency of ~10ms from the same client, even for bigger rows. I understand that JSON is not the preferred storage format for Cassandra and that we lose the flexibility which a proper tabular approach provides, but such a huge jump in read latency is a killer. I'm pastebin-ing the histogram for the JSON storage as well: http://pastebin.com/RiW6hMb2

--
Thanks & Regards,
Apoorva
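For comparison, a rough sketch of the JSON-blob model described above, where all of a student's marks live in one text column and the map is parsed in the application; the keyspace name and contact point are assumptions, and JSON parsing is left out.

    import com.datastax.driver.core.*;

    public class JsonBlobModelSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("school");   // hypothetical keyspace

            // All of a student's marks serialized into one text column.
            session.execute("CREATE TABLE IF NOT EXISTS marks_json (" +
                    "studentID int PRIMARY KEY, subjectMarksJson text)");

            // The whole partition is a single cell, so a read is one column lookup;
            // the {subjectID: marks} map is then parsed in the application layer.
            Row row = session.execute(
                    session.prepare("SELECT subjectMarksJson FROM marks_json WHERE studentID = ?")
                            .bind(42)).one();
            String json = (row == null) ? null : row.getString("subjectMarksJson");
            System.out.println(json);
            cluster.close();
        }
    }

The trade-off is that updating a single subject's marks means reading, modifying, and rewriting the whole blob, which is exactly the flexibility the clustered model keeps and the blob model gives up.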