Hi Jack,
> So, your 1GB input size means roughly 716 thousand rows of data and 128GB
> means roughly 92 million rows, correct?
Yes, that's correct.
> Are your gets and searches returning single rows, or a significant number of
> rows?
Like I mentioned in my first email, get always returns a s
Thanks for that clarification.
So, your 1GB input size means roughly 716 thousand rows of data and 128GB
means roughly 92 million rows, correct?
FWIW, a best-practice recommendation is to avoid secondary
indexes in favor of "query tables": store the same data in multiple
tables
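For illustration, the "query tables" idea looks roughly like this in pycassa (the keyspace and column family names here are made up, and both CFs are assumed to already exist): the same record is written under two different row keys, so each lookup stays a single-row read instead of a secondary-index scan.

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    # Hypothetical keyspace/CF names; both CFs hold the same user data,
    # each keyed by a different lookup attribute ("query tables").
    pool = ConnectionPool('Demo', server_list=['localhost:9160'])
    users_by_id = ColumnFamily(pool, 'users_by_id')
    users_by_email = ColumnFamily(pool, 'users_by_email')

    def save_user(user_id, email, name):
        record = {'user_id': user_id, 'email': email, 'name': name}
        # Write the full record under both keys instead of relying on a
        # secondary index over 'email'.
        users_by_id.insert(user_id, record)
        users_by_email.insert(email, record)

    def user_by_email(email):
        # A single-row lookup; no secondary index involved.
        return users_by_email.get(email)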
To clarify: Input size is the size of the dataset as a CSV file, before loading
it into Cassandra; for each input size, the number of columns is fixed but the
number of rows is different. By 1.5KB record, I meant that each row, when
represented as a CSV entry, occupies 1500 bytes. I've used the
What exactly is "input size" here (1GB to 128GB)? I mean, the test spec says "The
dataset used comprises of ~1.5KB records... there are 105 attributes in
each record." Does each test run have exactly the same number of rows and
columns and you're just making each column bigger, or what?
Cassandra does
I think you actually get a really useful metric by benchmarking 1 machine.
You understand your cluster's theoretical maximum performance, which would
be roughly (number of nodes) * (per-node queries per second). Yes, adding in replication and CL is
important, but 1 machine lets you isolate certain performance metrics.
On Thu, J
I disagree. I think that you can extrapolate very little information about RF>1
and CL>1 by benchmarking with RF=1 and CL=1.
On Jan 13, 2016, at 8:41 PM, Anurag Khandelwal <anur...@berkeley.edu> wrote:
Hi John,
Thanks for responding!
The aim of this benchmark was not to benchmark Cassa
Hi John,
Thanks for responding!
The aim of this benchmark was not to benchmark Cassandra as an end-to-end
distributed system, but to understand a breakdown of the performance. For
instance, if we understand the performance characteristics that we can expect
from a single-machine Cassandra ins
Anurag,
Unless you are planning on continuing to use only one machine with RF=1,
benchmarking a single system with RF=Consistency=1 is mostly a waste of
time. If you are going to use RF=1 and a single host, then why use Cassandra
at all? A plain old relational DB should do the job just fine.
Cassan
John,
Yep, that makes perfect sense. Thank you for your time; I appreciate it!
From: John Anderstedt [mailto:john.anderst...@svenskaspel.se]
Sent: Friday, January 24, 2014 9:08 AM
To: user@cassandra.apache.org
Subject: Re: Cassandra Performance Testing
It sounds to me that the limitation in this
It sounds to me that the limitation in this setup is the disks.
If it's in a mirror, the cost for writes is double.
If you have the flat file and the DB on the same disk, there will be a lot of I/O
wait.
There is also a question of disk space and fragmentation, if the flat file
occupies 1.2TB o
Thanks all for your responses. We've downgraded from 2.0.3 to 2.0.0 and
everything became normal.
2013/12/8 Nate McCall
> If you are really set on using Cassandra as a cache, I would recommend
> disabling durable writes for the keyspace(s)[0]. This will bypass the
> commitlog (the flushing/rota
If you are really set on using Cassandra as a cache, I would recommend
disabling durable writes for the keyspace(s)[0]. This will bypass the
commitlog (the flushing/rotation of which may be a good-sized portion of
your performance problems given the number of tables).
[0]
http://www.datastax.com/do
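As a rough sketch of what that change can look like from pycassa's SystemManager (the keyspace name 'Cache' is made up, and it is an assumption that this pycassa build passes durable_writes through to the underlying keyspace definition; the same change can also be made from the CLI/CQL):

    from pycassa.system_manager import SystemManager

    sys_mgr = SystemManager('localhost:9160')
    # Assumption: durable_writes is accepted as a keyspace attribute here.
    # Disabling it skips the commitlog for this keyspace, so anything not yet
    # flushed to sstables is lost on a crash -- usually acceptable for a cache.
    sys_mgr.alter_keyspace('Cache', durable_writes=False)
    sys_mgr.close()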
On Thu, Dec 5, 2013 at 6:33 AM, Alexander Shutyaev wrote:
> We've plugged it into our production environment as a cache in front of
> postgres. Everything worked fine, we even stressed it by explicitly
> propagating about 30G (10G/node) data from postgres to cassandra.
>
If you just want a cachin
Thanks for your answers,
Jonathan, yes it was load avg and iowait was lower than 2% all that time -
the only load was the user one.
Robert, we had -Xmx4012m which was automatically calculated by the default
cassandra-env.sh (1/4 of total memory - 16G) - we didn't change that.
2013/12/5 Robert C
On Thu, Dec 5, 2013 at 4:33 AM, Alexander Shutyaev wrote:
> Cassandra version is 2.0.3. ... We've plugged it into our production
> environment as a cache in front of postgres.
>
https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
> What can be the reason? Can it be high n
Do you mean high CPU usage or high load avg? (20 indicates load avg to
me). High load avg means the CPU is waiting on something.
Check "iostat -dmx 1 100" to check your disk stats, you'll see the columns
that indicate mb/s read & write as well as % utilization.
Once you understand the bottlenec
You should be able to set the key_validation_class on the column family to
use a different data type for the row keys. You may not be able to change
this for a CF with existing data without some troubles due to a mismatch of
data types; if that's a concern you'll have to create a separate CF and
m
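A rough pycassa sketch of the "create a separate CF and migrate" route (keyspace/CF names and the LongType target are invented for illustration; a real migration over a large CF would also want batching and error handling):

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily
    from pycassa.system_manager import SystemManager, LONG_TYPE, UTF8_TYPE

    KEYSPACE = 'MyKS'  # hypothetical

    # New CF whose row keys are validated as longs instead of the old type.
    sys_mgr = SystemManager('localhost:9160')
    sys_mgr.create_column_family(KEYSPACE, 'events_v2',
                                 key_validation_class=LONG_TYPE,
                                 comparator_type=UTF8_TYPE)
    sys_mgr.close()

    pool = ConnectionPool(KEYSPACE, server_list=['localhost:9160'])
    old_cf = ColumnFamily(pool, 'events')
    new_cf = ColumnFamily(pool, 'events_v2')

    # Copy rows across, converting the key type as we go.
    for key, columns in old_cf.get_range():
        new_cf.insert(long(key), columns)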
Thanks all for the help.
I ran the traffic over the weekend; surprisingly, my heap was doing OK
(around 5.7G of 8G), but GC activity went nuts and dropped the throughput. I
will probably increase the number of nodes.
The other interesting thing I noticed was that there were some objects with
finaliz
I believe you should roll out more nodes as a temporary fix to your problem;
400GB on all nodes means (as correctly mentioned in other mails of this thread)
you are spending more time on GC. Check out the second comment in this link by
Aaron Morton; he says that more than 300GB can be problematic
One or more of these might be effective depending on your particular usage
- remove data (rows especially)
- add nodes
- add ram (has limitations)
- reduce bloom filter space used by increasing fp chance
- reduce row and key cache sizes
- increase index sample ratio
- reduce compaction concurrency
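As one concrete example of the bloom filter item in the list above, here is a hedged pycassa sketch (assuming your pycassa/Cassandra versions accept bloom_filter_fp_chance as a column family attribute; keyspace/CF names are placeholders). The change only takes effect as sstables are rewritten, e.g. by compaction or scrub/upgradesstables.

    from pycassa.system_manager import SystemManager

    sys_mgr = SystemManager('localhost:9160')
    # Assumption: bloom_filter_fp_chance is passed through to the CF definition
    # (Cassandra 1.0+). A higher false-positive chance shrinks the bloom filters
    # held on heap, trading a bit of extra disk I/O for memory.
    sys_mgr.alter_column_family('MyKS', 'events', bloom_filter_fp_chance=0.1)
    sys_mgr.close()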
You are right, it looks like I am doing a lot of GC. Is there any
short-term solution for this other than bumping up the heap? Because even
if I increase the heap I will run into the same issue; only the time before
I hit OOM will be lengthened.
It will be a while before we go to the latest and greate
Sounds like you're spending all your time in GC, which you can verify
by checking what GCInspector and StatusLogger say in the log.
The fix is to increase your heap size or upgrade to 1.2:
http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2
On Wed, May 29, 2013 at 11:32 PM, srmore
> "select CPUTime,User,site from CF(or tablename) where user=xxx and
> Jobtype=xxx"
Even though Cassandra has tables and looks like an RDBMS, it's not.
Queries with multiple secondary index clauses will not perform as well as those
with none.
There is plenty of documentation here http://www.da
The biggest advantage of Cassandra is its ability to scale linearly as more
nodes are added and its ability to handle node failures.
Also, to get the maximum performance from Cassandra you need to be making
multiple requests in parallel (a client-side sketch follows below).
On Sun, Mar 24, 2013 at 3:15 AM, 张刚 wrote:
> Hello,
> I am
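For instance, a simple way to get that parallelism from a Python client is to batch lookups rather than issuing them one key at a time; a small pycassa sketch (keyspace, CF, and key names are invented):

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('Jobs', server_list=['node1:9160', 'node2:9160'])
    cf = ColumnFamily(pool, 'job_records')

    # One multiget asks for many rows in a single request instead of paying
    # a round trip per key; several such calls from worker threads can then
    # run concurrently.
    keys = ['job-%d' % i for i in range(100)]
    rows = cf.multiget(keys)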
For example, each row represents a job record; it has fields like
"user", "site", "CPUTime", "datasize", "JobType"...
The fields in the CF are fixed, just like a table. The query looks like this: "select
CPUTime,User,site from CF (or tablename) where user=xxx and Jobtype=xxx"
Best regards
2013/3/24 cem
> Hi,
>
> Co
Hi,
Could you provide some other details about your schema design and queries?
It is very hard to tell anything.
Regards,
Cem
On Sun, Mar 24, 2013 at 12:40 PM, dong.yajun wrote:
> Hello,
>
> I'd suggest you take a look at the difference between NoSQL and RDBMS.
>
> Best,
>
> On Sun, Mar 24, 2
Hello,
I'd suggest you take a look at the difference between NoSQL and RDBMS.
Best,
On Sun, Mar 24, 2013 at 5:15 PM, 张刚 wrote:
> Hello,
> I am new to Cassandra. I am doing some tests on a single machine. I installed
> Cassandra from a binary tarball distribution.
> I created a CF to store the data that
Hi,
Thanks for the information.
I upgraded my Cassandra version to 1.2.0 and tried running the
experiment again to gather the statistics.
My application took nearly 529 seconds for querying 76896 keys.
Please find the statistics below for 32 threads (where
each thread queries 76896 key
You can also see what it looks like from the server side.
nodetool proxyhistograms will show you full request latency recorded by the
coordinator.
nodetool cfhistograms will show you the local read latency; this is just the
time it takes to read data on a replica and does not include network o
The fact that it's still exactly 521 seconds is very suspicious. I can't
debug your script over the mailing list, but do some sanity checks to make
sure there's not a bottleneck somewhere you don't expect.
On Fri, Jan 18, 2013 at 12:44 PM, Pradeep Kumar Mantha wrote:
> Hi,
>
> Thanks Tyler.
>
Hi,
Thanks Tyler.
Below is the *global* connection pool I am trying to use, where the
server_list contains the IPs of all 12 data nodes I am using,
pool_size is the number of threads, and I just set the timeout to 60 to
avoid connection retry errors.
pool = pycassa.ConnectionPool('Blast',
serve
You just need to increase the ConnectionPool size to handle the number of
threads you have using it concurrently. Set the pool_size kwarg to at
least the number of threads you're using.
On Thu, Jan 17, 2013 at 6:46 PM, Pradeep Kumar Mantha
wrote:
> Thanks Tyler.
>
> I just moved the pool and cf
Thanks Tyler.
I just moved the pool and cf objects, which hold the connection pool and CF
information, to global scope.
Increased the server_list values from 1 to 4. (I think I can increase
them to a max of 12 since I have 12 data nodes.)
When I created 8 threads using the Python threading package, I see
ConnectionPools and ColumnFamilies are thread-safe in pycassa, and it's
best to share them across multiple threads. Of course, when you do that,
make sure to make the ConnectionPool large enough to support all of the
threads making queries concurrently. I'm also not sure if you're just
omitting t
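Put together, the pool-sizing and sharing advice above looks roughly like this (a sketch; the keyspace 'Blast' is from this thread, while the CF name, servers, and keys are invented and assumed to exist): one module-level pool sized at least as large as the number of threads using it, with the same ColumnFamily object shared by every thread.

    import threading
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    NUM_THREADS = 32

    # One shared, module-level pool; pool_size >= number of concurrent threads.
    pool = ConnectionPool('Blast',
                          server_list=['node1:9160', 'node2:9160', 'node3:9160'],
                          pool_size=NUM_THREADS,
                          timeout=60)
    cf = ColumnFamily(pool, 'sequences')  # hypothetical CF name

    def query_worker(keys):
        # ConnectionPool and ColumnFamily are thread-safe, so every thread
        # reuses the same objects rather than building its own.
        for key in keys:
            cf.get(key)  # keys are assumed to exist in the CF

    keys = ['key-%d' % i for i in range(NUM_THREADS * 100)]
    chunks = [keys[i::NUM_THREADS] for i in range(NUM_THREADS)]
    threads = [threading.Thread(target=query_worker, args=(chunk,))
               for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()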
Hi,
Thanks. I would like to benchmark Cassandra with our application so
that we understand the details of how the actual benchmarking is done.
Not sure how easy it would be to integrate YCSB with our application,
so I am trying different client interfaces to Cassandra.
I found
for 12 Data Nod
Wow, you managed to do a load test through the cassandra-cli. There should
be a merit badge for that.
You should use the built-in stress tool or YCSB.
The CLI has to do much more string conversion than a normal client would,
and it is not built for performance. You will definitely get better number
Now that would be cool. Right now, though, too many other features need
to be added; a GUI on top of the ad-hoc query tool is the next top
priority, so one can run any S-SQL statement and ad-hoc query the heck out
of a NoSQL store.
We may even be able to optimize our queries to be even faster
Try to get Cassandra running the TPC-H benchmarks and beat Oracle :)
On Fri, Sep 7, 2012 at 10:01 AM, Hiller, Dean wrote:
> So we wrote 1,000,000 rows into cassandra and ran a simple S-SQL(Scalable
> SQL) query of
>
>
> PARTITIONS n(:partition) SELECT n FROM TABLE as n WHERE n.numShares >= :low
No argument there. Thanks for explaining what you were doing to
encrypt client traffic!
On Mon, Jan 23, 2012 at 10:11 PM, Chris Marino wrote:
> Hi Jonathan, yes, when I say 'node encryption' I mean inter-Cassandra node
> encryption. When I say 'client encryption' I mean encrypted traffic from th
Hi Jonathan, yes, when I say 'node encryption' I mean inter-Cassandra node
encryption. When I say 'client encryption' I mean encrypted traffic from
the Cassandra nodes to the clients. For these benchmarks we used the stress
test client load generator.
We ran tests with no encryption, then with 'nod
Can you elaborate on what exactly you were testing on the Cassandra
side? It sounds like what this post refers to as "node" encryption
corresponds to enabling "internode_encryption: all", but I couldn't
guess what your client encryption is since Cassandra doesn't support
that out of the box yet
sweet, that's pretty awesome :)
On Fri, Dec 30, 2011 at 8:08 PM, Jeremy Hanna wrote:
> This might be helpful:
> http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
>
> On Dec 30, 2011, at 1:59 PM, Dom Wong wrote:
>
> > Hi, could anyone tell me whether this is possible w
We did some benchmarking as well.
http://blog.vcider.com/2011/09/virtual-networks-can-run-cassandra-up-to-60-faster/
Although we were primarily interested in the networking issues
CM
On Fri, Dec 30, 2011 at 12:08 PM, Jeremy Hanna
wrote:
> This might be helpful:
> http://techblog.netflix.c
This might be helpful:
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
On Dec 30, 2011, at 1:59 PM, Dom Wong wrote:
> Hi, could anyone tell me whether this is possible with Cassandra using an
> appropriately sized EC2 cluster.
>
> 100,000 clients writing 50k each
On Mon, Oct 3, 2011 at 1:19 PM, Ramesh Natarajan wrote:
> Thanks for the pointers. I checked the system and the iostat showed that we
> are saturating the disk to 100%. The disk is SCSI device exposed by ESXi and
> it is running on a dedicated lun as RAID10 (4 600GB 15k drives) connected to
> ESX
Yes, look at cassandra.yaml; there is a section about throttling compaction.
You still *want* multi-threaded compaction. Throttling will occur across all
threads. The reason is that you don't want to get stuck compacting
bigger files while the smaller ones build up waiting for bigger compactio
Thanks for the pointers. I checked the system and the iostat showed that we
are saturating the disk to 100%. The disk is a SCSI device exposed by ESXi, and
it is running on a dedicated lun as RAID10 (4 600GB 15k drives) connected to
ESX host via iSCSI.
When I run compactionstats I see we are compact
Most likely what is happening is that you are running single-threaded
compaction. Look at cassandra.yaml for how to enable multi-threaded
compaction. As more data comes into the system, bigger files get created
during compaction. You could be in a situation where you might be compacting
at a hi
In order to understand what's going on, you might want to first do just the
write test, look at the results, then do just the read tests, and
then do combined read/write tests.
Since you mentioned high updates/deletes, I should also ask: what is your CL for
writes/reads? With high updates/deletes + high CL, I think
I will start another test run to collect these stats. Our test model is in
the neighborhood of 4500 inserts, 8000 updates & deletes, and 1500 reads every
second across 6 servers.
Can you elaborate more on reducing the heap space? Do you think it is a
problem with 17G RSS?
thanks
Ramesh
On Mon, Oc
I am wondering if you are seeing issues because of more frequent
compactions kicking in. Is this primarily write ops or reads too?
During the period of test gather data like:
1. cfstats
2. tpstats
3. compactionstats
4. netstats
5. iostat
You have RSS memory close to 17GB. Maybe someone can give f
Maybe try row cache?
Have you enabled mlock? (Needs jna.jar, and set ulimit -l.)
Using iostat -x would also give you more clues as to disk performance.
On Mon, Oct 3, 2011 at 10:12 AM, Ramesh Natarajan wrote:
> I am running a cassandra cluster of 6 nodes running RHEL6 virtualized by
> ESX
We have 5 CFs. Attached is the output from the describe command. We don't
have row cache enabled.
Thanks
Ramesh
Keyspace: MSA:
Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
Durable Writes: true
Options: [replication_factor:3]
Column Families:
ColumnFamily: admin
On Mon, Oct 3, 2011 at 10:12 AM, Ramesh Natarajan wrote:
> I am running a cassandra cluster of 6 nodes running RHEL6 virtualized by
> ESXi 5.0. Each VM is configured with 20GB of ram and 12 cores. Our test
> setup performs about 3000 inserts per second. The cassandra data partition
> is on a X
On Sat, Sep 18, 2010 at 9:26 AM, Peter Schuller
wrote:
>> - performance (it should not be much less than a shard of MySQL and
>> scale linearly; we want to have not more than 10K inserts per second
>> of writes, and probably not more than 1K/s reads, which will be mostly
>> random)
>> - ability
> - performance (it should not be much less than a shard of MySQL and
> scale linearly; we want to have not more than 10K inserts per second
> of writes, and probably not more than 1K/s reads, which will be mostly
> random)
> - ability to store big amounts of data (right now it looks like we will
> hav
Hi,
first of all, I am not a Cassandra hater :) and I do not expect miracles either :)
I'm looking for a scalable solution which could be
used instead of a sharding solution over MySQL or Tokyo Tyrant. Our
system now runs OK on a single Tokyo Tyrant DB but we expect a lot of
traffic increase i
> Disabling row cache in this case makes sense, but disabling key cache
> is probably hurting your performance quite a bit. If you wrote 20GB
> of data per node, with narrow rows as you describe, and had default
> memtable settings, you now have a huge number of sstables on disk.
> You did not ind
It appears you are doing several things that ensure terrible
performance, so I am not surprised you are getting it.
On Tue, Sep 14, 2010 at 3:40 PM, Kamil Gorlo wrote:
> My main tool was stress.py for benchmarks (or equivalent written in
> C++ to deal with python2.5 lack of multiprocessing). I wi
> durable and rich data model. It will not provide you with high performance;
> reading performance especially is poor.
Note that for several realistic workloads, the above claim is most
definitely wrong. For example, for large databases with a mix of
insertions/deletions (so that the MySQL case doe
http://www.quora.com/Is-Cassandra-to-blame-for-Digg-v4s-technical-failures
On Sep 17, 2010, at 4:35 PM, Zhong Li wrote:
> This is my personal experience. MySQL is faster than Cassandra in most
> normal use cases.
>
> You should understand why you choose Cassandra instead of MySQL. If one
>
This is my personal experience. MySQL is faster than Cassandra in
most normal use cases.
You should understand why you choose Cassandra instead of MySQL. If
one central MySQL can handle your workload, MySQL is better than
Cassandra. BUT if you are overloading one MySQL and want multiple boxes
If MySQL is faster, then use it. I struggled to do side-by-side comparisons
with MySQL for months until finally realizing they are too different to do
side-by-side comparisons. MySQL is always faster out of the gate when you
come at the problem thinking in terms of relational databases. Add in
repli
> But to be honest I'm pretty disappointed that Cassandra doesn't really
> scale linearly (or "semi-linearly" :)) when adding new machines. I
It really should scale linearly for this workload unless I have missed
something important (in which case I hope someone will chime in). But
note that you a
Kamil Gorlo gmail.com> writes:
>
> So I've got more reads from single MySQL with 400GB of data than from
> 8 machines storing about 266GB. This doesn't look good. What am I
> doing wrong? :)
The worst case for Cassandra is random reads. You should ask yourself a question:
do you really have this
Hello,
On Wed, Sep 15, 2010 at 3:53 AM, Jonathan Ellis wrote:
> The key is that while Cassandra may read less rows per second than
> MySQL when you are i/o bound (as you are here) because of SSTable
> merging (see http://wiki.apache.org/cassandra/MemtableSSTable), you
> should be using your Cassa
Hello,
On Wed, Sep 15, 2010 at 3:45 AM, Chen Xinli wrote:
[cut]
>>
> Disabling row cache is OK, but key cache should be enabled. It uses little
> memory, but read performance will improve a lot.
Hmm, I've tested with key cache enabled (100%) and I am pretty sure
that this really doesn't help si
The key is that while Cassandra may read fewer rows per second than
MySQL when you are i/o bound (as you are here) because of SSTable
merging (see http://wiki.apache.org/cassandra/MemtableSSTable), you
should be using your Cassandra rows as materialized views so that each
query is a single row looku
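In other words, precompute the answer at write time so the read side is one row lookup; a hedged pycassa illustration (all names invented):

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKS', server_list=['localhost:9160'])
    # A "materialized view" row: all of a user's items stored as columns under
    # one key, maintained at write time.
    items_by_user = ColumnFamily(pool, 'items_by_user')

    def record_item(user_id, item_id, payload):
        items_by_user.insert(user_id, {item_id: payload})

    def recent_items(user_id, limit=50):
        # The whole query is one row read -- no joins, no index scan.
        return items_by_user.get(user_id, column_count=limit)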
2010/9/15 Kamil Gorlo
> Hey,
>
> we are considering using Cassandra for quite large project and because
> of that I made some tests with Cassandra. I was testing performance
> and stability mainly.
>
> My main tool was stress.py for benchmarks (or equivalent written in
> C++ to deal with python2.