Re: Cassandra Performance on a Single Machine

Robert Wille Thu, 14 Jan 2016 12:24:16 -0800

I disagree. I think that you can extrapolate very little information about RF>1 
and CL>1 by benchmarking with RF=1 and CL=1.

On Jan 13, 2016, at 8:41 PM, Anurag Khandelwal <anur...@berkeley.edu> wrote:

Hi John,

Thanks for responding!

The aim of this benchmark was not to benchmark Cassandra as an end-to-end
distributed system, but to understand a breakdown of the performance. For
instance, if we understand the performance characteristics that we can expect
from a single-machine Cassandra instance with RF=Consistency=1, we can have a
good estimate of what the distributed performance with higher replication
factors and consistency levels is going to look like. Even in the ideal case,
the performance improvement would scale at most linearly with more machines
and replicas.

That being said, I still want to understand whether this is the performance I
should expect for the setup I described; if the performance for the current
setup can be improved, then clearly the performance for a production setup
(with multiple nodes, replicas) would also improve. Does that make sense?

Thanks!
Anurag

On Jan 6, 2016, at 9:31 AM, John Schulz <sch...@pythian.com> wrote:

Anurag,

Unless you are planning on continuing to use only one machine with RF=1,
benchmarking a single system using RF=Consistency=1 is mostly a waste of time.
If you are going to use RF=1 and a single host, then why use Cassandra at all?
Plain old relational databases should do the job just fine.

Cassandra is designed to be distributed. You won't get the full impact of how
it scales and the limits on scaling unless you benchmark a distributed system.
For example, the scaling impact of secondary indexes will not be visible on a
single node.

John

On Tue, Jan 5, 2016 at 3:16 PM, Anurag Khandelwal <anur...@berkeley.edu> wrote:
Hi,

I’ve been benchmarking Cassandra to get an idea of how the performance scales
with more data on a single machine. I just wanted to get some feedback on
whether these are the numbers I should expect.

The benchmarks are quite simple — I measure the latency and throughput for two
kinds of queries:

1. get() queries - These fetch an entire row for a given primary key.
2. search() queries - These fetch all the primary keys for rows where a
particular column matches a particular value (e.g., “name” is “John Smith”).

Indexes are constructed for all columns that are queried.
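
The two query types and the index setup described above could be expressed as
CQL roughly as follows. This is only a sketch: the table name `records` and the
key column `id` are hypothetical, since the post does not give the schema, and
`%s` simply stands for a bound parameter.

```python
# Hypothetical schema names: the original post does not state them.
TABLE = "records"
KEY_COLUMN = "id"


def create_index_cql(column):
    """CQL that builds a secondary index on one queried column."""
    return f"CREATE INDEX IF NOT EXISTS ON {TABLE} ({column})"


def get_cql():
    """get(): fetch the entire row for a given primary key."""
    return f"SELECT * FROM {TABLE} WHERE {KEY_COLUMN} = %s"


def search_cql(column):
    """search(): fetch the primary keys of rows where `column` equals a value."""
    return f"SELECT {KEY_COLUMN} FROM {TABLE} WHERE {column} = %s"
```

For instance, `search_cql("name")` with the parameter "John Smith" corresponds
to the example search() query above.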

Dataset

The dataset consists of records that average ~1.5 KB each when represented
as CSV; there are 105 attributes in each record.

Queries

For get() queries, randomly generated primary keys are used.

For search() queries, column values are selected such that their total number
of occurrences in the dataset is between 1 and 4000. For example, a query for
“name” = “John Smith” would only be performed if the number of rows that
contain that value lies between 1 and 4000.
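
A minimal sketch of that selection rule, assuming the values of one column are
available as an in-memory list. The function name `eligible_values` and its
parameters are illustrative, not taken from the original benchmark code.

```python
from collections import Counter


def eligible_values(column_values, min_count=1, max_count=4000):
    """Return the distinct values whose occurrence count is in [min_count, max_count]."""
    counts = Counter(column_values)
    # Counter keeps first-seen order, so the result order is deterministic.
    return [value for value, n in counts.items() if min_count <= n <= max_count]
```

A value that appears 5000 times in the column would be excluded, while one that
appears once or 4000 times would be kept as a candidate search() query.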

The results for the benchmarks are provided below:

Latency Measurements

Each latency measurement is an average over 10000 queries.
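
One way such an average could be computed is sketched below. `run_query` is a
hypothetical callable that issues a single query and blocks until the result
arrives; the post does not say how the timing was actually done.

```python
import time


def average_latency(run_query, queries):
    """Run each query once, sequentially, and return the mean latency in seconds."""
    total = 0.0
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        total += time.perf_counter() - start
    return total / len(queries)
```

Passing a list of 10000 queries reproduces the averaging described above.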

Throughput Measurements

The throughput measurements were repeated for 1-16 client threads, and the
number reported for each input size is for the configuration (i.e., number of
client threads) with the highest throughput.
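
The procedure above might look something like this. It is a sketch under
assumptions: `run_query` is a hypothetical blocking callable that executes one
query, and a thread pool stands in for whatever client threading the original
benchmark used.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(run_query, queries, num_threads):
    """Return queries per second when `queries` are run by `num_threads` threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list() forces all queries to finish before the clock is stopped.
        list(pool.map(run_query, queries))
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed


def best_throughput(run_query, queries, max_threads=16):
    """Try 1..max_threads client threads; return (thread count, best throughput)."""
    results = {
        n: measure_throughput(run_query, queries, n)
        for n in range(1, max_threads + 1)
    }
    best = max(results, key=results.get)
    return best, results[best]
```

Reporting only the best configuration, as the post does, hides how throughput
varies with the thread count, so keeping the full `results` mapping may be
useful when comparing runs.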

Any feedback here would be greatly appreciated!

Thanks!
Anurag

John H. Schulz

Principal Consultant

Pythian - Love your data

sch...@pythian.com | LinkedIn: http://www.linkedin.com/pub/john-schulz/13/ab2/930/

Mobile: 248-376-3380
