Hi Jack,

> So, your 1GB input size means roughly 716 thousand rows of data and 128GB 
> means roughly 92 million rows, correct?

Yes, that's correct.
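For anyone following along, those row counts fall straight out of the ~1.5KB-per-row figure. A quick back-of-the-envelope check (assuming 1GB = 2^30 bytes and a flat 1,500 bytes per row):

```python
# Sanity check of the row counts quoted above, assuming ~1,500 bytes
# per row (CSV representation) and 1GB = 2**30 bytes.
ROW_BYTES = 1500

def rows_for(gib: int) -> int:
    """Approximate number of rows in a dataset of `gib` gibibytes."""
    return (gib * 2**30) // ROW_BYTES

print(rows_for(1))    # ~716 thousand rows
print(rows_for(128))  # ~92 million rows
```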

> Are your gets and searches returning single rows, or a significant number of 
> rows?

As I mentioned in my first email, get always returns a single row, and search 
returns a variable number of rows, anywhere from 1 to 4,000.
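For concreteness, the two query shapes look roughly like this in CQL (the table and column names here are made up for illustration; the real schema has 105 columns):

```sql
-- Hypothetical schema names; the real table has 105 columns.
-- get(): fetch an entire row by its primary key.
SELECT * FROM records WHERE id = ?;

-- search(): fetch the primary keys of rows where a secondary-indexed
-- column matches a given value.
CREATE INDEX IF NOT EXISTS records_name_idx ON records (name);
SELECT id FROM records WHERE name = 'John Smith';
```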

> -- Jack Krupansky
> 
>> On Thu, Jan 14, 2016 at 4:43 PM, Anurag Khandelwal <anur...@berkeley.edu> 
>> wrote:
>> To clarify: Input size is the size of the dataset as a CSV file, before 
>> loading it into Cassandra; for each input size, the number of columns is 
>> fixed but the number of rows is different. By 1.5KB record, I meant that 
>> each row, when represented as a CSV entry, occupies 1500 bytes. I've used 
>> the terms "row" and "record" interchangeably, which might have been the 
>> source of some confusion.
>> 
>> I'll run the stress tool and report the results as well; the hardware is 
>> whatever AWS provides for c3.8xlarge EC2 instance.
>> 
>> Anurag
>> 
>>> On Jan 14, 2016, at 1:33 PM, Jack Krupansky <jack.krupan...@gmail.com> 
>>> wrote:
>>> 
>>> What exactly is "input size" here (1GB to 128GB)? I mean, the test spec 
>>> says "The dataset used comprises ~1.5KB records... there are 105 attributes 
>>> in each record." Does each test run have exactly the same number of rows 
>>> and columns and you're just making each column bigger, or what?
>>> 
>>> Cassandra doesn't have "records", so are you really saying that each of 
>>> your rows is 1,500 bytes? Is it one row per partition or do you have clustering?
>>> 
>>> What are you actually trying to measure? (Some more context would help.)
>>> 
>>> In any case, 200ms latency (5 queries per second) for your search query 
>>> seems rather slow, but we need some clarity on input size.
>>> 
>>> If you just run the Cassandra stress tool on your hardware, what kinds of 
>>> numbers do you get? That should be the starting point for any benchmarking 
>>> - establish how your hardware performs processing basic requests before you 
>>> layer your own data modeling on top of that.
>>> 
>>> -- Jack Krupansky
>>> 
>>>> On Thu, Jan 14, 2016 at 4:02 PM, Jonathan Haddad <j...@jonhaddad.com> 
>>>> wrote:
>>>> I think you actually get a really useful metric by benchmarking 1 machine. 
>>>>  You understand your cluster's theoretical maximum performance, which 
>>>> would be nodes * per-node throughput.  Yes, adding in replication and CL is 
>>>> important, but 1 machine lets you isolate certain performance metrics. 
>>>> 
>>>>> On Thu, Jan 14, 2016 at 12:23 PM Robert Wille <rwi...@fold3.com> wrote:
>>>>> I disagree. I think that you can extrapolate very little information 
>>>>> about RF>1 and CL>1 by benchmarking with RF=1 and CL=1.
>>>>> 
>>>>>> On Jan 13, 2016, at 8:41 PM, Anurag Khandelwal <anur...@berkeley.edu> 
>>>>>> wrote:
>>>>>> 
>>>>>> Hi John,
>>>>>> 
>>>>>> Thanks for responding!
>>>>>> 
>>>>>> The aim of this benchmark was not to benchmark Cassandra as an 
>>>>>> end-to-end distributed system, but to understand a breakdown of the 
>>>>>> performance. For instance, if we understand the performance 
>>>>>> characteristics that we can expect from a single-machine Cassandra 
>>>>>> instance with RF=Consistency=1, we can have a good estimate of what the 
>>>>>> distributed performance with higher replication factors and consistency 
>>>>>> levels is going to look like. Even in the ideal case, the performance 
>>>>>> improvement would scale at most linearly with more machines and replicas.
>>>>>> 
>>>>>> That being said, I still want to understand whether this is the 
>>>>>> performance I should expect for the setup I described; if the 
>>>>>> performance for the current setup can be improved, then clearly the 
>>>>>> performance for a production setup (with multiple nodes, replicas) would 
>>>>>> also improve. Does that make sense?
>>>>>> 
>>>>>> Thanks!
>>>>>> Anurag
>>>>>> 
>>>>>>> On Jan 6, 2016, at 9:31 AM, John Schulz <sch...@pythian.com> wrote:
>>>>>>> 
>>>>>>> Anurag,
>>>>>>> 
>>>>>>> Unless you are planning on continuing to use only one machine with RF=1, 
>>>>>>> benchmarking a single system using RF=Consistency=1 is mostly a waste 
>>>>>>> of time. If you are going to use RF=1 and a single host, then why use 
>>>>>>> Cassandra at all? A plain old relational DB should do the job just fine.
>>>>>>> Cassandra is designed to be distributed. You won't get the full impact 
>>>>>>> of how it scales and the limits on scaling unless you benchmark a 
>>>>>>> distributed system. For example, the scaling impact of secondary indexes 
>>>>>>> will not be visible on a single node.
>>>>>>> 
>>>>>>> John
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Tue, Jan 5, 2016 at 3:16 PM, Anurag Khandelwal 
>>>>>>>> <anur...@berkeley.edu> wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I’ve been benchmarking Cassandra to get an idea of how the performance 
>>>>>>>> scales with more data on a single machine. I just wanted some 
>>>>>>>> feedback on whether these are the numbers I should expect.
>>>>>>>> 
>>>>>>>> The benchmarks are quite simple — I measure the latency and throughput 
>>>>>>>> for two kinds of queries:
>>>>>>>> 
>>>>>>>> 1. get() queries - These fetch an entire row for a given primary key.
>>>>>>>> 2. search() queries - These fetch all the primary keys for rows where 
>>>>>>>> a particular column matches a particular value (e.g., “name” is “John 
>>>>>>>> Smith”). 
>>>>>>>> 
>>>>>>>> Indexes are constructed for all columns that are queried.
>>>>>>>> 
>>>>>>>> Dataset
>>>>>>>> 
>>>>>>>> The dataset used comprises ~1.5KB records (on average) when 
>>>>>>>> represented as CSV; there are 105 attributes in each record.
>>>>>>>> 
>>>>>>>> Queries
>>>>>>>> 
>>>>>>>> For get() queries, randomly generated primary keys are used.
>>>>>>>> 
>>>>>>>> For search() queries, column values are selected such that their total 
>>>>>>>> number of occurrences in the dataset is between 1 and 4,000. For example, 
>>>>>>>> a query for “name” = “John Smith” would only be performed if the 
>>>>>>>> number of rows containing that value lies between 1 and 4,000.
>>>>>>>> 
>>>>>>>> The results for the benchmarks are provided below:
>>>>>>>> 
>>>>>>>> Latency Measurements
>>>>>>>> 
>>>>>>>> The latency measurements are an average of 10000 queries.
>>>>>>>> 
>>>>>>>> Throughput Measurements
>>>>>>>> 
>>>>>>>> The throughput measurements were repeated for 1-16 client threads, and 
>>>>>>>> the numbers reported for each input size are for the configuration 
>>>>>>>> (i.e., # client threads) with the highest throughput.
>>>>>>>> 
>>>>>>>> Any feedback here would be greatly appreciated!
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> Anurag
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> John H. Schulz
>>>>>>> Principal Consultant
>>>>>>> Pythian - Love your data
>>>>>>> 
>>>>>>> sch...@pythian.com |  Linkedin 
>>>>>>> www.linkedin.com/pub/john-schulz/13/ab2/930/
>>>>>>> Mobile: 248-376-3380
>>>>>>> www.pythian.com
>>>>>>> 
>>>>>>> --
>>>>>>> 
> 
