Hi Jack,

> So, your 1GB input size means roughly 716 thousand rows of data and 128GB
> means roughly 92 million rows, correct?

Yes, that's correct.

> Are your gets and searches returning single rows, or a significant number of
> rows?

As I mentioned in my first email, get() always returns a single row, while
search() returns a variable number of rows; the number returned ranges from
1 to 4,000.
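To make the setup concrete, the table and queries have roughly the following
shape. This is only an illustrative sketch: the names are made up, and the
real table has 105 columns, of which I show just a few.

    -- Cut-down, hypothetical version of the benchmark schema.
    CREATE TABLE bench.records (
        id   bigint PRIMARY KEY,  -- single-column key, so one row per partition in this sketch
        name text,
        addr text
        -- ... remaining columns elided ...
    );

    -- A secondary index is created on every queried column, e.g.:
    CREATE INDEX records_name_idx ON bench.records (name);

    -- get(): fetch an entire row for a given primary key.
    SELECT * FROM bench.records WHERE id = 12345;

    -- search(): fetch the primary keys of all rows where an indexed
    -- column matches a given value; in these benchmarks this returns
    -- anywhere from 1 to 4,000 rows.
    SELECT id FROM bench.records WHERE name = 'John Smith';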
> -- Jack Krupansky
>
>> On Thu, Jan 14, 2016 at 4:43 PM, Anurag Khandelwal <anur...@berkeley.edu>
>> wrote:
>> To clarify: Input size is the size of the dataset as a CSV file, before
>> loading it into Cassandra; for each input size, the number of columns is
>> fixed but the number of rows is different. By a 1.5KB record, I meant that
>> each row, when represented as a CSV entry, occupies 1500 bytes. I've used
>> the terms "row" and "record" interchangeably, which might have been the
>> source of some confusion.
>>
>> I'll run the stress tool and report the results as well; the hardware is
>> whatever AWS provides for a c3.8xlarge EC2 instance.
>>
>> Anurag
>>
>>> On Jan 14, 2016, at 1:33 PM, Jack Krupansky <jack.krupan...@gmail.com>
>>> wrote:
>>>
>>> What exactly is "input size" here (1GB to 128GB)? I mean, the test spec
>>> says "The dataset used comprises ~1.5KB records... there are 105
>>> attributes in each record." Does each test run have exactly the same
>>> number of rows and columns, and you're just making each column bigger,
>>> or what?
>>>
>>> Cassandra doesn't have "records", so are you really saying that you
>>> store 1,500-byte rows? Is it one row per partition, or do you have
>>> clustering?
>>>
>>> What are you actually trying to measure? (Some more context would help.)
>>>
>>> In any case, a latency of 200ms (5 per second) for your search query
>>> seems rather low, but we need some clarity on input size.
>>>
>>> If you just run the Cassandra stress tool on your hardware, what kinds
>>> of numbers do you get? That should be the starting point for any
>>> benchmarking: establish how your hardware performs on basic requests
>>> before you layer your own data modeling on top of that.
>>>
>>> -- Jack Krupansky
>>>
>>>> On Thu, Jan 14, 2016 at 4:02 PM, Jonathan Haddad <j...@jonhaddad.com>
>>>> wrote:
>>>> I think you actually get a really useful metric by benchmarking 1
>>>> machine. You understand your cluster's theoretical maximum performance,
>>>> which would be roughly (number of nodes) * (single-node throughput).
>>>> Yes, adding in replication and CL is important, but 1 machine lets you
>>>> isolate certain performance metrics.
>>>>
>>>>> On Thu, Jan 14, 2016 at 12:23 PM Robert Wille <rwi...@fold3.com> wrote:
>>>>> I disagree. I think that you can extrapolate very little information
>>>>> about RF>1 and CL>1 by benchmarking with RF=1 and CL=1.
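For reference, since RF and CL keep coming up in this thread: the
replication factor is fixed per keyspace at creation time, while the
consistency level is chosen per request by the client. A minimal sketch of
the RF=1, Consistency=1 setup under discussion, with a hypothetical keyspace
name:

    -- Hypothetical keyspace definition for the RF=1 setup discussed here.
    CREATE KEYSPACE bench
        WITH replication = {'class': 'SimpleStrategy',
                            'replication_factor': 1};

    -- Consistency is not part of the schema: the client sets it per
    -- request (e.g. CONSISTENCY ONE in cqlsh), which is what
    -- "Consistency=1" refers to in this thread.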
>>>>>> On Jan 13, 2016, at 8:41 PM, Anurag Khandelwal <anur...@berkeley.edu>
>>>>>> wrote:
>>>>>>
>>>>>> Hi John,
>>>>>>
>>>>>> Thanks for responding!
>>>>>>
>>>>>> The aim of this benchmark was not to benchmark Cassandra as an
>>>>>> end-to-end distributed system, but to understand a breakdown of the
>>>>>> performance. For instance, if we understand the performance
>>>>>> characteristics we can expect from a single-machine Cassandra
>>>>>> instance with RF=Consistency=1, we can form a good estimate of what
>>>>>> the distributed performance with higher replication factors and
>>>>>> consistency levels is going to look like. Even in the ideal case, the
>>>>>> performance improvement would scale at most linearly with more
>>>>>> machines and replicas.
>>>>>>
>>>>>> That being said, I still want to understand whether this is the
>>>>>> performance I should expect for the setup I described; if the
>>>>>> performance for the current setup can be improved, then clearly the
>>>>>> performance for a production setup (with multiple nodes and replicas)
>>>>>> would also improve. Does that make sense?
>>>>>>
>>>>>> Thanks!
>>>>>> Anurag
>>>>>>
>>>>>>> On Jan 6, 2016, at 9:31 AM, John Schulz <sch...@pythian.com> wrote:
>>>>>>>
>>>>>>> Anurag,
>>>>>>>
>>>>>>> Unless you are planning on continuing to use only one machine with
>>>>>>> RF=1, benchmarking a single system using RF=Consistency=1 is mostly
>>>>>>> a waste of time. If you are going to use RF=1 and a single host,
>>>>>>> then why use Cassandra at all? Plain old relational DBs should do
>>>>>>> the job just fine. Cassandra is designed to be distributed. You
>>>>>>> won't get the full impact of how it scales, and the limits on
>>>>>>> scaling, unless you benchmark a distributed system. For example, the
>>>>>>> scaling impact of secondary indexes will not be visible on a single
>>>>>>> node.
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>>> On Tue, Jan 5, 2016 at 3:16 PM, Anurag Khandelwal
>>>>>>>> <anur...@berkeley.edu> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've been benchmarking Cassandra to get an idea of how the
>>>>>>>> performance scales with more data on a single machine. I just
>>>>>>>> wanted to get some feedback on whether these are the numbers I
>>>>>>>> should expect.
>>>>>>>>
>>>>>>>> The benchmarks are quite simple: I measure the latency and
>>>>>>>> throughput for two kinds of queries:
>>>>>>>>
>>>>>>>> 1. get() queries - These fetch an entire row for a given primary
>>>>>>>> key.
>>>>>>>> 2. search() queries - These fetch all the primary keys for rows
>>>>>>>> where a particular column matches a particular value (e.g., "name"
>>>>>>>> is "John Smith").
>>>>>>>>
>>>>>>>> Indexes are constructed for all columns that are queried.
>>>>>>>>
>>>>>>>> Dataset
>>>>>>>>
>>>>>>>> The dataset used comprises ~1.5KB records (on average) when
>>>>>>>> represented as CSV; there are 105 attributes in each record.
>>>>>>>>
>>>>>>>> Queries
>>>>>>>>
>>>>>>>> For get() queries, randomly generated primary keys are used.
>>>>>>>>
>>>>>>>> For search() queries, column values are selected such that their
>>>>>>>> total number of occurrences in the dataset is between 1 and 4,000.
>>>>>>>> For example, a query for "name" = "John Smith" would only be
>>>>>>>> performed if the number of rows containing that value lies between
>>>>>>>> 1 and 4,000.
>>>>>>>>
>>>>>>>> The results for the benchmarks are provided below:
>>>>>>>>
>>>>>>>> Latency Measurements
>>>>>>>>
>>>>>>>> The latency measurements are an average over 10,000 queries.
>>>>>>>>
>>>>>>>> [charts omitted]
>>>>>>>>
>>>>>>>> Throughput Measurements
>>>>>>>>
>>>>>>>> The throughput measurements were repeated for 1-16 client threads,
>>>>>>>> and the number reported for each input size is for the
>>>>>>>> configuration (i.e., # client threads) with the highest throughput.
>>>>>>>>
>>>>>>>> [charts omitted]
>>>>>>>>
>>>>>>>> Any feedback here would be greatly appreciated!
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Anurag
>>>>>>>
>>>>>>> --
>>>>>>> John H. Schulz
>>>>>>> Principal Consultant
>>>>>>> Pythian - Love your data
>>>>>>>
>>>>>>> sch...@pythian.com | LinkedIn:
>>>>>>> www.linkedin.com/pub/john-schulz/13/ab2/930/
>>>>>>> Mobile: 248-376-3380
>>>>>>> www.pythian.com