Yes, it is expected that writes are CPU-bound.

On Fri, Jun 11, 2010 at 11:29 AM, Rishi Bhardwaj <khichri...@yahoo.com> wrote:
> I think it would be a good exercise to know what the CPU bottleneck is
> on the write path. The fact that Cassandra optimizes disk I/O for
> writes would only go so far if the CPU becomes a big bottleneck on
> continuous writes. I am fairly new to Java ecosystem performance
> profiling but I would give it a try and see if I can pinpoint the
> problem area here. I am also thinking about making concurrent writes
> to Cassandra instead of only one write at a time. This would probably
> make Cassandra beat the hell out of all CPU resources and confirm that
> Cassandra is CPU-bound on continuous writes.
> Again, I would love to hear from Cassandra experts here and see what
> they think of this. Are Cassandra continuous bulk writes expected to
> be bottlenecked by CPU? If this is definitely the case, and that's
> what it seems right now, then it would be a good thing to look at the
> algorithms in the write path.
> Thanks,
> Rishi
> ________________________________
> From: Mike Malone <m...@simplegeo.com>
> To: user@cassandra.apache.org
> Sent: Fri, June 11, 2010 9:20:06 AM
> Subject: Re: Cassandra Write Performance, CPU usage
>
> Jonathan, while I agree with you re: this being an unusual load for
> the system, it is interesting that he's found at least one use-case
> where Cassandra is CPU-bound, not IO-bound. I'd definitely be
> interested in learning what his critical path is and seeing if there's
> some low-hanging fruit that may improve performance overall. I have
> also noticed very high CPU usage during high write loads and have
> wondered whether write speed and throughput could be improved by
> improving some of the algorithms along that path.
> I'm nowhere near being an expert on the whole Java ecosystem, but I've
> had good luck with the `jvisualvm` tool that comes with Java SE 6.
> It's a nice lightweight CPU and memory profiling tool that can attach
> to a running process like Cassandra and dump stats in real time.
> Mike
>
> On Thu, Jun 10, 2010 at 7:39 PM, Jonathan Shook <jsh...@gmail.com> wrote:
>>
>> You are testing Cassandra in a way that it was not designed to be
>> used. Bandwidth to disk is not a meaningful example for nearly
>> anything except for filesystem benchmarking and things very nearly
>> the same as filesystem benchmarking.
>> Unless the usage patterns of your application match your test data,
>> there is not a good reason to expect a strong correlation between
>> this test and actual performance.
>>
>> Cassandra is not simply shuffling data through IO when you write.
>> There are calculations that have to be done as writes filter their
>> way through various stages of processing. The point of this is to
>> minimize the overall effort Cassandra has to make in order to
>> retrieve the data again. One example would be bloom filters. Each
>> column that is written requires bloom filter processing and
>> potentially auxiliary IO. Some of these steps are allowed to happen
>> in the background, but if you try, you can cause them to stack up on
>> top of the available CPU and memory resources.
>>
>> In such a case (continuous bulk writes), you are causing all of these
>> costs to be taken in more of a synchronous (not delayed) fashion. You
>> are not allowing the background processing that helps reduce client
>> blocking (by deferring some processing) to do its magic.
>>
>> On Thu, Jun 10, 2010 at 7:42 PM, Rishi Bhardwaj <khichri...@yahoo.com>
>> wrote:
>> > Hi
>> > I am investigating Cassandra write performance and see very heavy
>> > CPU usage from Cassandra. I have a single node Cassandra instance
>> > running on a dual core (2.66 GHz Intel) Ubuntu 9.10 server. The
>> > writes to Cassandra are being generated from the same server using
>> > BatchMutate(). The client makes exactly one RPC call at a time to
>> > Cassandra. Each BatchMutate() RPC contains 2 MB of data and once it
>> > is acknowledged by Cassandra, the next RPC is done. Cassandra has
>> > two separate disks, one for commitlog with a sequential b/w of
>> > 130MBps and the other a solid state disk for data with b/w of
>> > 90MBps. Tuning various parameters, I observe that I am able to
>> > attain a maximum write performance of about 45 to 50 MBps from
>> > Cassandra. I see that the Cassandra java process consistently uses
>> > 100% to 150% of CPU resources (as shown by top) during the entire
>> > write operation. Also, iostat clearly shows that the max disk
>> > bandwidth is not reached anytime during the write operation; every
>> > now and then the i/o activity on the "commitlog" disk and the data
>> > disk spikes, but it is never consistently maintained by Cassandra
>> > close to their peak. I would imagine that the CPU is probably the
>> > bottleneck here. Does anyone have any idea why Cassandra beats the
>> > heck out of the CPU here? Any suggestions on how to go about
>> > finding the exact bottleneck here?
>> > Some more information about the writes: I have 2 column families,
>> > the data though is mostly written in one column family with column
>> > sizes of around 32k and each row having around 256 or 512 columns.
>> > I would really appreciate any help here.
>> > Thanks,
>> > Rishi
--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
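
For readers following the thread, the bloom filter work that Jonathan Shook mentions can be illustrated with a small sketch. This is a generic Bloom filter in Python, not Cassandra's implementation: the class name, the bit-array size, the number of hashes, and the SHA-256 double-hashing scheme are all assumptions chosen for illustration. It only aims to show that every key added costs several hash computations and bit updates, which is CPU work that does not depend on disk bandwidth.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter sketch (hypothetical; not Cassandra's code)."""

    def __init__(self, num_bits=8192, num_hashes=5):
        # num_bits and num_hashes are illustrative values, not tuned ones.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Take two 64-bit values from one SHA-256 digest and combine them
        # (double hashing) to derive num_hashes bit positions for the key.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        # Every insert sets num_hashes bits: pure CPU and memory work.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = BloomFilter()
# Row width taken from the original post (around 512 columns per row);
# the key format is made up for the example.
for i in range(512):
    bf.add(b"column-%d" % i)
```

A Bloom filter never reports a false negative, so every key added above is found again by `might_contain`; keys that were never added may occasionally be reported as present.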