Hi Paolo,

Thanks for the hint - JNA indeed wasn't installed. However, now that Cassandra is actually using it, there doesn't seem to be any change in speed: still 7 seconds with pycassa.
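For reference, here is a minimal sketch of the smaller-batch read Tyler recommended further down the thread. It is not the script from his gist, and it does not talk to Cassandra: `fetch_slice`, `make_fake_row`, `read_in_batches` and `throughput_kb_per_s` are hypothetical names, and the row is an in-memory stand-in so the paging logic can be run anywhere. The sketch assumes slices are start-inclusive and returned in sorted order, so each batch after the first re-reads the previous last column and drops it.

```python
import bisect
import time


def make_fake_row(n_columns, bytes_per_column=800):
    """Build a hypothetical fetch_slice(start, count) over an in-memory row.

    Stand-in for a real client call; with pycassa this would presumably wrap
    something like cf.get(key, column_start=start, column_count=count),
    which is an assumption and is not exercised here.
    """
    names = ["sc%06d" % i for i in range(n_columns)]
    payload = "x" * bytes_per_column

    def fetch_slice(start, count):
        # Start-inclusive slice of at most `count` columns, in sorted order.
        lo = bisect.bisect_left(names, start)
        return [(name, payload) for name in names[lo:lo + count]]

    return fetch_slice


def read_in_batches(fetch_slice, batch_size=1000):
    """Yield every (name, value) pair of a wide row, batch_size at a time."""
    start = ""
    first = True
    while True:
        # After the first batch, ask for one extra column and drop the
        # leading one, because it was already returned by the previous batch.
        count = batch_size if first else batch_size + 1
        batch = fetch_slice(start, count)
        if not first:
            batch = batch[1:]
        if not batch:
            return
        for item in batch:
            yield item
        if len(batch) < batch_size:
            return  # short batch: the row is exhausted
        start = batch[-1][0]
        first = False


def throughput_kb_per_s(n_columns, bytes_per_column, seconds):
    """Approximate read throughput in kB/s (1 kB = 1000 bytes)."""
    return n_columns * bytes_per_column / 1000.0 / seconds


if __name__ == "__main__":
    fetch = make_fake_row(5000)
    t0 = time.time()
    n_read = sum(1 for _ in read_in_batches(fetch, batch_size=1000))
    elapsed = time.time() - t0
    print("read %d super columns in %.3f s" % (n_read, elapsed))
    # Figures from this thread: 5000 super columns of ~800 bytes in ~7 s.
    print("%.0f kB/s" % throughput_kb_per_s(5000, 800, 7.0))
```

With the numbers from this thread, 5000 super columns of ~800 bytes is about 4 MB per slice, so 8 seconds works out to 500 kB/s. Injecting `fetch_slice` keeps the paging logic independent of the client, so the same loop could be timed against either the Python or the Ruby side.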

On Thu, Apr 19, 2012 at 12:14 AM, Paolo Bernardi <berna...@gmail.com> wrote:

> Look into your Cassandra's logs to see if JNA is really enabled (it
> really should be, by default), and more importantly if JNA is loaded
> correctly. You might find some surprising message over there: if this
> is the case, just install JNA with your distro's package manager and,
> if still doesn't work, copy the JNA jar into Cassandra's lib directory
> (been there, done that).
>
> Paolo
>
> On Thu, Apr 19, 2012 at 8:26 AM, Dan Feldman <hriunde...@gmail.com> wrote:
> > Hi Tyler and Aaron,
> >
> > Thanks for your replies.
> >
> > Tyler,
> > fetching scs using your pycassa script on our server takes ~7 s -
> > consistent with the times we've been seeing. Now, we aren't really
> > experts in Cassandra, but it seems that JNA is enabled by default for
> > Cassandra 1.0 according to Jeremy
> > (http://comments.gmane.org/gmane.comp.db.cassandra.user/21441). But in
> > case it isn't, how do you turn it on in 1.0.8?
> >
> > I'm also setting MAX_HEAP_SIZE="2G" in cassandra-env.sh. I'm hoping
> > that's how you increase java heap size. I've tried "3G" as well,
> > without any increase in performance. It did however allow for taking
> > larger slices.
> >
> > Aaron,
> > we are not doing multi-threaded requests for now, but we'll give it a
> > shot in the next day or two and I'll let you know if there is any
> > improvement
> >
> > Thanks for your help!
> > Dan F.
> >
> > On Wed, Apr 18, 2012 at 9:44 PM, Tyler Hobbs <ty...@datastax.com> wrote:
> >>
> >> I tested this out with a small pycassa script:
> >> https://gist.github.com/2418598
> >>
> >> On my not-very-impressive laptop, I can read 5000 of the super
> >> columns in 3 seconds (cold) or 1.5 (warm). Reading in batches of
> >> 1000 super columns at a time gives much better performance; I
> >> definitely recommend going with a smaller batch size.
> >>
> >> Make sure that the timeout on your ConnectionPool isn't too low to
> >> handle a big request in pycassa. If you turn on logging (as it is in
> >> the script I linked), you should be able to see if the request is
> >> timing out a couple of times before it succeeds.
> >>
> >> It might also be good to make sure that you've got JNA in place and
> >> your heap size is sufficient.
> >>
> >> On Wed, Apr 18, 2012 at 8:59 PM, Aaron Turner <synfina...@gmail.com>
> >> wrote:
> >>>
> >>> On Wed, Apr 18, 2012 at 5:00 PM, Dan Feldman <hriunde...@gmail.com>
> >>> wrote:
> >>> > Hi all,
> >>> >
> >>> > I'm trying to optimize moving data from Cassandra to HDFS using
> >>> > either Ruby or Python client. Right now, I'm playing around on my
> >>> > staging server, an 8 GB single node machine. My data in Cassandra
> >>> > (1.0.8) consist of 2 rows (for now) with ~150k super columns each
> >>> > (I know, I know - super columns are bad). Every super column has
> >>> > ~25 columns totaling ~800 bytes per super column.
> >>> >
> >>> > I should also mention that currently the database is static -
> >>> > there are no writes/updates, only reads.
> >>> >
> >>> > Anyways, in my python/ruby scripts, I'm taking slices of 5000
> >>> > supercolumns long from a single row. It takes 13 seconds with
> >>> > ruby and 8 seconds with pycassa to get a single slice. Or, in
> >>> > other words, it's currently reading at speeds of less than 500 kB
> >>> > per second. The speed seems to be linear with the length of a
> >>> > slice (i.e. 6 seconds for 2500 scs for ruby). If I run nodetool
> >>> > cfstats while my script is running, it tells me that my read
> >>> > latency on the column family is ~300ms.
> >>> >
> >>> > I assume that this is not normal and thus was wondering what
> >>> > parameters I could tweak to improve the performance.
> >>> >
> >>>
> >>> Is your client multi-threaded? The single threaded performance of
> >>> Cassandra isn't at all impressive and it really is designed for
> >>> dealing with a lot of simultaneous requests.
> >>>
> >>> --
> >>> Aaron Turner
> >>> http://synfin.net/ Twitter: @synfinatic
> >>> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for
> >>> Unix & Windows
> >>> Those who would give up essential Liberty, to purchase a little
> >>> temporary Safety, deserve neither Liberty nor Safety.
> >>> -- Benjamin Franklin
> >>> "carpe diem quam minimum credula postero"
> >>
> >> --
> >> Tyler Hobbs
> >> DataStax