I tested this out with a small pycassa script: https://gist.github.com/2418598
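In case the gist is unavailable, here is a rough sketch of the general idea of paging through one wide row in fixed-size batches. It is not the linked script. The names `iter_row_in_batches`, `fetch_slice`, and `make_fake_row` are made up for illustration, and the fake row stands in for a real cluster so the sketch runs on its own. With pycassa, I'd assume `fetch_slice` would wrap something like `cf.get(key, column_start=start, column_count=count)`, with the pool created as `pycassa.ConnectionPool(keyspace, server_list, timeout=...)`.

```python
import bisect
from collections import OrderedDict


def iter_row_in_batches(fetch_slice, batch_size=1000):
    """Yield lists of (name, value) pairs, at most batch_size per list.

    fetch_slice(start, count) is a hypothetical callable that returns an
    ordered mapping of up to `count` columns whose names are >= `start`
    (an inclusive start, which is how a column slice start behaves).
    """
    start = ""
    first_page = True
    while True:
        # After the first page the start column is one we already returned,
        # so request one extra column and drop the duplicate.
        want = batch_size if first_page else batch_size + 1
        page = list(fetch_slice(start, want).items())
        full = len(page) == want
        if not first_page and page and page[0][0] == start:
            page = page[1:]
        if page:
            yield page
        if not full or not page:
            break
        start = page[-1][0]
        first_page = False


def make_fake_row(n_columns):
    """Build an in-memory stand-in for a wide row (illustration only)."""
    names = ["sc%06d" % i for i in range(n_columns)]

    def fetch_slice(start, count):
        lo = bisect.bisect_left(names, start)
        return OrderedDict((name, {"col": name}) for name in names[lo:lo + count])

    return fetch_slice


if __name__ == "__main__":
    total = 0
    for batch in iter_row_in_batches(make_fake_row(2500), batch_size=1000):
        total += len(batch)
    print(total)
```

Keeping the batch size as a parameter makes it easy to time a few different sizes against the same row.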
On my not-very-impressive laptop, I can read 5000 of the super columns in 3 seconds (cold) or 1.5 seconds (warm). Reading in batches of 1000 super columns at a time gives much better performance; I definitely recommend going with a smaller batch size.

Make sure that the timeout on your pycassa ConnectionPool isn't too low to handle a big request. If you turn on logging (as it is in the script I linked), you should be able to see if the request is timing out a couple of times before it succeeds.

It might also be good to make sure that you've got JNA in place and your heap size is sufficient.

On Wed, Apr 18, 2012 at 8:59 PM, Aaron Turner <synfina...@gmail.com> wrote:

> On Wed, Apr 18, 2012 at 5:00 PM, Dan Feldman <hriunde...@gmail.com> wrote:
> > Hi all,
> >
> > I'm trying to optimize moving data from Cassandra to HDFS using either Ruby
> > or Python client. Right now, I'm playing around on my staging server, an 8
> > GB single node machine. My data in Cassandra (1.0.8) consist of 2 rows (for
> > now) with ~150k super columns each (I know, I know - super columns are bad).
> > Every super column has ~25 columns totaling ~800 bytes per super column.
> >
> > I should also mention that currently the database is static - there are no
> > writes/updates, only reads.
> >
> > Anyways, in my python/ruby scripts, I'm taking slices of 5000 supercolumns
> > long from a single row. It takes 13 seconds with ruby and 8 seconds with
> > pycassa to get a single slice. Or, in other words, it's currently reading at
> > speeds of less than 500 kB per second. The speed seems to be linear with the
> > length of a slice (i.e. 6 seconds for 2500 scs for ruby). If I run nodetool
> > cfstats while my script is running, it tells me that my read latency on the
> > column family is ~300ms.
> >
> > I assume that this is not normal and thus was wondering what parameters I
> > could tweak to improve the performance.
> >
>
> Is your client multi-threaded? The single threaded performance of
> Cassandra isn't at all impressive and it really is designed for
> dealing with a lot of simultaneous requests.
>
>
> --
> Aaron Turner
> http://synfin.net/ Twitter: @synfinatic
> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
> -- Benjamin Franklin
> "carpe diem quam minimum credula postero"
>

--
Tyler Hobbs
DataStax <http://datastax.com/>