Hi Aaron and Martin,

Sorry about my previous reply; I thought you wanted to process only the row keys in the CF.
I have a similar issue to Martin's, because I see myself being forced to hit more than a million rows with a query (I only read a few columns from each row). Aaron, we've talked about this in another thread: basically, I am constrained to ship a window of data out of my online cluster to an offline cluster. For this I need to read, for example, a 5-minute window of all the data I have. That simply touches too many rows, and I am hitting the I/O limit on the nodes; as I understand it, every row costs about 2 random disk seeks (I have no caches).

My question is: what can I do to improve the performance of shipping whole windows of data out? Martin, did you use Hadoop as Aaron suggested, and how did that work with Cassandra? I don't understand how accessing a million rows through map/reduce jobs would be any faster.

Cheers,

Alexandru

On Tue, Feb 14, 2012 at 10:00 AM, aaron morton <aa...@thelastpickle.com> wrote:

> If you want to process 1 million rows, use Hadoop with Hive or Pig. If you
> use Hadoop you are not doing things in real time.
>
> You may need to rephrase the problem.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 14/02/2012, at 11:00 AM, Martin Arrowsmith wrote:
>
> Hi Experts,
>
> My program is such that it queries all keys on Cassandra. I want to do
> this as quickly as possible, in order to get as close to real time as possible.
>
> One solution I heard was to use the sstable2json tool and read the data
> in as JSON, since I understand that reading each row through Cassandra might
> take longer.
>
> Are there any other ideas for doing this? Or can you confirm that
> sstable2json is the way to go?
>
> Querying 100 rows in Cassandra the normal way is fast enough. I'd like to
> query a million rows, do some calculations on them, and spit out the result
> like it's real time.
>
> Thanks for any help you can give,
>
> Martin
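
P.S. To make my question more concrete, here is a rough sketch of what I understand the Hadoop route to look like, modelled on the word_count example that ships with Cassandra. The keyspace ("MyKeyspace"), column family ("MyColumnFamily"), column name ("payload") and output path are placeholders for my real schema, and I am not certain the ConfigHelper method names are exactly right for every version. Is this roughly what you mean, and is the win simply that each mapper scans the token ranges local to the node it runs on, instead of one client doing a million individual row reads?

// Rough sketch of a map-only Hadoop job that reads a column family through
// Cassandra's ColumnFamilyInputFormat, modelled on the bundled word_count
// example. Keyspace, CF, column name and output path are placeholders;
// ConfigHelper method names differ slightly between Cassandra versions.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WindowExport extends Configured implements Tool
{
    // Each mapper is handed a split of token ranges, preferably ones stored on
    // the node it runs on, so the whole CF is scanned in parallel across the
    // cluster rather than row by row from a single client.
    public static class ExportMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text>
    {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
                throws IOException, InterruptedException
        {
            IColumn column = columns.get(ByteBufferUtil.bytes("payload")); // placeholder column name
            if (column == null)
                return;
            // Emit row key and the one column we asked for; a real job would
            // filter here to keep only the 5-minute window being shipped out.
            context.write(new Text(ByteBufferUtil.string(key)),
                          new Text(ByteBufferUtil.string(column.value())));
        }
    }

    public int run(String[] args) throws Exception
    {
        Job job = new Job(getConf(), "window-export");
        job.setJarByClass(WindowExport.class);
        job.setMapperClass(ExportMapper.class);
        job.setNumReduceTasks(0);                       // map-only: just dump key/value pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/window-export")); // placeholder output dir

        // Connection details; in some versions these are called setInputRpcPort,
        // setInputInitialAddress and setInputPartitioner instead.
        ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
        ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");

        // Which keyspace/CF to scan, and which columns to pull from each row.
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "MyKeyspace", "MyColumnFamily");
        SlicePredicate predicate = new SlicePredicate()
                .setColumn_names(Arrays.asList(ByteBufferUtil.bytes("payload")));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception
    {
        System.exit(ToolRunner.run(new WindowExport(), args));
    }
}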