When you say "query 1 million records", in my mind I hear "dump 1 million records to another system as a back office job". Hadoop will split the job over multiple nodes and assign each task to read the range "owned" by one node. From memory it uses CL ONE (by default) for the read, so the node the task connects to is the only one involved in the read. The task can also be run on that node rather than off node, so the read stays local.
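For concreteness, here is a minimal sketch of such a job using the Hadoop support shipped with Cassandra 1.x (ColumnFamilyInputFormat). The keyspace, column family and addresses are placeholders, the mapper just counts columns per row, and the ConfigHelper method names are from memory, so check them against your version:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class CassandraExport {

    // Each map task is handed one token range and (with the default
    // read consistency of ONE) reads it from a node that owns it.
    public static class ExportMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text> {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                           Context context) throws IOException, InterruptedException {
            // Stand-in for real export logic: emit row key and column count.
            context.write(new Text(ByteBufferUtil.string(key)),
                          new Text(Integer.toString(columns.size())));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cassandra-export");
        job.setJarByClass(CassandraExport.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setMapperClass(ExportMapper.class);
        job.setNumReduceTasks(0); // map-only dump
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        Configuration c = job.getConfiguration();
        ConfigHelper.setInputInitialAddress(c, "127.0.0.1");   // any live node
        ConfigHelper.setInputRpcPort(c, "9160");
        ConfigHelper.setInputPartitioner(c, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(c, "MyKeyspace", "MyCF"); // placeholders
        // Up to 1000 columns per row; empty start/finish means all column names.
        ConfigHelper.setInputSlicePredicate(c, new SlicePredicate().setSlice_range(
            new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                           ByteBufferUtil.EMPTY_BYTE_BUFFER, false, 1000)));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}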
This does not magic up some new IO capacity though. It will spread the workload, so to add IO capacity, add nodes.

You could do something similar yourself by reducing the CL level and querying through the Thrift interface, asking each node only for data in the key range it "owns" (first sketch at the end of this message, below the quoted thread).

If this does not help, the next step is to borrow from the ideas in DataStax Brisk (now DataStax Enterprise). Use the NetworkTopologyStrategy and two data centres, or a virtual data centre (http://wiki.apache.org/cassandra/HadoopSupport). One DC is for OLTP and the other for OLAP / export; the OLTP side will be able to run without interruption from the OLAP side (second sketch below).

Another option is to use something like Kafka and fork the data stream: send it to Cassandra and the external system at the same time (third sketch below).

Hope that helps.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/02/2012, at 2:21 PM, Martin Arrowsmith wrote:

> Hi Alexandru,
>
> Things got hectic and I put off the project until this weekend. I'm actually learning about Hadoop right now and how to implement it. I can respond to this thread when I have something running.
>
> In the meantime, I'd like to bump this email up and see if there are others who can provide some feedback. 1) Will Hadoop speed up the time to read all the rows? 2) Are there other options?
>
> My guess was that Hadoop could split up your jobs, so each node could handle a portion of the query. For instance, having 2 nodes would do the job twice as fast. That is my naive guess though and could be far from the truth.
>
> Best wishes,
>
> Martin
>
> On Fri, Feb 24, 2012 at 5:29 AM, Alexandru Sicoe <adsi...@gmail.com> wrote:
> Hi Aaron and Martin,
>
> Sorry about my previous reply, I thought you wanted to process only all the row keys in the CF.
>
> I have a similar issue as Martin because I see myself being forced to hit more than a million rows with a query (I only get a few columns from every row). Aaron, we've talked about this in another thread; basically I am constrained to ship a window of data out of my online cluster to an offline cluster. For this I need to read, for example, a 5 min window of all the data I have. This simply accesses too many rows and I am hitting the I/O limit on the nodes. As I understand it, for every row it will do 2 random disk seeks (I have no caches).
>
> My question is: what can I do to improve the performance of shipping windows of data entirely out?
>
> Martin, did you use Hadoop as Aaron suggested? How did that work with Cassandra? I don't understand how accessing 1 million rows through map reduce jobs would be any faster.
>
> Cheers,
> Alexandru
>
> On Tue, Feb 14, 2012 at 10:00 AM, aaron morton <aa...@thelastpickle.com> wrote:
> If you want to process 1 million rows use Hadoop with Hive or Pig. If you use Hadoop you are not doing things in real time.
>
> You may need to rephrase the problem.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 14/02/2012, at 11:00 AM, Martin Arrowsmith wrote:
>
>> Hi Experts,
>>
>> My program is such that it queries all keys on Cassandra. I want to do this as quickly as possible, in order to get as close to real-time as possible.
>>
>> One solution I heard was to use the sstable2json tool and read the data in as JSON. I understand that reading from each line in Cassandra might take longer.
>>
>> Are there any other ideas for doing this?
>> Or can you confirm that sstable2json is the way to go?
>>
>> Querying 100 rows in Cassandra the normal way is fast enough. I'd like to query a million rows, do some calculations on them, and spit out the result like it's real time.
>>
>> Thanks for any help you can give,
>>
>> Martin
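P.S. The sketches referenced above. They are illustrations, not production code: names like MyKeyspace / MyCF, the addresses, and the row counts are all placeholders.

First, "only ask a node for the range it owns" over Thrift, at CL ONE. A real dump would also page within each token range (by feeding the token of the last row returned back into KeyRange.start_token) and handle errors; both are left out to keep it short.

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.thrift.TokenRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class RangeDump {
    public static void main(String[] args) throws Exception {
        Cassandra.Client seed = connect("127.0.0.1"); // any live node
        seed.set_keyspace("MyKeyspace");

        // describe_ring returns one TokenRange per ring segment,
        // along with the endpoints that hold replicas of it.
        for (TokenRange range : seed.describe_ring("MyKeyspace")) {
            // Ask the first replica for the range it owns...
            Cassandra.Client owner = connect(range.getEndpoints().get(0));
            owner.set_keyspace("MyKeyspace");

            KeyRange kr = new KeyRange(1000); // rows per call
            kr.setStart_token(range.getStart_token());
            kr.setEnd_token(range.getEnd_token());

            // First 100 columns of each row; empty start/finish = all names.
            SlicePredicate pred = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBuffer.wrap(new byte[0]),
                               ByteBuffer.wrap(new byte[0]), false, 100));

            // ...at CL ONE, so only the connected node does the read.
            List<KeySlice> rows = owner.get_range_slices(
                new ColumnParent("MyCF"), pred, kr, ConsistencyLevel.ONE);
            System.out.println(range.getStart_token() + " -> "
                    + range.getEnd_token() + ": " + rows.size() + " rows");
        }
    }

    private static Cassandra.Client connect(String host) throws Exception {
        TFramedTransport transport = new TFramedTransport(new TSocket(host, 9160));
        transport.open();
        return new Cassandra.Client(new TBinaryProtocol(transport));
    }
}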
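Second, the two data centre keyspace. With NetworkTopologyStrategy the replica count is set per DC, so the OLAP / export DC can hold a single copy of the data. Shown here through the Thrift KsDef API; the DC names must match what your snitch reports, and the counts are examples only.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;
import org.apache.cassandra.thrift.KsDef;

public class CreateDualDcKeyspace {
    public static void create(Cassandra.Client client) throws Exception {
        // One CF so the keyspace is not empty; names are placeholders.
        CfDef cf = new CfDef("MyKeyspace", "MyCF");

        KsDef ks = new KsDef("MyKeyspace",
                "org.apache.cassandra.locator.NetworkTopologyStrategy",
                Arrays.asList(cf));

        // Replicas per data centre: 3 in OLTP, 1 in the OLAP / export DC.
        Map<String, String> options = new HashMap<String, String>();
        options.put("OLTP", "3");
        options.put("OLAP", "1");
        ks.setStrategy_options(options);

        client.system_add_keyspace(ks);
    }
}

OLTP clients then read and write at LOCAL_QUORUM against the OLTP DC, while the Hadoop / export jobs connect only to the OLAP nodes, so the scans cannot steal IO from the online side.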
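Third, the Kafka fork. This sketch uses the current Kafka producer API rather than what existed in early 2012, and writeToCassandra() is a placeholder for whatever Cassandra client you use; the point is only that every event is written to both sinks at ingest time, so the export side never has to scan Cassandra at all.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ForkedWriter {
    private final KafkaProducer<String, String> producer;

    public ForkedWriter() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<String, String>(props);
    }

    // Every event goes to both sinks: Cassandra for OLTP, a Kafka
    // topic for the external / export system to consume.
    public void write(String key, String value) {
        writeToCassandra(key, value);
        producer.send(new ProducerRecord<String, String>("events", key, value));
    }

    // Placeholder: insert with whatever Cassandra client you use.
    private void writeToCassandra(String key, String value) {
        // e.g. a Thrift insert() or a higher-level client call
    }

    public void close() {
        producer.close();
    }
}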