Thanks, Rob. To be clear, I expect this range query to take a long time and perform relatively heavy I/O. What I expected Cassandra to do was use auto-paging (https://issues.apache.org/jira/browse/CASSANDRA-4415, http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3) so that we aren't literally pulling the entire result set in at once. Am I misunderstanding this use case? Could you clarify why exactly it would slow way down? It seems like each read should be doing a simple range read from one or two sstables.
If this won't work then it may be that we need to start using Hive/Spark/Pig etc. sooner, or page it manually using LIMIT and WHERE > [the last returned result].

On Mon, Nov 24, 2014 at 5:49 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Mon, Nov 24, 2014 at 4:26 PM, Dan Kinder <dkin...@turnitin.com> wrote:
>
>> We have a web crawler project currently based on Cassandra (
>> https://github.com/iParadigms/walker, written in Go and using the gocql
>> driver), with the following relevant usage pattern:
>>
>> - Big range reads over a CF to grab potentially millions of rows and
>> dispatch new links to crawl
>>
>
> If you really mean millions of storage rows, this is just about the worst
> case for Cassandra. The problem you're having is probably that you
> shouldn't try to do this in Cassandra.
>
> Your timeouts are either from the read actually taking longer than the
> timeout or from the reads provoking heap pressure and resulting GC.
>
> =Rob