Thanks Rob.

To be clear, I expect this range query to take a long time and perform
relatively heavy I/O. What I expected Cassandra to do was use auto-paging (
https://issues.apache.org/jira/browse/CASSANDRA-4415,
http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3)
so that we aren't literally pulling the entire thing in. Am I
misunderstanding this use case? Could you clarify why exactly it would slow
way down? It seems like with each read it should be doing a simple range
read from one or two sstables.

If this won't work then it may be that we need to start using Hive/Spark/Pig
etc. sooner, or page it manually using LIMIT and WHERE > [the last returned
result].
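
For reference, the manual paging I have in mind looks roughly like the sketch below. It is only an illustration under assumptions: the row type, fetchPage, scanAll and memFetcher are hypothetical names, and the in-memory fetcher stands in for a real query of the form "SELECT ... WHERE key > ? LIMIT ?". In actual CQL the comparison would presumably be on a clustering column within a partition, or on token(key) when walking across partitions.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// row is a hypothetical stand-in for one result row, ordered by key.
type row struct {
	key   string
	value string
}

// fetchPage is a hypothetical stand-in for a query such as
//   SELECT ... WHERE key > ? LIMIT ?
// It returns up to limit rows whose key is strictly greater than after,
// in ascending key order.
type fetchPage func(after string, limit int) ([]row, error)

// scanAll walks the whole range one page at a time, using the last key of
// each page as the cursor for the next fetch. It assumes keys are non-empty,
// so the empty string can serve as the starting cursor.
func scanAll(fetch fetchPage, limit int, handle func(row)) error {
	if limit <= 0 {
		return errors.New("limit must be positive")
	}
	after := ""
	for {
		page, err := fetch(after, limit)
		if err != nil {
			return err
		}
		for _, r := range page {
			handle(r)
		}
		// A short page means there is nothing left to read.
		if len(page) < limit {
			return nil
		}
		after = page[len(page)-1].key
	}
}

// memFetcher builds an in-memory fetchPage over a copy of rows, used here
// only so the sketch runs without a cluster.
func memFetcher(rows []row) fetchPage {
	sorted := append([]row(nil), rows...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].key < sorted[j].key })
	return func(after string, limit int) ([]row, error) {
		// Find the first row whose key is strictly greater than the cursor.
		start := sort.Search(len(sorted), func(i int) bool { return sorted[i].key > after })
		end := start + limit
		if end > len(sorted) {
			end = len(sorted)
		}
		return sorted[start:end], nil
	}
}

func main() {
	rows := []row{{"c", "3"}, {"a", "1"}, {"e", "5"}, {"b", "2"}, {"d", "4"}}
	err := scanAll(memFetcher(rows), 2, func(r row) {
		fmt.Println(r.key, r.value)
	})
	if err != nil {
		fmt.Println("scan failed:", err)
	}
}
```

The point of the cursor is that each fetch only ever asks for one bounded slice, so nothing on either side has to hold the full result set at once.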

On Mon, Nov 24, 2014 at 5:49 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Mon, Nov 24, 2014 at 4:26 PM, Dan Kinder <dkin...@turnitin.com> wrote:
>
>> We have a web crawler project currently based on Cassandra (
>> https://github.com/iParadigms/walker, written in Go and using the gocql
>> driver), with the following relevant usage pattern:
>>
>> - Big range reads over a CF to grab potentially millions of rows and
>> dispatch new links to crawl
>>
>
> If you really mean millions of storage rows, this is just about the worst
> case for Cassandra. The problem you're having is probably that you
> shouldn't try to do this in Cassandra.
>
> Your timeouts are either from the read actually taking longer than the
> timeout or from the reads provoking heap pressure and resulting GC.
>
> =Rob
>
>
