In the absence of streaming, how does one retrieve a large result set in
CQL3? E.g., to use the same example:
> CREATE TABLE bug_test (a int, b int, c int, d int, e int, f text, PRIMARY
> KEY (a, b, c, d, e) );
> with some data in it:
>
> SELECT * FROM bug_test;
>
> Results:
>
> a | b | c | d | e | f
> ------+--
> 1 | 1 | 1 | 1 | 1 | 1
> 1 | 1 | 1 | 1 | 2 | 2
> 1 | 1 | 1 | 1 | 3 | 3
> 1 | 1 | 1 | 1 | 5 | 5
> 1 | 1 | 1 | 2 | 1 | 1
The query
SELECT * FROM bug_test WHERE a = 1 AND b = 1;
will retrieve the above 5 rows. But what if my dataset had a million
rows that match the query? How can I retrieve them?

Thanks!
Thorsten



On 7/24/2012 11:58 PM, Sylvain Lebresne wrote:
> The short answer is yes, we are looking into adding streaming of
> results to solve that problem
> (https://issues.apache.org/jira/browse/CASSANDRA-4415).
>
> --
> Sylvain
>
> On Tue, Jul 24, 2012 at 6:51 PM, Josep Blanquer <blanq...@rightscale.com> 
> wrote:
>> Thank Sylvain,
>>
>>  The main argument for this is pagination. Let me try to explain the use
>> cases, and compare it to RDBMS for better illustration:
>>  1- Right now, Cassandra doesn't stream the requests, so large resultsets
>> are a royal pain in the neck to deal with. I.e., if I have a range_slice, or
>> even a slice query that cuts across 1 million columns...I have to completely
>> "eat it all" in the client receiving the response. That is, I'll need to
>> store 1 million results in the client no matter what, and that can be quite
>> prohibitive.
>>  2- In an effort to alleviate that, one can be smarter in the client and
>> play the pagination game...i.e., start slicing at some column and get the
>> next N results, then start the slice at the last column seen and get N
>> more....etc. That results in many more queries from the smart client, but at
>> least it would allow you to handle large result sets. (That's where the need
>> for the CQL query in my original email was about).
>> 3- There's another important factor related to this problem in my opinion:
>> the LIMIT clause in Cassandra (in both CQL or Thrift) is a "required" field.
>> What I mean by "required" is that cassandra requires an explicit "count" to
>> operate underneath. So it is really different from RDBMS' semantics where no
>> LIMIT means you'll get "all" the results (instead of the high, yet still
>> bound count of 10K or 20K max resultset row cassandra enforces by
>> defaul)...and I cannot tell you how many problems we've had with developers
>> forgetting about these "default" counts in queries, and realizing that some
>> had results truncated because of that...in my mind, LIMIT should be to only
>> used restrict results...queries with no LIMIT should always return all
>> results (much like RDBMS)...otherwise the query "seems" the same but it is
>> semantically different.
>>
>> So, all in all I think that the main problem/use case I'm facing is that
>> Cassandra cannot stream resultsets. If it did, I believe that the need for
>> my pagination use case would basically disappear, since it'd be the
>> transport/client that would throttle how many results are stored in the
>> client buffer at any point time. At the same time, I believe that with a
>> streaming protocol you could simply change Cassandra internals to have
>> "infinite" default limits...since there wouldn't be no reason to stop
>> scanning (unless an explicit LIMIT clause was specified by the client). That
>> would give you not only the SQL-equivalent syntax, but also the equivalent
>> semantics of most current DBs.
>>
>> I hope that makes sense. That being said, are there any plans for streaming
>> results? I believe that without that (and especially with the new CQL
>> restrictions) it make much more difficult to use Cassandra with wide rows
>> and large resultsets (which, in my mind is one of its sweet spots ). I
>> believe that if that doesn't happen it would a) force the clients to be
>> built in a much more complex and inefficient way to handle wide rows or b)
>> will force users to use different, less efficient datamodels for their data.
>> Both seem bad propositions to me, as they wouldn't be taking advantage of
>> Cassandra's power, therefore diminishing its value.
>>
>>  Cheers,
>>
>>  Josep M.
>>
>>
>> On Tue, Jul 24, 2012 at 3:11 AM, Sylvain Lebresne <sylv...@datastax.com>
>> wrote:
>>> On Tue, Jul 24, 2012 at 12:09 AM, Josep Blanquer
>>> <blanq...@rightscale.com> wrote:
>>>> is there some way to express that in CQL3? something logically
>>>> equivalent to
>>>>
>>>> SELECT *  FROM bug_test WHERE a:b:c:d:e > 1:1:1:1:2    ??
>>> No, there isn't. Not currently at least. But feel free of course to
>>> open a ticket/request on
>>> https://issues.apache.org/jira/browse/CASSANDRA.
>>>
>>> I note that I would be curious to know the concrete use case you have
>>> for such type of queries. It would also help as an argument to add
>>> such facilities more quickly (or at all). Typically, "we should
>>> support it in CQL3 because it was possible with thrift" is
>>> definitively an argument, but a much weaker one without concrete
>>> examples of why it might be useful in the first place.
>>>
>>> --
>>> Sylvain
>>

Reply via email to