In the absence of streaming, how does one retrieve a large result set in CQL3? E.g., to use the same example: > CREATE TABLE bug_test (a int, b int, c int, d int, e int, f text, PRIMARY > KEY (a, b, c, d, e) ); > with some data in it: > > SELECT * FROM bug_test; > > Results: > > a | b | c | d | e | f > ------+-- > 1 | 1 | 1 | 1 | 1 | 1 > 1 | 1 | 1 | 1 | 2 | 2 > 1 | 1 | 1 | 1 | 3 | 3 > 1 | 1 | 1 | 1 | 5 | 5 > 1 | 1 | 1 | 2 | 1 | 1 The query SELECT * FROM bug_test WHERE a = 1 AND b = 1; will retrieve the above 5 rows. But what if my dataset had a million rows that match the query? How can I retrieve them?
Thanks! Thorsten On 7/24/2012 11:58 PM, Sylvain Lebresne wrote: > The short answer is yes, we are looking into adding streaming of > results to solve that problem > (https://issues.apache.org/jira/browse/CASSANDRA-4415). > > -- > Sylvain > > On Tue, Jul 24, 2012 at 6:51 PM, Josep Blanquer <blanq...@rightscale.com> > wrote: >> Thank Sylvain, >> >> The main argument for this is pagination. Let me try to explain the use >> cases, and compare it to RDBMS for better illustration: >> 1- Right now, Cassandra doesn't stream the requests, so large resultsets >> are a royal pain in the neck to deal with. I.e., if I have a range_slice, or >> even a slice query that cuts across 1 million columns...I have to completely >> "eat it all" in the client receiving the response. That is, I'll need to >> store 1 million results in the client no matter what, and that can be quite >> prohibitive. >> 2- In an effort to alleviate that, one can be smarter in the client and >> play the pagination game...i.e., start slicing at some column and get the >> next N results, then start the slice at the last column seen and get N >> more....etc. That results in many more queries from the smart client, but at >> least it would allow you to handle large result sets. (That's where the need >> for the CQL query in my original email was about). >> 3- There's another important factor related to this problem in my opinion: >> the LIMIT clause in Cassandra (in both CQL or Thrift) is a "required" field. >> What I mean by "required" is that cassandra requires an explicit "count" to >> operate underneath. So it is really different from RDBMS' semantics where no >> LIMIT means you'll get "all" the results (instead of the high, yet still >> bound count of 10K or 20K max resultset row cassandra enforces by >> defaul)...and I cannot tell you how many problems we've had with developers >> forgetting about these "default" counts in queries, and realizing that some >> had results truncated because of that...in my mind, LIMIT should be to only >> used restrict results...queries with no LIMIT should always return all >> results (much like RDBMS)...otherwise the query "seems" the same but it is >> semantically different. >> >> So, all in all I think that the main problem/use case I'm facing is that >> Cassandra cannot stream resultsets. If it did, I believe that the need for >> my pagination use case would basically disappear, since it'd be the >> transport/client that would throttle how many results are stored in the >> client buffer at any point time. At the same time, I believe that with a >> streaming protocol you could simply change Cassandra internals to have >> "infinite" default limits...since there wouldn't be no reason to stop >> scanning (unless an explicit LIMIT clause was specified by the client). That >> would give you not only the SQL-equivalent syntax, but also the equivalent >> semantics of most current DBs. >> >> I hope that makes sense. That being said, are there any plans for streaming >> results? I believe that without that (and especially with the new CQL >> restrictions) it make much more difficult to use Cassandra with wide rows >> and large resultsets (which, in my mind is one of its sweet spots ). I >> believe that if that doesn't happen it would a) force the clients to be >> built in a much more complex and inefficient way to handle wide rows or b) >> will force users to use different, less efficient datamodels for their data. >> Both seem bad propositions to me, as they wouldn't be taking advantage of >> Cassandra's power, therefore diminishing its value. >> >> Cheers, >> >> Josep M. >> >> >> On Tue, Jul 24, 2012 at 3:11 AM, Sylvain Lebresne <sylv...@datastax.com> >> wrote: >>> On Tue, Jul 24, 2012 at 12:09 AM, Josep Blanquer >>> <blanq...@rightscale.com> wrote: >>>> is there some way to express that in CQL3? something logically >>>> equivalent to >>>> >>>> SELECT * FROM bug_test WHERE a:b:c:d:e > 1:1:1:1:2 ?? >>> No, there isn't. Not currently at least. But feel free of course to >>> open a ticket/request on >>> https://issues.apache.org/jira/browse/CASSANDRA. >>> >>> I note that I would be curious to know the concrete use case you have >>> for such type of queries. It would also help as an argument to add >>> such facilities more quickly (or at all). Typically, "we should >>> support it in CQL3 because it was possible with thrift" is >>> definitively an argument, but a much weaker one without concrete >>> examples of why it might be useful in the first place. >>> >>> -- >>> Sylvain >>