Thanks Eric. I figured out the same, but didn't get time to put it in the mail.

But this is tightly tied to how data is stored internally in Cassandra: basically, how partition keys are used to distribute rows (less likely to change, and we don't depend directly on the partitioning algorithm) and how clustering keys are used to sort data within a partition (multi-level sorting, hence the restrictions on the ORDER BY clause). The latter, I think, could well change in Cassandra 3.x or 4.x in favor of some better storage or retrieval scheme. Given that, I am hesitant to implement this pagination logic client-side, because:

a) pages 2+ might need more than one query to Cassandra;
b) the implementation is tied to Cassandra's internal storage details, which can change (though not often);
c) in our case, we are building REST APIs which will be deployed on Tomcat clusters, so whatever we cache to support pagination needs to be cached in a distributed way for failover support.

Pagination support is best done at the server side, like ROWNUM in SQL, or better yet in the Java driver, which can hide the internal details and optimize better since the server sends the paging state to the driver; see the sketch below.
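For example, with a driver version that exposes the paging state (the PagingState class, added around Java driver 2.0.10 / 2.1.5, if I remember right), a stateless REST tier could do something like the following untested sketch, run against Eric's foo table below:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PagingState;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class DriverPaging {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("mykeyspace"); // hypothetical keyspace

        Statement stmt = new SimpleStatement("SELECT * FROM foo WHERE partitionkey = 1");
        stmt.setFetchSize(2); // page size

        ResultSet rs = session.execute(stmt);
        // Drain exactly one page without triggering the next background fetch.
        int remaining = rs.getAvailableWithoutFetching();
        for (Row row : rs) {
            System.out.println(row);
            if (--remaining == 0) break;
        }

        // Opaque server-provided paging state (null when there are no more pages).
        // Serialize it into the REST response so any Tomcat node can resume:
        // no sticky sessions and no distributed cache needed.
        PagingState state = rs.getExecutionInfo().getPagingState();
        String token = (state == null) ? null : state.toString();

        // Later, possibly on a different node:
        if (token != null) {
            Statement next = new SimpleStatement("SELECT * FROM foo WHERE partitionkey = 1");
            next.setFetchSize(2);
            next.setPagingState(PagingState.fromString(token));
            ResultSet page2 = session.execute(next);
            System.out.println(page2.one());
        }

        cluster.close();
    }
}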
Thanks
Ajay

On Feb 12, 2015 8:22 PM, "Eric Stevens" <migh...@gmail.com> wrote:

> Your page state then needs to track the last ck1 and last ck2 you saw.
> Pages 2+ may end up taking up to two queries each if the first query
> doesn't fill the page size.
>
> CREATE TABLE foo (
>   partitionkey int,
>   ck1 int,
>   ck2 int,
>   col1 int,
>   col2 int,
>   PRIMARY KEY ((partitionkey), ck1, ck2)
> ) WITH CLUSTERING ORDER BY (ck1 asc, ck2 desc);
>
> INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,1,1,1);
> INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,2,2,2);
> INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,3,3,3);
> INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,1,4,4);
> INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,2,5,5);
> INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,3,6,6);
>
> If you're pulling the whole of partition 1 and your page size is 2, your
> first page looks like:
>
> *PAGE 1*
>
> SELECT * FROM foo WHERE partitionkey = 1 LIMIT 2;
>
>  partitionkey | ck1 | ck2 | col1 | col2
> --------------+-----+-----+------+------
>             1 |   1 |   3 |    3 |    3
>             1 |   1 |   2 |    2 |    2
>
> You got enough rows to satisfy the page. Your page state is taken from the
> last row: (ck1 = 1, ck2 = 2).
>
> *PAGE 2*
>
> Notice that you have a page state, and add some limiting clauses to the
> statement:
>
> SELECT * FROM foo WHERE partitionkey = 1 AND ck1 = 1 AND ck2 < 2 LIMIT 2;
>
>  partitionkey | ck1 | ck2 | col1 | col2
> --------------+-----+-----+------+------
>             1 |   1 |   1 |    1 |    1
>
> Oops, we didn't get enough rows to satisfy the page limit, so we need to
> continue on; we just need one more:
>
> SELECT * FROM foo WHERE partitionkey = 1 AND ck1 > 1 LIMIT 1;
>
>  partitionkey | ck1 | ck2 | col1 | col2
> --------------+-----+-----+------+------
>             1 |   2 |   3 |    6 |    6
>
> We have enough to satisfy page 2 now. Our new page state: (ck1 = 2, ck2 = 3).
>
> *PAGE 3*
>
> SELECT * FROM foo WHERE partitionkey = 1 AND ck1 = 2 AND ck2 < 3 LIMIT 2;
>
>  partitionkey | ck1 | ck2 | col1 | col2
> --------------+-----+-----+------+------
>             1 |   2 |   2 |    5 |    5
>             1 |   2 |   1 |    4 |    4
>
> Great, we satisfied this page with only one query. Page state: (ck1 = 2, ck2 = 1).
>
> *PAGE 4*
>
> SELECT * FROM foo WHERE partitionkey = 1 AND ck1 = 2 AND ck2 < 1 LIMIT 2;
>
> (0 rows)
>
> Oops, our initial query was on the boundary of ck1, but this looks like any
> other time the initial query returns fewer than pageSize rows: we just move
> on to the next page:
>
> SELECT * FROM foo WHERE partitionkey = 1 AND ck1 > 2 LIMIT 2;
>
> (0 rows)
>
> Aha, we've exhausted ck1 as well, so there are no more pages; page 3
> actually pulled the last possible value, page 4 is empty, and we're all
> done. Generally speaking, you know you're done when your first clustering
> key is the only non-equality operator in the statement and you get no rows
> back.
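> Here is roughly what that page-fetch looks like in code. This is an
> untested sketch against the foo table above; PageState is a hypothetical
> holder for the last (ck1, ck2) seen, and it assumes the DataStax Java
> driver:
>
> import com.datastax.driver.core.Row;
> import com.datastax.driver.core.Session;
> import java.util.ArrayList;
> import java.util.List;
>
> public class ManualPaging {
>     // Hypothetical holder for the last clustering values seen on a page.
>     static final class PageState {
>         final int ck1, ck2;
>         PageState(int ck1, int ck2) { this.ck1 = ck1; this.ck2 = ck2; }
>     }
>
>     static List<Row> fetchPage(Session session, int pk, PageState state, int pageSize) {
>         List<Row> page = new ArrayList<Row>();
>         if (state == null) {
>             // Page 1: only the partition key restricts the query.
>             page.addAll(session.execute(
>                 "SELECT * FROM foo WHERE partitionkey = ? LIMIT " + pageSize,
>                 pk).all());
>         } else {
>             // Query 1: drain the rows sharing the last ck1 (ck2 is DESC, hence <).
>             page.addAll(session.execute(
>                 "SELECT * FROM foo WHERE partitionkey = ? AND ck1 = ? AND ck2 < ? LIMIT " + pageSize,
>                 pk, state.ck1, state.ck2).all());
>             if (page.size() < pageSize) {
>                 // Query 2: top up the page from the next ck1 values.
>                 page.addAll(session.execute(
>                     "SELECT * FROM foo WHERE partitionkey = ? AND ck1 > ? LIMIT " + (pageSize - page.size()),
>                     pk, state.ck1).all());
>             }
>         }
>         return page;
>     }
> }
>
> The caller takes (ck1, ck2) of the last row as the next page state; an
> empty result from a statement whose only non-EQ restriction is ck1 means
> there are no more pages.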
> On Wed, Feb 11, 2015 at 10:55 AM, Ajay <ajay.ga...@gmail.com> wrote:
>
>> Basically I am trying different queries with your approach.
>>
>> One such query is like:
>>
>> SELECT * FROM mycf WHERE <condition on partition key> ORDER BY ck1 ASC, ck2 DESC;
>>
>> where ck1 and ck2 are clustering keys, in that order. Here, how do we
>> achieve pagination support?
>>
>> Thanks
>> Ajay
>>
>> On Feb 11, 2015 11:16 PM, "Ajay" <ajay.ga...@gmail.com> wrote:
>>
>>> Hi Eric,
>>>
>>> Thanks for your reply.
>>>
>>> I am using Cassandra 2.0.11, and there I cannot append a condition like
>>> "last clustering key column > value of the last row in the previous
>>> batch": it fails with "preceding column is either not restricted or by a
>>> non-EQ relation". That means I need to specify an equality condition on
>>> all preceding clustering key columns, and with that I cannot get the
>>> pagination correct.
>>>
>>> Thanks
>>> Ajay
>>>
>>> > I can't believe that everyone reads & processes all rows at once
>>> > (without pagination).
>>>
>>> Probably not too many people try to read all rows in a table as a single
>>> rolling operation with a standard client driver. But those who do would
>>> use token() to keep track of where they are, and would be able to resume
>>> with that as well (see the sketch at the end of this message).
>>>
>>> But it sounds like you're talking about paginating a subset of data:
>>> larger than you want to process as a unit, but prefiltered by some other
>>> criteria which prevent you from being able to rely on token(). For this
>>> there is no general-purpose solution, but it typically involves
>>> maintaining your own paging state: keeping track of the last partitioning
>>> and clustering key seen, and using that to construct your next query.
>>>
>>> For example, we have client queries which can span several partitioning
>>> keys. We make sure that the list of partition keys generated by a given
>>> client query, List(Pq), is deterministic; our paging state is then the
>>> index offset of the final Pq in the response, plus the value of the
>>> final clustering column. A query coming in with a paging state attached
>>> to it starts the next set of queries from the provided Pq offset, where
>>> clusteringKey > the provided value.
>>>
>>> So if you can just track the partition key offset (if spanning multiple
>>> partitions) and the clustering key offset, you can construct your next
>>> query from those instead.
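>>> As a sketch, the token()-based rolling read mentioned above might look
>>> like the following (untested; it assumes Murmur3Partitioner, so token()
>>> values fit in a bigint, and it reuses the foo table from this thread):
>>>
>>> import com.datastax.driver.core.ResultSet;
>>> import com.datastax.driver.core.Row;
>>> import com.datastax.driver.core.Session;
>>>
>>> public class TokenScan {
>>>     static void scanAll(Session session, int chunkSize) {
>>>         Long lastToken = null; // token of the last partition key seen
>>>         while (true) {
>>>             String cql = "SELECT partitionkey, token(partitionkey) FROM foo"
>>>                 + (lastToken == null ? "" : " WHERE token(partitionkey) > " + lastToken)
>>>                 + " LIMIT " + chunkSize;
>>>             ResultSet rs = session.execute(cql);
>>>             int seen = 0;
>>>             for (Row row : rs) {
>>>                 lastToken = row.getLong(1);
>>>                 seen++;
>>>                 // ... process the row ...
>>>             }
>>>             if (seen < chunkSize) return; // short chunk: end of the ring
>>>         }
>>>     }
>>> }
>>>
>>> (Simplification: resuming with token(...) > lastToken can skip rows when
>>> a chunk boundary falls inside a multi-row partition; a real scan has to
>>> finish the last partition first, much like the ck1/ck2 logic above.)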
>>> On Tue, Feb 10, 2015 at 6:58 PM, Ajay <ajay.ga...@gmail.com> wrote:
>>>
>>>> Thanks Alex.
>>>>
>>>> But is there any workaround possible? I can't believe that everyone
>>>> reads & processes all rows at once (without pagination).
>>>>
>>>> Thanks
>>>> Ajay
>>>>
>>>> On Feb 10, 2015 11:46 PM, "Alex Popescu" <al...@datastax.com> wrote:
>>>>
>>>>> On Tue, Feb 10, 2015 at 4:59 AM, Ajay <ajay.ga...@gmail.com> wrote:
>>>>>
>>>>>> 1) The Java driver implicitly supports pagination in the ResultSet
>>>>>> (using its Iterator), which can be controlled through the fetch size.
>>>>>> But it is limited in that we cannot skip ahead or go back; the fetch
>>>>>> state is not exposed.
>>>>>
>>>>> Cassandra doesn't support skipping, so this is not really a limitation
>>>>> of the driver.
>>>>>
>>>>> --
>>>>> [:>-a)
>>>>> Alex Popescu
>>>>> Sen. Product Manager @ DataStax
>>>>> @al3xandru