Thank you! This solves my issue. But what about recomputing the index
(after new columns are inserted)?
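If I have to keep that index column family in sync myself, I suppose every
insert becomes a dual write, roughly like this (an untested sketch; the
pycassa-style calls, the column family names and the single-wide-row layout
are only my assumptions):

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace')
data = pycassa.ColumnFamily(pool, 'Records')       # record id -> record value
index = pycassa.ColumnFamily(pool, 'RecordIndex')  # record number -> record id

def insert_record(record_id, record_value, record_number):
    # the write I already do today
    data.insert('records', {record_id: record_value})
    # the extra write that keeps the index up to date: zero-padded record
    # number as the column name, record id as the value
    index.insert('records', {'%08d' % record_number: record_id})

So the question is whether I really have to do that second write myself, or
whether triggers / 0.7 secondary indexes can take care of it.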
Should I use asynchronous triggers
(https://issues.apache.org/jira/browse/CASSANDRA-1311)? Or will 0.7's
secondary indexes handle this?

Augi

2010/9/6 Dr. Martin Grabmüller <martin.grabmuel...@eleven.de>

> Have you considered creating a second column family which acts as an
> index for the original column family? Have the record number as the
> column name and the value as the identifier (primary key) of the
> original data, and do a
>
> 1. get_slice(<index_column_family>, start='00051235', finish='', limit=100)
> 2. get_slice(<original_column_family>, columns=<list of returned column values>)
>
> This way, only 100 columns are returned on the first call, and 100
> columns (or super columns) on the second. You have two calls instead of
> one, but it should be faster because much less data is transferred (and
> the latency can be hidden by concurrency).
>
> Martin
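Just to check that I understand the read path correctly, it would then look
something like this (a rough sketch; pycassa-style calls, and the column
family and row key names are made up):

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace')
data = pycassa.ColumnFamily(pool, 'Records')       # record id -> record value
index = pycassa.ColumnFamily(pool, 'RecordIndex')  # record number -> record id

def get_page(first_record_number, page_size=100):
    # 1) slice the index row: page_size columns, starting at the requested
    #    record number ('00051235' style, zero-padded)
    page = index.get('records',
                     column_start='%08d' % first_record_number,
                     column_finish='',
                     column_count=page_size)
    # 2) fetch exactly those columns from the data row, by name
    return data.get('records', columns=list(page.values()))

Only about 100 columns cross the wire in each call, instead of millions.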
> ------------------------------
> From: Michal Augustýn [mailto:augustyn.mic...@gmail.com]
> Sent: Monday, September 06, 2010 10:26 AM
> To: user@cassandra.apache.org
> Subject: Re: skip + limit support in GetSlice
>
> Hi Mike,
>
> yes, I read the PDF, start to finish. Twice. As I wrote, my application
> is not accessed by users; it is accessed by other applications that can
> request pages randomly.
>
> So when some application wants page 51235 (so skip is 5123500 and limit
> is 100), I have to:
>
> 1) GetSlice(from: "", to: "", limit: 5123500)
> 2) Read only the last column name returned.
> 3) GetSlice(from: <that column name>, to: "", limit: 100)
>
> The problem is in 1): Cassandra has to read 5123500 columns, serialize
> them, send them over the Thrift protocol, and deserialize them. Finally,
> I throw 5123499 of those columns away. That doesn't seem very efficient.
>
> So I'm looking for another solution for this scenario. I know the
> recommended way to paginate in Cassandra, and I use it wherever I can...
>
> So if this kind of pagination cannot be added to the standard Cassandra
> Thrift API, then I should build some separate server-side API that
> handles my scenario (and avoids the high network traffic). Am I right?
>
> Thanks!
>
> Augi
>
> 2010/9/5 Mike Peters <cassan...@softwareprojects.com>
>
>> Hi Michal,
>>
>> Did you read the PDF Stu sent over, start to finish? There are several
>> different approaches described there.
>>
>> With Cassandra, what we found works best for pagination:
>> * Keep a separate 'total_records' count and increment/decrement it on
>>   every insert/delete.
>> * When getting slices, pass the 'last seen' column as the 'from' and
>>   keep the 'to' empty. Pass the number of records you want to show per
>>   page as the 'count'.
>> * Avoid letting the user skip to page X; offer Next/Prev/First/Last
>>   only (the same way Gmail does it).
>>
>> Michal Augustýn wrote:
>>
>> I know that "Prev/Next" is a good solution for web applications. But
>> what if I want to access the data from another application, or access
>> pages randomly...
>>
>> I don't know the internal structure of memtables etc., so I don't know
>> whether the columns in a row are indexable. If not, then I just want to
>> move my workaround to the server side (to avoid huge network traffic)...
>>
>> 2010/9/5 Stu Hood <stu.h...@rackspace.com>
>>
>>> Cassandra supports the recommended approach from:
>>> http://www.percona.com/ppc2009/PPC2009_mysql_pagination.pdf
>>>
>>> For large numbers of items, skip + limit is extremely inefficient.
>>>
>>> -----Original Message-----
>>> From: "Michal Augustýn" <augustyn.mic...@gmail.com>
>>> Sent: Sunday, September 5, 2010 5:39am
>>> To: user@cassandra.apache.org
>>> Subject: skip + limit support in GetSlice
>>>
>>> Hello,
>>>
>>> this is probably a feature request. Simply put, I would like to have
>>> support for standard pagination (skip + limit) in the GetSlice Thrift
>>> method. Is this feature on the road map?
>>>
>>> Right now I have to perform a GetSlice call that starts at "" with
>>> "limit" set to the "skip" value. Then I read the last column name
>>> returned and perform the final GetSlice call, using that last column
>>> name as "start" and setting "limit" to the real "limit" value.
>>>
>>> This workaround is not very efficient when I need to skip a lot of
>>> columns (when "skip" is high), because a lot of data must be
>>> transferred over the network. So I think support for skip in GetSlice
>>> would be very useful (to avoid the high network traffic).
>>>
>>> The implementation could be very straightforward (the same as the
>>> workaround), or maybe it could be more efficient - I think the whole
>>> row (all columns) must fit into memory anyway, so if we already have
>>> all the columns in memory...
>>>
>>> Thank you!
>>>
>>> Augi
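For reference, the workaround described in the original message looks
roughly like this in code (again a sketch with pycassa-style calls and
made-up names); the first call is the one that drags millions of columns
over the wire:

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace')
data = pycassa.ColumnFamily(pool, 'Records')

def get_page_by_skip(skip, limit):
    # 1) read 'skip' columns only to learn the name of the last one
    skipped = data.get('records', column_start='', column_finish='',
                       column_count=skip)
    last_name = list(skipped.keys())[-1]
    # 2) slice again from that column name with the real limit
    #    (the start column is inclusive, so the first column returned here
    #    is the last one we already skipped)
    return data.get('records', column_start=last_name, column_finish='',
                    column_count=limit)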