Re: implementing a 'sorted set' on top of cassandra

Mike Torra Tue, 17 Jan 2017 08:48:52 -0800

Thanks for the feedback everyone! Redis `zincryby` and `zrangebyscore` is 
indeed what we use today.


Caching the resulting 'sorted sets' in redis is exactly what I plan to do. 
There will be tens of thousands of these sorted sets, each generally with <10k 
items (with maybe a few exceptions going a bit over that). The reason to 
periodically calculate the set and store it in cassandra is to avoid having the 
client do that work, when the client only really cares about the top 100 or so 
items at any given time. Being truly "real time" is not critical for us, but it 
is a selling point to be as up to date as possible.

I'd like to understand the performance issue of frequently updating these sets. 
I understand that every time I 'regenerate' the sorted set, any rows that 
change will create a tombstone - for example, if "item_1" is in first place and 
"item_2" is in second place, then they switch on the next update, that would be 
two tombstones. Do you think this will be a big enough problem that it is worth 
doing the sorting work client side, on demand, and just try to eat the 
performance hit there? My thought was to make a tradeoff by using more 
cassandra disk space (ie pre calculating all sets), in exchange for faster 
reads when requests actually come in that need this data.

From: Benjamin Roth <benjamin.r...@jaumo.com<mailto:benjamin.r...@jaumo.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Saturday, January 14, 2017 at 1:25 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Re: implementing a 'sorted set' on top of cassandra

Mike mentioned "increment" in his initial post. That let me think of a case 
with increments and fetching a top list by a counter like
https://redis.io/commands/zincrby
https://redis.io/commands/zrangebyscore

1. Cassandra is absolutely not made to sort by a counter (or a non-counter 
numeric incrementing value) but it is made to store counters. In this case a 
partition could be seen as a set.
2. I thought of CS for persistence and - depending on the app requirements like 
real-time and set size - still use redis as a read cache

2017-01-14 18:45 GMT+01:00 Jonathan Haddad 
<j...@jonhaddad.com<mailto:j...@jonhaddad.com>>:
Sorted sets don't have a requirement of incrementing / decrementing. They're 
commonly used for thing like leaderboards where the values are arbitrary.

In Redis they are implemented with 2 data structures for efficient lookups of 
either key or value. No getting around that as far as I know.

In Cassandra they would require using the score as a clustering column in order 
to select top N scores (and paginate). That means a tombstone whenever the 
value for a key in the set changes. In sets with high rates of change that 
means a lot of tombstones and thus terrible performance.
On Sat, Jan 14, 2017 at 9:40 AM DuyHai Doan 
<doanduy...@gmail.com<mailto:doanduy...@gmail.com>> wrote:
Sorting on an "incremented" numeric value has always been a nightmare to be 
done properly in C*

Either use Counter type but then no sorting is possible since counter cannot be 
used as type for clustering column (which allows sort)

Or use simple numeric type on clustering column but then to increment the value 
*concurrently* and *safely* it's prohibitive (SELECT to fetch current value + 
UPDATE ... IF value = <old_value>) + retry



On Sat, Jan 14, 2017 at 8:54 AM, Benjamin Roth 
<benjamin.r...@jaumo.com<mailto:benjamin.r...@jaumo.com>> wrote:
If your proposed solution is crazy depends on your needs :)
It sounds like you can live with not-realtime data. So it is ok to cache it. 
Why preproduce the results if you only need 5% of them? Why not use redis as a 
cache with expiring sorted sets that are filled on demand from cassandra 
partitions with counters?
So redis has much less to do and can scale much better. And you are not limited 
on keeping all data in ram as cache data is volatile and can be evicted on 
demand.
If this is effective also depends on the size of your sets. CS wont be able to 
sort them by score for you, so you will have to load the complete set to redis 
for caching and / or do sorting in your app on demand. This certainly won't 
work out well with sets with millions of entries.

2017-01-13 23:14 GMT+01:00 Mike Torra 
<mto...@demandware.com<mailto:mto...@demandware.com>>:
We currently use redis to store sorted sets that we increment many, many times 
more than we read. For example, only about 5% of these sets are ever read. We 
are getting to the point where redis is becoming difficult to scale (currently 
at >20 nodes).

We've started using cassandra for other things, and now we are experimenting to 
see if having a similar 'sorted set' data structure is feasible in cassandra. 
My approach so far is:

  1.  Use a counter CF to store the values I want to sort by
  2.  Periodically read in all key/values in the counter CF and sort in the 
client application (~every five minutes or so)
  3.  Write back to a different CF with the ordered keys I care about

Does this seem crazy? Is there a simpler way to do this in cassandra?



--
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com<http://www.jaumo.com>
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6<tel:+49%207161%203048806> · Fax +49 7161 
304880-1<tel:+49%207161%203048801>
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer




--
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com<http://www.jaumo.com>
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: implementing a 'sorted set' on top of cassandra

Reply via email to