Hello everyone,
I saw a thread on the incubator user chat that started a few months ago:
http://www.mail-archive.com/cassandra-u...@incubator.apache.org/msg02047.html
It looks like this is the new official user mailing list, so I'll add
my thoughts/question here.
Is there any way to set a TTL on data stored in Cassandra? Deleting old
SSTables isn't enough for my needs. I need the data to go away after a
fixed period of time. Here is what I'm trying to do and my reasoning why
I think Cassandra and not something like Flare/Memcache meets my needs:
I'm building a reputation system. We get lots of data at my work (in the
tens of GB of reputation data a day). The trick is that old data is not
useful, as a sender's IP address might have changed, they might have had a
bot on their system and have now removed it, etc. So I need to be able to
keep data for a fixed period of time and then afterwards it isn't
needed/ideally would be GC'd out.
We want to do one thing if we have either never heard of the individual or at
least not since the expiry time, and another thing based on the
reputation data that is stored in Cassandra if it is current. So ideally
a Cassandra call for a key for someone whose reputation has expired would
return nothing and we'd reply with our default reputation for that
individual. There really is no point using network bandwidth to return
all the fields associated with that key only to look at a timestamp and
end up ignoring it anyway. Similarly, the latency of requesting first
the timestamp and then the data in two separate requests is prohibitive.
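To make the read behaviour I'm after concrete, here is a minimal Python
sketch. It is only an illustration of the semantics, not Cassandra's API:
ExpiringStore is a hypothetical in-memory stand-in for the datastore, and
DEFAULT_REPUTATION, lookup_reputation and the "score" column are names made
up for the example. The point is that the expiry check happens on the
store side, so an expired key looks exactly like a key that was never
written.

```python
import time

# Hypothetical default used when nothing current is known about a sender.
DEFAULT_REPUTATION = {"score": 0}


class ExpiringStore:
    """In-memory stand-in for the datastore (NOT the Cassandra API)."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self.rows = {}              # key -> (written_at, columns)

    def put(self, key, columns):
        # Record the write time alongside the columns.
        self.rows[key] = (self.clock(), dict(columns))

    def get(self, key):
        entry = self.rows.get(key)
        if entry is None:
            return None             # never heard of this key
        written_at, columns = entry
        if self.clock() - written_at >= self.ttl_seconds:
            del self.rows[key]      # expired: drop it and report nothing
            return None
        return columns


def lookup_reputation(store, key):
    """One request: current data if present, otherwise the default."""
    columns = store.get(key)
    return columns if columns is not None else DEFAULT_REPUTATION
```

With something like this the client makes a single call and never has to
fetch a timestamp first or pull back fields it is going to ignore.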
Why Cassandra:
* Our data is complex and is hard to handle completely in a
key/value sense. In the past we were doing this and just encoding
the complex structure inside of JSON but this isn't ideal. It is
very nice algorithmically to be able to say: give me this column,
or update this element of this hash etc, rather than having to
pull the old version, decode, modify, re-encode and push back to a
cache-based system.
* Our data is large (in the low TBs at the moment, but expected to
grow to 50-100TB of live data)
* Need quick response for both searches and writes: typically for
each thing we track we get a request for the reputation, the
message gets processed and then we get feedback back from the
recipient. So reads and writes are symmetric.
* High request rate: millions per hour
* Hundreds of millions of unique reputations (this is why crawling
through the data with a script purging old data doesn't make sense)
* Availability/load balancing is a must. Data needs to be replicated, and a
disk copy is useful so that if we have a power outage we don't lose the
system.
* It would be interesting to keep a local subset of our data at
customers' sites and have them "replicate up" their data rather
than send their feedback in a different manner that then has to be
processed and pumped into our datastore (hopefully this is
possible with Cassandra with some creative choices of how the data
is hashed between nodes)
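As a rough illustration of the first point in the list above (JSON blobs
versus addressable columns), here is a small Python sketch. Plain dicts
stand in for both the cache and the column store, and the function names
are hypothetical; nothing here is a real Memcache or Cassandra call.

```python
import json


def update_blob(cache, key, field, value):
    """Key/value style: pull, decode, modify, re-encode, push back."""
    record = json.loads(cache[key])   # fetch and decode the whole record
    record[field] = value             # change one field
    cache[key] = json.dumps(record)   # re-encode and store everything again


def update_column(rows, key, column, value):
    """Column style: write just the one column that changed."""
    rows.setdefault(key, {})[column] = value
```

The blob version moves the entire record in both directions for every
change, while the column version only touches the element being updated.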
Does the capability to set an expiry time exist? If not, are there any
plans to add it? My Java experience is very limited (I'm accessing
Cassandra via Thrift/Perl) so it isn't something I'd be able to jump in
and run with myself.