Hello everyone,

I saw a thread on the incubator user chat that started a few months ago: http://www.mail-archive.com/cassandra-u...@incubator.apache.org/msg02047.html . It looks like this is the new official user mailing list so I'll add my thoughts/question here.

Is there any way to set a TTL on data stored in Cassandra? Deleting old SSTables isn't enough for my needs. I need the data to go away after a fixed period of time. Here is what I'm trying to do, and my reasoning for why I think Cassandra, and not something like Flare/Memcache, meets my needs:

I'm building a reputation system. We get lots of data at my work (tens of GB of reputation data a day). The trick is that old data is not useful: a sender's IP address might have changed, they might have had a bot on their system and have now removed it, etc. So I need to be able to keep data for a fixed period of time, after which it isn't needed and ideally would be GC'd out.

We want to do one thing if we have either never heard of the individual or at least not since the expiry time, and another thing based on the reputation data stored in Cassandra if it is current. So ideally a Cassandra call for the key of someone whose reputation has expired would return nothing, and we'd reply with our default reputation for that individual. There really is no point using network bandwidth to return all the fields associated with that key only to look at a timestamp and end up ignoring them anyway. Similarly, the latency of requesting first the timestamp and then the data in two separate requests is prohibitive.
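
To make the behavior I'm after concrete, here is a minimal in-memory sketch in Python. It is not Cassandra code and does not use any Cassandra API; the names ExpiringStore, lookup_reputation and DEFAULT_REPUTATION are made up for illustration, and the per-row TTL is my assumption about how such a feature might look. The point is only that an expired key should look exactly like a missing key to the caller, in a single request.

```python
import time

# Hypothetical default used when a sender is unknown or their data has expired.
DEFAULT_REPUTATION = {"score": 0}


class ExpiringStore:
    """Toy stand-in for a datastore that expires rows server-side."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock          # injectable so the sketch can be tested
        self._rows = {}             # key -> (written_at, columns)

    def put(self, key, columns):
        # Every write refreshes the row's age.
        self._rows[key] = (self.clock(), dict(columns))

    def get(self, key):
        entry = self._rows.get(key)
        if entry is None:
            return None
        written_at, columns = entry
        if self.clock() - written_at >= self.ttl_seconds:
            # Expired: drop the row and answer as if it never existed.
            del self._rows[key]
            return None
        return columns


def lookup_reputation(store, key):
    # One call; the caller never inspects a timestamp itself.
    columns = store.get(key)
    return columns if columns is not None else DEFAULT_REPUTATION
```

With something like this on the server side, the client never pays to fetch fields it is going to ignore, and never needs a second round trip to check a timestamp first.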

Why Cassandra:

   * Our data is complex and is hard to handle completely in a
     key/value sense. In the past we were doing this and just encoding
     the complex structure inside of JSON but this isn't ideal. It is
     very nice algorithmically to be able to say: give me this column,
     or update this element of this hash etc, rather than having to
     pull the old version, decode, modify, re-encode and push back to a
     cache-based system.
   * Our data is large (in the low TBs at the moment, but expected to
     grow to 50-100TB of live data)
   * Need quick response for both searches and writes: typically for
     each thing we track we get a request for the reputation, the
     message gets processed and then we get feedback back from the
     recipient. So reads and writes are symmetric.
   * High request rate: millions per hour
   * Hundreds of millions of unique reputations (this is why crawling
     through the data with a script purging old data doesn't make sense)
   * Availability/load balancing is a must. Data needs to be
     replicated; a disk copy is useful so that if we have a power
     outage we don't lose the system.
   * It would be interesting to keep a local subset of our data at
     customers' sites and have them "replicate up" their data rather
     than send their feedback in a different manner that then has to be
     processed and pumped into our datastore (hopefully this is
     possible with Cassandra with some creative choices of how the data
     is hashed between nodes)
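
To illustrate the first point in the list above, here is a rough Python sketch of the difference between the two update styles. Plain dicts stand in for both the cache and the column store, and the function names are made up for the example; nothing here is a real client API.

```python
import json


def update_blob(cache, key, field, value):
    # Key/value style: pull the whole record, decode, modify,
    # re-encode and push the whole thing back.
    record = json.loads(cache[key])
    record[field] = value
    cache[key] = json.dumps(record)


def update_column(rows, key, column, value):
    # Column style: touch only the one element that changed.
    rows.setdefault(key, {})[column] = value


# Example usage with in-memory stand-ins.
cache = {"sender-1": json.dumps({"spam": 1, "ham": 2})}
update_blob(cache, "sender-1", "spam", 5)

rows = {"sender-1": {"spam": 1, "ham": 2}}
update_column(rows, "sender-1", "spam", 5)
```

Both end up with the same data, but the first version moves the entire record over the network twice per update, which is what we'd like to get away from.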

Does the capability to set an expiry time exist? If not, are there any plans to add it? My Java experience is very limited (I'm accessing Cassandra via Thrift/Perl), so it isn't something I'd be able to jump in and run with myself.
