Thanks a lot, Jonathan and everyone else who replied to my thread. This looks like it will do what I need. I have a colleague who is a Java wizard and will probably have no problem putting this patch into place for our production builds.

I'm a C/C++ programmer at heart, so the code itself doesn't scare me; it's just that my unfamiliarity with Java's nuances makes me reluctant to try adding this myself.
On 03/31/2010 11:46 AM, Jonathan Ellis wrote:
Sounds like you want to follow
https://issues.apache.org/jira/browse/CASSANDRA-699.  There is a patch
there but I wouldn't recommend merging it if Java scares you. :)

On Wed, Mar 31, 2010 at 1:39 PM, Mike Gallamore
<mike.e.gallam...@googlemail.com>  wrote:
Hello everyone,

I saw a thread on the incubator user chat that started a few months ago:
http://www.mail-archive.com/cassandra-u...@incubator.apache.org/msg02047.html
. It looks like this is the new official user mailing list so I'll add my
thoughts/question here.

Is there any way to set a TTL on data stored in Cassandra? Deleting old
SSTables isn't enough for my needs. I need the data to go away after a fixed
period of time. Here is what I'm trying to do and my reasoning why I think
Cassandra and not something like Flare/Memcache meets my needs:

I'm building a reputation system. We get lots of data at my work (in the
tens of GB of reputation data a day). The trick is that old data is not
useful, as a sender's IP address might have changed, they might have had a bot
on their system and have now removed it, etc. So I need to be able to keep
data for a fixed period of time, after which it isn't needed and ideally
would be GC'd out.

We want to do one thing if we either never heard of the individual or at
least not since the expiry time, and another thing based on the reputation
data that is stored in Cassandra if it is current. So ideally a Cassandra
call for a key for someone whose reputation is expired would return nothing
and we'd reply with our default reputation for that individual. There really
is no point using network bandwidth to return all the fields associated with
that key only to look at a timestamp and end up ignoring it anyways.
Similarly the latency of requesting first the timestamp and then the data in
two separate requests is prohibitive.
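
To make the behaviour described above concrete, here is a minimal Python sketch of the read semantics being asked for. It is an in-memory stand-in, not Cassandra's API: the names ExpiringStore, lookup_reputation and DEFAULT_REPUTATION are hypothetical and exist only for illustration.

```python
import time

# Hypothetical default applied when nothing current is known about a sender.
DEFAULT_REPUTATION = {"score": 0}


class ExpiringStore:
    """In-memory stand-in for a store with a fixed TTL (not Cassandra's API)."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self.rows = {}              # key -> (written_at, columns)

    def put(self, key, columns):
        self.rows[key] = (self.clock(), dict(columns))

    def get(self, key):
        """Return the columns for key, or None if unknown or expired."""
        entry = self.rows.get(key)
        if entry is None:
            return None
        written_at, columns = entry
        if self.clock() - written_at >= self.ttl:
            # Expired data is dropped on the store side, so the caller
            # never receives stale fields just to discard them.
            del self.rows[key]
            return None
        return columns


def lookup_reputation(store, sender):
    """One request: current data if present, otherwise the default."""
    columns = store.get(sender)
    return columns if columns is not None else DEFAULT_REPUTATION
```

The point of the sketch is that the expiry check happens inside the store, so a single request either returns current data or nothing at all.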

Why Cassandra:

- Our data is complex and is hard to handle completely in a key/value sense.
  In the past we were doing this and just encoding the complex structure
  inside of JSON, but this isn't ideal. It is very nice algorithmically to be
  able to say: give me this column, or update this element of this hash, etc.,
  rather than having to pull the old version, decode, modify, re-encode and
  push back to a cache-based system.
- Our data is large (in the low TBs at the moment, but expected to grow to
  50-100 TB of live data).
- We need quick responses for both searches and writes: typically for each
  thing we track we get a request for the reputation, the message gets
  processed and then we get feedback from the recipient. So reads and writes
  are symmetric.
- High request rate: millions per hour.
- Hundreds of millions of unique reputations (this is why crawling through the
  data with a script purging old data doesn't make sense).
- Availability/load balancing is a must. Data needs to be replicated, and a
  disk copy is useful so that if we have a power outage we don't lose the
  system.
- It would be interesting to keep a local subset of our data at customers'
  sites and have them "replicate up" their data rather than send their
  feedback in a different manner that then has to be processed and pumped into
  our datastore (hopefully this is possible with Cassandra with some creative
  choices of how the data is hashed between nodes).
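
The first point in the list above (JSON blobs versus addressable columns) can be illustrated with a small Python sketch. Plain dictionaries stand in for both the cache and the column store; set_in_blob and set_in_columns are hypothetical helper names, not part of any real client library.

```python
import json


def set_in_blob(cache, key, field, value):
    """Key/value style: pull, decode, modify, re-encode, push back."""
    record = json.loads(cache[key])     # pull and decode the whole record
    record[field] = value               # modify one element
    cache[key] = json.dumps(record)     # re-encode and push everything back


def set_in_columns(rows, key, column, value):
    """Column style: address and write a single column directly."""
    rows.setdefault(key, {})[column] = value
```

With the blob approach every update moves the whole record in both directions; with addressable columns only the changed element is written.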

Does the capability to set an expiry time exist? If not, are there any plans
to add it? My Java experience is very limited (I'm accessing Cassandra via
Thrift/Perl), so it isn't something I'd be able to jump in and run with
myself.

