We also applied this patch to the 0.6 branch and have been running it for a bit over a week. Works well, would love to see it get into trunk/0.7 proper.
From: Ryan Daum [mailto:r...@thimbleware.com]
Sent: Wednesday, March 31, 2010 11:49 AM
To: user@cassandra.apache.org
Subject: Re: expiring data out of Cassandra/time to live

I was able to successfully merge this patch into the 0.6 branch a few weeks ago by doing the following:

* Downloading the patch
* Checking out the trunk of Cassandra from github
* Rolling back (checking out) the git repo to the same date that the patch was submitted to Jira
* Applying the patch
* Committing to Git
* Merging forward to the 0.6 branch
* Resolving one or two minor conflicts.

R

On Wed, Mar 31, 2010 at 2:46 PM, Jonathan Ellis <jbel...@gmail.com> wrote:

Sounds like you want to follow https://issues.apache.org/jira/browse/CASSANDRA-699. There is a patch there but I wouldn't recommend merging it if Java scares you. :)

On Wed, Mar 31, 2010 at 1:39 PM, Mike Gallamore <mike.e.gallam...@googlemail.com> wrote:
> Hello everyone,
>
> I saw a thread on the incubator user chat that started a few months ago:
> http://www.mail-archive.com/cassandra-u...@incubator.apache.org/msg02047.html
> It looks like this is the new official user mailing list so I'll add my
> thoughts/question here.
>
> Is there any way to set a TTL on data stored in Cassandra? Deleting old
> SSTables isn't enough for my needs. I need the data to go away after a fixed
> period of time. Here is what I'm trying to do and my reasoning why I think
> Cassandra and not something like Flare/Memcache meets my need:
>
> I'm building a reputation system. We get lots of data at my work (in the
> tens of GB of reputation data a day). The trick is that old data is not
> useful, as a sender's IP address might have changed, they might have had a bot
> on their system and now have removed it, etc. So I need to be able to keep
> data for a fixed period of time, and afterwards it isn't needed and ideally
> would be GC'd out.
>
> We want to do one thing if we either never heard of the individual or at
> least not since the expiry time, and another thing based on the reputation
> data that is stored in Cassandra if it is current. So ideally a Cassandra
> call for a key for someone whose reputation is expired would return nothing
> and we'd reply with our default reputation for that individual. There really
> is no point using network bandwidth to return all the fields associated with
> that key only to look at a timestamp and end up ignoring it anyway.
> Similarly the latency of requesting first the timestamp and then the data in
> two separate requests is prohibitive.
>
> Why Cassandra:
>
> Our data is complex and is hard to handle completely in a key/value sense.
> In the past we were doing this and just encoding the complex structure
> inside of JSON but this isn't ideal. It is very nice algorithmically to be
> able to say: give me this column, or update this element of this hash etc,
> rather than having to pull the old version, decode, modify, re-encode and
> push back to a cache based system.
>
> Our data is large (in the low TBs at the moment, but expected to grow to
> 50-100TB of live data)
>
> Need quick response for both searches and writes: typically for each thing
> we track we get a request for the reputation, the message gets processed and
> then we get feedback back from the recipient. So reads and writes are
> symmetric.
>
> High request rate: millions per hour
>
> Hundreds of millions of unique reputations (this is why crawling through the
> data with a script purging old data doesn't make sense)
>
> Availability/load balancing a must. Data needs to be replicated; a disk copy
> is useful so if we have a power outage we don't lose the system.
>
> It would be interesting to keep a local subset of our data at customers'
> sites and have them "replicate up" their data rather than send their
> feedback in a different manner that then has to be processed and pumped into
> our datastore (hopefully this is possible with Cassandra with some creative
> choices of how the data is hashed between nodes)
>
> Does the capability to set an expiry time exist? If not, are there any plans
> to add it? My Java experience is very limited (I'm accessing Cassandra via
> thrift/Perl) so it isn't something I'd be able to jump in and run with
> myself.
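
For readers skimming the thread, the read behaviour Mike asks for (an expired key returns nothing, and the caller falls back to a default reputation) can be sketched in a few lines of Python. This is only a toy in-memory illustration of the semantics. It is not the Cassandra API and not how the CASSANDRA-699 patch works; `ExpiringStore`, `lookup_reputation`, `DEFAULT_REPUTATION`, and the `score` field are all hypothetical names made up for the example.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical default returned for unknown or expired senders.
DEFAULT_REPUTATION: Dict[str, object] = {"score": 0}


class ExpiringStore:
    """Toy in-memory store: every row has an expiry time and reads hide expired rows."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._rows: Dict[str, Tuple[float, Dict[str, object]]] = {}

    def put(self, key: str, columns: Dict[str, object], ttl_seconds: float) -> None:
        # Store the absolute expiry time alongside a copy of the columns.
        self._rows[key] = (self._clock() + ttl_seconds, dict(columns))

    def get(self, key: str) -> Optional[Dict[str, object]]:
        entry = self._rows.get(key)
        if entry is None:
            return None
        expires_at, columns = entry
        if self._clock() >= expires_at:
            # Expired: drop the row and answer as if it was never stored.
            del self._rows[key]
            return None
        return columns


def lookup_reputation(store: ExpiringStore, sender: str) -> Dict[str, object]:
    """Return the current reputation, or the default if none is stored or it expired."""
    columns = store.get(sender)
    return columns if columns is not None else DEFAULT_REPUTATION
```

The point of the sketch is that the expiry check happens inside the store, so the caller makes one request and never receives stale columns just to inspect a timestamp and discard them.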