Re: Finding new Cassandra data

Phil Stanhope Tue, 22 Jun 2010 08:58:10 -0700

I can envision two fundamentally different approaches:

1. A CF that is CompareWith LONG ... use microsecond timestamps as your keys 
... then you can filter by time ranges.

This implies that you are willing to do a double write (once for the original 
data and then again for the logging), plus a third operation: a range_slice 
read (which will most likely require pagination) to determine what to then 
push into your other system.
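
As a rough illustration of approach 1, here is a minimal in-memory sketch in 
Python. It does not talk to Cassandra at all: TimeIndex, record and 
range_pages are hypothetical names of my own, and a sorted list stands in for 
a CF ordered by long timestamps. It only shows the shape of the extra write 
and of the paginated time-range read.

```python
import bisect
import time


def now_micros():
    """Current wall-clock time as an integer number of microseconds."""
    return int(time.time() * 1_000_000)


class TimeIndex:
    """In-memory stand-in (an assumption) for a CF sorted by long timestamps."""

    def __init__(self):
        self._entries = []  # kept sorted: (timestamp_micros, row_key)

    def record(self, row_key, ts=None):
        """The second write: log that row_key was created at time ts."""
        ts = now_micros() if ts is None else ts
        bisect.insort(self._entries, (ts, row_key))
        return ts

    def range_pages(self, start, end, page_size=100):
        """Yield pages of (ts, row_key) entries with start <= ts < end."""
        # A 1-tuple sorts before any 2-tuple sharing its first element,
        # so these lookups find the first entry with ts >= start / ts >= end.
        lo = bisect.bisect_left(self._entries, (start,))
        hi = bisect.bisect_left(self._entries, (end,))
        for i in range(lo, hi, page_size):
            yield self._entries[i:min(i + page_size, hi)]
```

The export job would remember the end of the last range it processed and use 
it as the start of the next one, walking the pages in between.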

Which raises a question ... if you are the one generating and inserting the 
keys ... and you know the key name ... why not simply push the key into a 
queue (non-Cassandra) and do processing against that. So ...

2. Don't store new row keys in a CF ... at the point of using the thrift API 
simply build a log of new keys and process that log asynchronously.
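
A minimal sketch of approach 2, again in Python and again with made-up names 
(insert_row, drain_new_keys, push). A dict stands in for the Cassandra write 
and the standard library queue stands in for whatever non-Cassandra queue you 
would actually use, so treat this as an outline rather than an 
implementation.

```python
import queue


def insert_row(store, new_keys, key, columns):
    """Write the row, then note its key for later asynchronous processing."""
    store[key] = columns  # stand-in for the real insert through the thrift API
    new_keys.put(key)     # the "log of new keys"


def drain_new_keys(store, new_keys, push):
    """Push every queued row to the other system; return how many were sent."""
    count = 0
    while True:
        try:
            key = new_keys.get_nowait()
        except queue.Empty:
            return count
        push(key, store[key])  # push is a hypothetical callback, e.g. an Oracle insert
        count += 1
```

In practice drain_new_keys would run in a separate worker or process, and 
the queue would need to be durable if losing a key is not acceptable.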

This approach causes you to ask yourself another question: of the nodes in my 
cluster, am I willing to declare that some of those nodes are only available 
for write-thru processing? It's not Cassandra's job to make these decisions for 
you ... it's an application decision. If you allow all nodes to perform 
writes, then you'll either have to consolidate logs or introduce some form of 
common queue for coordination of the async updates to non-Cassandra data stores.
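
If you go the log consolidation route, one possible shape (a sketch only, 
assuming each writer keeps its own log of (timestamp, key) entries in time 
order; consolidate_logs is a hypothetical helper) is a simple k-way merge:

```python
import heapq


def consolidate_logs(*node_logs):
    """Merge per-node logs, each already sorted by timestamp, into one list."""
    # heapq.merge assumes every input is sorted and yields entries in order.
    return list(heapq.merge(*node_logs))
```

This assumes the writers' clocks are close enough that timestamp order is 
meaningful; a common queue sidesteps that question at the cost of another 
moving part.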

-phil

On Jun 22, 2010, at 11:18 AM, Gary Dusbabek wrote:

> On Tue, Jun 22, 2010 at 09:59, David Boxenhorn <da...@lookin2.com> wrote:
>> In my system, I have a Cassandra front end, and an Oracle back end. Some
>> information is created in the back end, and pushed out to the front end, and
>> some information is created in the front end and pulled into the back end.
>> 
>> Question: How do I locate new rows that have been created in Cassandra, for
>> import into Oracle?
>> 
>> I'm thinking of having a special column family "newRows" that contains only
>> the keys of the new rows. The offline process would look there to see what's
>> new, then delete those rows. The "newRows" CF would have no data! (The data
>> would be in the "real" CF.)
> 
> I've never tried an empty row, but I'm pretty sure you need at least one 
> column.
> 
>> 
>> Is this a good solution? It seems weird to have a CF with rows but no data.
>> But I can't think of a better way.
>> 
>> Any thoughts?
> 
> Another approach would be to have a CF with a single row whose column
> names refer to the new row ids.  This would allow you efficient
> slicing.  The downside is that you'd need to make sure the row doesn't
> get too wide.  So depending on your throughput and application
> behavior, this may or may not work.
> 
> Gary.
