On Wed, Oct 27, 2010 at 03:24, Arijit Mukherjee <ariji...@gmail.com> wrote:
> Hi All
>
> I've another related question.
>
> I am using a stream of records of the form (A, B, n) where the pair
> (A,B) can occur multiple times. For example, you could have the
> following set of records -
>
> A, B, 2
> P, Q, 5
> X, Y, 3
> A, B, 8
> A, B, 2
> ...
>
>
> The data store has a set of columns - (key, count, sum). Because of
> the possibility of duplicate A and B, I am using the string A+B as my
> key. Every time there is a duplicate A+B, I update a count field, and
> add "n" to the existing value of sum. So, for the above set of
> records, Cassandra should actually hold the following set -
>
> A+B, 3, 12
> P+Q, 1, 5
> X+Y, 1, 3
> ...

You want a distributed counter.

>
> My question is - is it possible to have multiple threads reading
> different streams so that I can parallelize the insertion mechanism?
> What may happen if two threads try to insert two different records
> with the same A+B key?
>

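No, this isn't going to work.  Each update is a read followed by a
write, and Cassandra doesn't make that pair atomic, so two threads
updating the same A+B key can both read the old count and sum, and one
of the updates will be lost.  At some point Cassandra will have
distributed counters, probably with a few caveats.  See
https://issues.apache.org/jira/browse/CASSANDRA-1546 and related
tickets for more information.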
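
To make the problem concrete, here is a minimal sketch of the lost
update. It uses a plain Python dict as a stand-in for the row (it is
not a Cassandra client), and the read/write helpers are hypothetical
names made up for the illustration. The two "threads" are interleaved
by hand so the outcome is deterministic.

```python
# In-memory stand-in for the row keyed by "A+B" (assumption: not a real client).
store = {"A+B": {"count": 1, "sum": 2}}

def read(key):
    # Return a copy, the way a client-side read gives you a snapshot.
    return dict(store[key])

def write(key, count, total):
    # Overwrite both columns with whatever the caller computed.
    store[key] = {"count": count, "sum": total}

# Two writers each handle one new (A, B, n) record: n=8 and n=2.
seen_by_t1 = read("A+B")  # writer 1 sees count=1, sum=2
seen_by_t2 = read("A+B")  # writer 2 sees the same, now stale, values
write("A+B", seen_by_t1["count"] + 1, seen_by_t1["sum"] + 8)
write("A+B", seen_by_t2["count"] + 1, seen_by_t2["sum"] + 2)

# The correct result would be count=3, sum=12; writer 1's update is gone.
print(store["A+B"])
```
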

The best approach I can suggest at this point is to continue inserting
the increments as column names and then manually sum them up when you
need to.  There are several approaches you could take if you're
interested in consolidating slices of the increments that would be
reasonably safe against the possibility of concurrent updates.
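
As a rough sketch of what I mean, assuming each increment goes into
its own uniquely named column and the value holds n: the rows below
are a plain in-memory dict, not a Cassandra client, and the names
(record, totals, the random UUID column names) are made up for the
illustration. The lock only protects the Python dict; it is not part
of the technique.

```python
import threading
import uuid
from collections import defaultdict

rows = defaultdict(dict)   # row key -> {column name: increment}
lock = threading.Lock()    # guards the in-memory dict only

def record(a, b, n):
    key = f"{a}+{b}"
    column = uuid.uuid4().hex  # unique name, so writers never overwrite each other
    with lock:
        rows[key][column] = n

def totals(key):
    # Sum the increments at read time instead of keeping a running total.
    with lock:
        values = list(rows.get(key, {}).values())
    return len(values), sum(values)

def consume(stream):
    for a, b, n in stream:
        record(a, b, n)

# Two streams read in parallel, using the records from the example.
stream1 = [("A", "B", 2), ("P", "Q", 5), ("X", "Y", 3)]
stream2 = [("A", "B", 8), ("A", "B", 2)]
threads = [threading.Thread(target=consume, args=(s,)) for s in (stream1, stream2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(totals("A+B"), totals("P+Q"), totals("X+Y"))
```

Since every write lands in a new column, there is no read before the
write and nothing to lose, at the cost of doing the sum when you read.
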

Gary.
