OK...I just verified on a clean single EC2 small instance using apache-cassandra-0.6.2-src. I'm pretty sure the Cassandra MapReduce functionality is broken.
If your MapReduce jobs are idempotent then you are OK, but if you are doing things like word count (as in the supplied example) or key count you will get double counts.

-Corey

On Fri, Jun 18, 2010 at 3:15 PM, Corey Hulen <c...@earnstone.com> wrote:
> I thought the same thing, but using the supplied contrib example I just
> delete the /var/lib/data dirs and commit log.
>
> -Corey
>
> On Fri, Jun 18, 2010 at 3:11 PM, Phil Stanhope <pstanh...@wimba.com> wrote:
>> "blow all the data away" ... how do you do that? What is the timestamp
>> precision that you are using when creating key/col or key/supercol/col
>> items?
>>
>> I have seen a failure to write a key when the timestamp is identical to
>> the previous timestamp of a deleted key/col. While I didn't examine the
>> source code, I'm certain that this is due to delete tombstones.
>>
>> I view this as an application error because I was attempting to do this
>> within the GCGraceSeconds time period. If, however, I stopped Cassandra,
>> blew away the data & commitlogs, and restarted, the write always
>> succeeded (no surprise there).
>>
>> I turned this behavior into a feature (of sorts). When this happens I
>> increment a formerly zero portion of the timestamp (the last digit of
>> precision, which was always zero) and use this as a counter to track how
>> many times a key/col was updated (max 9 for my purposes).
>>
>> -phil
>>
>> On Jun 18, 2010, at 5:49 PM, Corey Hulen wrote:
>>
>>> We are using MapReduce to periodically verify and rebuild our secondary
>>> indexes, along with counting total records. We started to notice double
>>> counting of unique keys in single-machine standalone tests. We were
>>> finally able to reproduce the problem using the
>>> apache-cassandra-0.6.2-src/contrib/word_count example and just
>>> re-running it multiple times. We are hoping someone can verify the bug.
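Phil's timestamp trick in the quoted message above can be sketched roughly as follows. This is a minimal illustration, not code from Cassandra or his application; all names are made up, and it assumes timestamps are microsecond integers whose last digit is normally zero (so that digit is free to act as an update counter that also defeats timestamp-identical tombstone collisions):

```python
def base_timestamp(micros):
    """Zero out the last digit so it can serve as an update counter."""
    return micros - (micros % 10)

def next_timestamp(previous, micros):
    """Return a write timestamp strictly greater than `previous`.

    On a collision with the previous write's timestamp, bump the spare
    last digit instead of reusing the identical timestamp (which would
    lose against a delete tombstone). Max 9 bumps, as in the original
    description.
    """
    ts = base_timestamp(micros)
    if ts <= previous:
        counter = (previous % 10) + 1
        if counter > 9:
            raise ValueError("update counter exhausted for this timestamp")
        ts = base_timestamp(previous) + counter
    return ts

def update_count(timestamp):
    """The last digit doubles as a per-key update count (0-9)."""
    return timestamp % 10
```

Because the bumped timestamp is strictly greater than the tombstone's, the rewrite wins even inside the GCGraceSeconds window.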
>>> Re-run the tests and the word count for /tmp/word_count3/part-r-00000
>>> will be 1000 ± ~200 and will change if you blow the data away and
>>> re-run. Notice the setup script loops and only inserts 1000 records, so
>>> we expect the count to be 1000. Once the data is generated, re-running
>>> the setup script and/or MapReduce doesn't change the number (still
>>> off). The key is to blow all the data away and start over, which will
>>> cause it to change.
>>>
>>> Can someone please verify this behavior?
>>>
>>> -Corey
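The idempotency point from the first message can be illustrated with a toy example (pure Python, no Hadoop or Cassandra involved; all names are illustrative). If the input layer feeds some rows to map twice, an additive word count inflates, while an idempotent job that overwrites each key with a derived value is unaffected:

```python
from collections import Counter

rows = ["a b", "b c", "c d"]
# Simulate a faulty input split that re-reads the first row.
rows_with_dup = rows + [rows[0]]

def word_count(rows):
    """Non-idempotent: each (re-)read of a row adds to the totals."""
    counts = Counter()
    for row in rows:
        counts.update(row.split())
    return counts

def rebuild_index(rows):
    """Idempotent: re-writing the same key with the same value is harmless."""
    return {row: set(row.split()) for row in rows}
```

Here `word_count(rows_with_dup)` counts "a" twice while `rebuild_index` produces the same index either way, which matches the observation that index rebuilds survive the bug but word/key counts come out high.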