Differences in row iteration behavior

Todd Fast Fri, 14 Sep 2012 20:07:48 -0700

Hi--

We are iterating rows in a column family two different ways and are seeing radically different row counts. We are using 1.0.8 and RandomPartitioner on a 3-node cluster.

In the first case, we have a trivial Hadoop job that counts 29M rows using the standard MR pattern for counting (mapper outputs a single key with a value of 1, reducer adds up all the values).
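For readers unfamiliar with the counting pattern, here is a minimal sketch of it in plain Python, with no Hadoop involved. The names `mapper`, `reducer`, and `count_rows`, and the constant key "rows", are hypothetical and chosen only for illustration; this is not our actual job code.

```python
from collections import defaultdict


def mapper(row_key, columns):
    # Emit the same constant key for every row, with a value of 1.
    yield ("rows", 1)


def reducer(key, values):
    # Add up all the 1s that were emitted under the constant key.
    return key, sum(values)


def count_rows(rows):
    # Simulate the shuffle phase: group mapper output by key.
    grouped = defaultdict(list)
    for row_key, columns in rows:
        for key, value in mapper(row_key, columns):
            grouped[key].append(value)
    # Run the reducer once per key and collect the results.
    return dict(reducer(key, values) for key, values in grouped.items())
```

Note that this pattern counts every row handed to the mapper, whether or not the row has any columns.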

In the second case, we have a simple Quartz batch job which counts only 10M rows. We are iterating using chained calls to get_row_slices, as described on the wiki: http://wiki.apache.org/cassandra/FAQ#iter_world

We've also implemented the batch job using Pelops, with and without chaining. In all cases, the job counts just 10M rows, and it is not encountering any errors.
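The chaining approach can be sketched roughly as follows. This is an in-memory stand-in, not the Thrift or Pelops API: `range_slice`, `token`, and `count_rows_chained` are hypothetical names, and the store is just a set of string keys. It assumes each page starts at the given key inclusively, so the first row of every page after the first repeats the last row of the previous page and has to be skipped. Rows are ordered by the MD5 of the key to mimic RandomPartitioner, which orders rows by token rather than by key.

```python
import hashlib


def token(key):
    # Stand-in for RandomPartitioner ordering: sort by MD5 of the key.
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def range_slice(store, start_key, count):
    # Hypothetical stand-in for one range-slice call: return up to `count`
    # keys in token order, starting at `start_key` (inclusive).
    # A start_key of None means "from the beginning of the ring".
    # Re-sorting on every call is wasteful but keeps the sketch short.
    ordered = sorted(store, key=token)
    start = 0 if start_key is None else ordered.index(start_key)
    return ordered[start:start + count]


def count_rows_chained(store, page_size=100):
    if page_size < 2:
        # With a page size of 1, each page would only repeat the start key
        # and the loop would never advance.
        raise ValueError("page_size must be at least 2")
    total = 0
    start_key = None
    while True:
        page = range_slice(store, start_key, page_size)
        # Every page after the first begins with the row we already counted.
        new_rows = page if start_key is None else page[1:]
        total += len(new_rows)
        if len(page) < page_size:
            # A short page means we have reached the end of the range.
            break
        # Chain: the last key of this page is the start of the next.
        start_key = page[-1]
    return total
```

Two details matter for the count in this sketch: skipping the repeated first row on each chained page, and stopping only when a page comes back short rather than when it contains no new rows.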

We are confident that we are doing everything right in both cases (no bugs), yet the results are baffling. Tests in smaller, single-node environments result in consistent counts between the two methods, but there we have neither the same amount of data nor the same topology.


Is the right answer 29M or 10M? Any clues to what we're seeing?

Todd
