" We are confident that we are doing everything right in both cases (no
bugs), yet the results are baffling. Tests in smaller, single-node
environments result in consistent counts between the two methods, but we
don't have the same amount of data nor the same topology. "

Are you perhaps using a different consistency level in the two cases?

Best regards,

Robin Verlangen
*Software engineer*
W http://www.robinverlangen.nl
E ro...@us2.nl

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.



2012/9/15 Jeremy Hanna <jeremy.hanna1...@gmail.com>

> Are there any deletions in your data?  The Hadoop support doesn't filter
> out tombstones, though you may not be filtering them out in your code
> either.  I've used the hadoop support for doing a lot of data validation in
> the past and as long as you're sure that the code is sound, I'm pretty
> confident in it.
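
[Editor's note: a minimal sketch of the tombstone point above. A deleted row can come back from a range scan as a "range ghost": the key is returned but it carries no live columns. A counter that counts every key it sees (as the Hadoop job does) includes these ghosts, while iteration code that skips empty rows does not. The function and variable names below are illustrative, not a real client API.]

```python
def count_all_rows(rows):
    """Count every key returned, range ghosts included (Hadoop-style)."""
    return sum(1 for _key, _columns in rows)

def count_live_rows(rows):
    """Count only rows that still have live columns (ghosts skipped)."""
    return sum(1 for _key, columns in rows if columns)

# Simulated scan result: rows "b" and "d" were deleted and come back empty.
scan = [("a", ["c1"]), ("b", []), ("c", ["c1", "c2"]), ("d", [])]
print(count_all_rows(scan))   # 4: ghosts counted
print(count_live_rows(scan))  # 2: ghosts skipped
```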
>
> On Sep 14, 2012, at 10:07 PM, Todd Fast <t...@conga.com> wrote:
>
> > Hi--
> >
> > We are iterating rows in a column family two different ways and are
> seeing radically different row counts. We are using 1.0.8 and
> RandomPartitioner on a 3-node cluster.
> >
> > In the first case, we have a trivial Hadoop job that counts 29M rows
> using the standard MR pattern for counting (mapper outputs a single key
> with a value of 1, reducer adds up all the values).
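
[Editor's note: the counting pattern described above, sketched in plain Python standing in for the actual Hadoop job. Each mapper emits a single constant key with value 1 per row, and the reducer sums the values; the names here are illustrative.]

```python
from itertools import groupby

def mapper(row_key):
    # One output record per input row: (constant key, 1)
    yield ("count", 1)

def reducer(key, values):
    yield (key, sum(values))

def run_job(rows):
    # Map phase
    mapped = [kv for row in rows for kv in mapper(row)]
    # Shuffle phase: group records by key (all share the key "count" here)
    mapped.sort(key=lambda kv: kv[0])
    # Reduce phase
    out = {}
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        for k, total in reducer(key, (v for _, v in group)):
            out[k] = total
    return out

print(run_job(["row%d" % i for i in range(5)]))  # {'count': 5}
```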
> >
> > In the second case, we have a simple Quartz batch job which counts only
> 10M rows. We are iterating using chained calls to get_range_slices, as
> described on the wiki: http://wiki.apache.org/cassandra/FAQ#iter_world We've
> also implemented the batch job using Pelops, with and without
> chaining. In all cases, the job counts just 10M rows, and it is not
> encountering any errors.
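
[Editor's note: a sketch of the FAQ#iter_world chaining pattern referenced above. Each call starts at the last key the previous page returned, and that duplicated boundary row is dropped. `fetch_page` is a stand-in for a real range-slice call against the cluster, not an actual client API.]

```python
KEYS = ["k%02d" % i for i in range(10)]  # pretend these are in token order

def fetch_page(start_key, count):
    # Inclusive range scan beginning at start_key ("" means from the start)
    i = KEYS.index(start_key) if start_key else 0
    return KEYS[i:i + count]

def iterate_all(page_size=4):
    seen = []
    start = ""
    while True:
        page = fetch_page(start, page_size)
        if start:
            page = page[1:]  # drop the boundary row repeated from last page
        if not page:
            break
        seen.extend(page)
        start = page[-1]
    return seen

print(len(iterate_all()))  # 10, each key counted exactly once
```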
> >
> > We are confident that we are doing everything right in both cases (no
> bugs), yet the results are baffling. Tests in smaller, single-node
> environments result in consistent counts between the two methods, but we
> don't have the same amount of data nor the same topology.
> >
> > Is the right answer 29M or 10M? Any clues to what we're seeing?
> >
> > Todd
>
>
