Hi Robert,

First of all, thanks for answering.


2014-07-21 20:18 GMT-03:00 Robert Coli <rc...@eventbrite.com>:

> You're wrong, unless you're talking about insertion into a memtable, which
> you probably aren't and which probably doesn't actually work that way
> enough to be meaningful.
>
> On disk, Cassandra has immutable datafiles, from which row fragments are
> merged into a row at read time. I'm pretty sure the rest of the stuff you
> said doesn't make any sense in light of this?
>

Although several sstables (disk fragments) may have the same row key,
inside a single sstable row keys and column keys are indexed, right?
Otherwise, doing a GET in Cassandra would require scanning each sstable and
would be far slower than it is.
From the M/R perspective, I was referring to the memtable, as I am trying
to compare the time to insert into Cassandra against the time spent sorting
in Hadoop.

To make it more clear: Hadoop has its own partitioner, which is used after
the map phase. The map output is written locally on each Hadoop node, then
it is shuffled from one node to another (see slide 17 in this
presentation: http://pt.slideshare.net/j_singh/nosql-and-mapreduce). In
other words, you may read Cassandra data in Hadoop, but the intermediate
results are still stored in HDFS or on local disk.
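
Just to illustrate where that routing decision happens, here is a minimal
sketch of a Hadoop Partitioner (the class name and key/value types are only
for illustration, not something from an actual job):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hadoop calls getPartition() on every map output record to decide
    // which reducer (and therefore which node) will receive it. The record
    // is spilled to the mapper's local disk first and only shuffled to
    // that reducer over the network later.
    public class ExamplePartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }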

Instead of using the Hadoop partitioner, I would like to store the
intermediate results in a Cassandra CF, so the map output would go directly
to an intermediate column family via batch inserts, instead of being written
to a local disk first and then shuffled to the right node (see the sketch
below).
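
A rough sketch of what that mapper-side write could look like with the
DataStax Java driver; the keyspace, table, column names and batching
threshold are all assumptions I made up for the example, not an existing
scheme:

    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    // Hypothetical intermediate CF:
    //   CREATE TABLE shuffle.map_output (
    //     job_key text, map_key text, map_value blob,
    //     PRIMARY KEY (job_key, map_key));
    public class IntermediateWriter {
        private final Session session;
        private final PreparedStatement insert;
        private final BatchStatement batch =
            new BatchStatement(BatchStatement.Type.UNLOGGED);

        public IntermediateWriter(String contactPoint) {
            Cluster cluster = Cluster.builder().addContactPoint(contactPoint).build();
            this.session = cluster.connect("shuffle");
            this.insert = session.prepare(
                "INSERT INTO map_output (job_key, map_key, map_value) VALUES (?, ?, ?)");
        }

        // Called from the mapper instead of context.write(); the row lands in
        // a memtable on the node that owns the partition, like any other write.
        public void write(String jobKey, String mapKey, java.nio.ByteBuffer value) {
            batch.add(insert.bind(jobKey, mapKey, value));
            if (batch.size() >= 100) {   // flush in small batches
                session.execute(batch);
                batch.clear();
            }
        }

        public void close() {
            if (batch.size() > 0) session.execute(batch);
            session.getCluster().close();
        }
    }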

Therefore, the mapper would write its output the same way all data enters
Cassandra: first into a memtable, then flushed to an sstable, and finally
read during the reduce phase.

Shouldn't it be faster than storing intermediate results in HDFS?

Best regards,
Marcelo.
