Hi Robert,

First of all, thanks for answering.

2014-07-21 20:18 GMT-03:00 Robert Coli <rc...@eventbrite.com>:

> You're wrong, unless you're talking about insertion into a memtable, which
> you probably aren't and which probably doesn't actually work that way
> enough to be meaningful.
>
> On disk, Cassandra has immutable datafiles, from which row fragments are
> merged into a row at read time. I'm pretty sure the rest of the stuff you
> said doesn't make any sense in light of this?

Although several sstables (disk fragments) may contain the same row key, inside a single sstable the row keys and column keys are indexed, right? Otherwise, doing a GET in Cassandra would take some time.

From the M/R perspective, I was referring to the memtable, as I am trying to compare the time to insert into Cassandra against the time to sort in Hadoop.

To make it clearer: Hadoop has its own partitioner, which is used after the map phase. The map output is written locally on each Hadoop node, then it is shuffled from one node to another (see slide 17 in this presentation: http://pt.slideshare.net/j_singh/nosql-and-mapreduce). In other words, you may read Cassandra data in Hadoop, but the intermediate results are still stored in HDFS.

Instead of using the Hadoop partitioner, I would like to store the intermediate results in a Cassandra CF, so the map output would go directly to an intermediate column family via batch inserts, instead of being written to a local disk first and then shuffled to the right node. The mapper would therefore write its output the same way all data enters Cassandra: first into a memtable, then flushed to an sstable, then read during the reduce phase.

Shouldn't that be faster than storing intermediate results in HDFS?

Best regards,
Marcelo.
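
P.S. To make the idea more concrete, here is a rough sketch in plain Python of the write path I have in mind. It is only a toy model, not real Cassandra or Hadoop code: the names ToyColumnFamily, flush_threshold, map_words and reduce_count are all made up for illustration, and the flush threshold is an arbitrary assumption. The mapper inserts each output cell under the reduce key as the row key, the memtable is sorted once when it is flushed to an immutable "sstable", and the reducer merges the row fragments at read time.

```python
from collections import defaultdict


class ToyColumnFamily:
    """Toy model of a memtable/sstable write path (not real Cassandra code)."""

    def __init__(self, flush_threshold=4):
        # Hypothetical knob: number of cells held in memory before a flush.
        self.flush_threshold = flush_threshold
        self.memtable = defaultdict(dict)  # row key -> {column: value}
        self.cells_in_memtable = 0
        self.sstables = []                 # immutable, sorted snapshots

    def insert(self, row_key, column, value):
        row = self.memtable[row_key]
        if column not in row:
            self.cells_in_memtable += 1
        row[column] = value
        if self.cells_in_memtable >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        # Rows and columns are sorted once, when the memtable is flushed.
        sstable = tuple(
            (row_key, tuple(sorted(cols.items())))
            for row_key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        self.memtable = defaultdict(dict)
        self.cells_in_memtable = 0

    def read_row(self, row_key):
        # Merge row fragments from every sstable (oldest first), then
        # overlay whatever is still sitting in the memtable.
        merged = {}
        for sstable in self.sstables:
            for key, cols in sstable:
                if key == row_key:
                    merged.update(cols)
        merged.update(self.memtable.get(row_key, {}))
        return sorted(merged.items())


def map_words(cf, doc_id, text):
    # Map phase: the reduce key (the word) is the row key, and a unique
    # (doc_id, position) column keeps separate emissions from colliding.
    for position, word in enumerate(text.split()):
        cf.insert(word, (doc_id, position), 1)


def reduce_count(cf, word):
    # Reduce phase: read the merged row and aggregate its values.
    return sum(value for _, value in cf.read_row(word))


if __name__ == "__main__":
    cf = ToyColumnFamily()
    map_words(cf, 0, "a b a")
    map_words(cf, 1, "b a c")
    for word in ("a", "b", "c"):
        print(word, reduce_count(cf, word))
```

In this toy version the "shuffle" is simply the choice of row key, and the sorting cost is paid at flush time rather than in a separate sort step. Whether that trade-off actually beats the Hadoop shuffle on a real cluster is exactly what I am asking about; the sketch says nothing about network or compaction costs.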