On 8/19/10 11:14 AM, Mark wrote:
On 8/19/10 10:23 AM, Jeremy Hanna wrote:
I would check out http://wiki.apache.org/cassandra/HadoopSupport for
more info. I'll try to explain a bit more here, but I don't think
there's a tutorial out there yet.
For input:
- configure the main class where you start the MapReduce job the same
way word_count does (either via storage-conf.xml or in code via the
ConfigHelper) - see the driver sketch after this list. It will complain
specifically about anything you haven't configured - especially
important are your Cassandra server and port.
- the inputs to your mapper are what comes from Cassandra - the row
key plus a map of that row's columns
- you need to grab your column name in your mapper's overridden setup
method (also shown in the sketch below)
- for the reducer, nothing really changes from a normal map/reduce
job unless you want to output to Cassandra
- generally Cassandra just provides an InputFormat and split classes
for reading from Cassandra - you can find the guts in the
org.apache.cassandra.hadoop package
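To make that concrete, here's a rough driver + mapper sketch modeled
on contrib/word_count. This is against the 0.6-era API (ConfigHelper
method names and the mapper's input types may differ in later
versions), and the keyspace, column family, and column names are just
placeholders:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyJob
    {
        private static final String KEYSPACE = "Keyspace1";      // placeholder
        private static final String COLUMN_FAMILY = "Standard1"; // placeholder
        private static final String CONF_COLUMN_NAME = "column_name";

        public static class MyMapper
            extends Mapper<String, SortedMap<byte[], IColumn>, Text, IntWritable>
        {
            private byte[] columnName;

            @Override
            protected void setup(Context context)
            {
                // grab the column name from the job config once, before any map() calls
                columnName = context.getConfiguration().get(CONF_COLUMN_NAME).getBytes();
            }

            @Override
            public void map(String key, SortedMap<byte[], IColumn> columns, Context context)
                throws IOException, InterruptedException
            {
                // key is the row key; columns is that row's data from Cassandra
                IColumn column = columns.get(columnName);
                if (column == null)
                    return;
                context.write(new Text(new String(column.value())), new IntWritable(1));
            }
        }

        public static void main(String[] args) throws Exception
        {
            Configuration conf = new Configuration();
            conf.set(CONF_COLUMN_NAME, "text"); // the column each mapper should read
            Job job = new Job(conf, "myjob");
            job.setJarByClass(MyJob.class);
            job.setMapperClass(MyMapper.class);
            // ... reducer, output key/value classes, output path as usual ...

            // the Cassandra-specific part: what to scan and which columns to pull
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            ConfigHelper.setColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
            SlicePredicate predicate = new SlicePredicate()
                .setColumn_names(Arrays.asList("text".getBytes()));
            ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
            // in 0.6 the server/port come from storage-conf.xml on the classpath;
            // it complains loudly at startup about anything that's missing

            job.waitForCompletion(true);
        }
    }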
For output:
- in your reducer, you can just write to Cassandra directly via
Thrift (sketched below). There is a built-in OutputFormat coming in
0.7, though it might still change before 0.7 final - it will queue up
mutations and write large batches all at once.
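For example, a reducer that opens its own Thrift connection and
inserts results directly could look something like this - a sketch
against the 0.6 Thrift API (the insert signature changes in 0.7),
with host, port, keyspace, and column family hard-coded as
placeholders:

    import java.io.IOException;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnPath;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        private TTransport transport;
        private Cassandra.Client client;

        @Override
        protected void setup(Context context) throws IOException
        {
            // open one connection per reduce task, not one per reduce() call
            transport = new TSocket("cassandra-host", 9160); // placeholders
            try
            {
                transport.open();
            }
            catch (Exception e)
            {
                throw new IOException(e);
            }
            client = new Cassandra.Client(new TBinaryProtocol(transport));
        }

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException
        {
            int sum = 0;
            for (IntWritable v : values)
                sum += v.get();
            try
            {
                // 0.6 API: insert(keyspace, key, path, value, timestamp, consistency)
                ColumnPath path = new ColumnPath("Standard1").setColumn("count".getBytes());
                client.insert("Keyspace1", key.toString(), path,
                              String.valueOf(sum).getBytes(),
                              System.currentTimeMillis(), ConsistencyLevel.ONE);
            }
            catch (Exception e)
            {
                throw new IOException(e);
            }
        }

        @Override
        protected void cleanup(Context context)
        {
            transport.close();
        }
    }

One insert per key is the simple version; batch_mutate would cut down
on round trips, which is essentially what the 0.7 OutputFormat will
do for you.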
On Aug 19, 2010, at 12:07 PM, Mark wrote:
Are there any examples/tutorials on the web for reading from and
writing to Cassandra with Hadoop?
I found the example in contrib/word_count but I really can't make
sense of it... a tutorial/explanation would help.
Thanks!
How does batching across all rows work? Does it just take an arbitrary
start key with a limit of x and then use the last key from that result
as the next start? Does that work with RandomPartitioner?
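In other words, something like this loop? (A rough sketch against the
0.6 get_range_slices Thrift API - the keyspace, column family, and
batch size of 1000 are arbitrary placeholders.)

    import java.util.List;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.KeyRange;
    import org.apache.cassandra.thrift.KeySlice;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;

    public class AllRowsSketch
    {
        static void scanAllRows(Cassandra.Client client) throws Exception
        {
            // pull up to 100 columns per row
            SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(new byte[0], new byte[0], false, 100));
            String start = ""; // empty start key = beginning of the range
            while (true)
            {
                KeyRange range = new KeyRange(1000).setStart_key(start).setEnd_key("");
                List<KeySlice> batch = client.get_range_slices(
                    "Keyspace1", new ColumnParent("Standard1"), predicate, range,
                    ConsistencyLevel.ONE);
                for (KeySlice row : batch)
                {
                    // the last key of the previous batch comes back again as the
                    // first row of this one, so skip it
                    if (row.getKey().equals(start))
                        continue;
                    // ... process row.getColumns() ...
                }
                if (batch.size() < 1000)
                    break; // short batch means we've hit the end of the range
                start = batch.get(batch.size() - 1).getKey();
            }
        }
    }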