If you want distributed machine learning, you can use either Mahout (which runs on Hadoop) or Spark (MLlib). If you choose the Hadoop route, DataStax provides the Cassandra File System (CFS), an HDFS-compatible layer that lets Hadoop jobs work with data stored in Cassandra. Otherwise you can try the Cassandra Hadoop InputFormat (not as simple, but plenty of people use it); a minimal configuration sketch is below.
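Here is a rough sketch of wiring a Hadoop job to the CQL InputFormat, assuming roughly Cassandra 2.x class names in the org.apache.cassandra.hadoop packages; the seed host, keyspace, and table names are hypothetical placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.cassandra.hadoop.ConfigHelper
import org.apache.cassandra.hadoop.cql3.{CqlConfigHelper, CqlInputFormat}

object CassandraJobSetup {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "ml-over-cassandra")
    val conf = job.getConfiguration
    // Tell the InputFormat where the cluster lives and how it partitions data.
    ConfigHelper.setInputInitialAddress(conf, "127.0.0.1") // hypothetical seed node
    ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner")
    ConfigHelper.setInputColumnFamily(conf, "ml_keyspace", "training_data") // hypothetical
    CqlConfigHelper.setInputCQLPageRowSize(conf, "1000") // rows fetched per page
    job.setInputFormatClass(classOf[CqlInputFormat])
    // Mapper/reducer classes and output configuration would follow,
    // as in any other Hadoop job.
  }
}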
A quick search for “map reduce cassandra” on this list brings up a recent conversation: http://mail-archives.apache.org/mod_mbox/cassandra-user/201407.mbox/%3CCAAX2xq6UhsGfq_gtfjogOV7%3DMi8q%3D5SmRfNM1%2BKFEXXVk%2Bp8iw%40mail.gmail.com%3E

If you prefer to use Spark, you can try the DataStax Spark Cassandra connector: https://github.com/datastax/spark-cassandra-connector. It lets you run Spark jobs that read data from and write data back to Cassandra.

Cheers,
James

Web: http://ferry.opencore.io
Twitter: @open_core_io

On Aug 30, 2014, at 4:02 PM, Adaryl Bob Wakefield, MBA <adaryl.wakefi...@hotmail.com> wrote:

> Yes, I remember this conversation. That was when I was just first stepping into this stuff. My current understanding is:
>
> Storm = stream and micro-batch
> Spark = batch and micro-batch
>
> Micro-batching is what gets you to exactly-once processing semantics. I’m clear on that. What I’m not clear on is how and where processing takes place.
>
> I also get the fact that Spark is a faster execution engine than MapReduce. But we have Tez now... except, as far as I know, that’s not useful here because my data isn’t in HDFS. People seem to be talking quite a bit about Mahout and the Spark shell, but I’d really like to get this done with a minimum amount of software; either Storm or Spark, but not both.
>
> Trident-ML isn’t distributed, which is fine because I’m not trying to do learning on the stream. For now, I’m just trying to do learning in batch and then update parameters as suggested earlier.
>
> Let me simplify the question: how do I do distributed machine learning when my data is in Cassandra and not HDFS? I haven’t totally explored Mahout yet, but a lot of its algorithms run on MapReduce, which is fine for now. As I understand it, though, MapReduce works on data in HDFS, correct?
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
> From: Shahab Yunus
> Sent: Saturday, August 30, 2014 11:23 AM
> To: user@cassandra.apache.org
> Subject: Re: Machine Learning With Cassandra
>
> Spark is not storage; rather, it is a distributed processing framework meant to run on big data in a distributed architecture (a very high-level intro/definition). It provides batched, in-memory, map/reduce-like jobs. It is not purely streaming like Storm; instead it batches collections of tuples, and thus you can run complex ML algorithms relatively quickly.
>
> I think we discussed this a short while ago, when a similar question (Storm vs. Spark, I think) was raised by you. Here is the link for that discussion: http://markmail.org/message/lc4icuw4hobul6oh
>
> Regards,
> Shahab
>
> On Sat, Aug 30, 2014 at 12:16 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefi...@hotmail.com> wrote:
>
> Isn’t it a bit overkill to use both Storm and Spark in the architecture? You say load it “into” Spark. Is Spark separate storage?
>
> B.
>
> From: Alex Kamil
> Sent: Friday, August 29, 2014 10:46 PM
> To: user@cassandra.apache.org
> Subject: Re: Machine Learning With Cassandra
>
> Adaryl,
>
> Most ML algorithms are based on some form of numerical optimization, using something like online gradient descent or conjugate gradient (e.g., in SVM classifiers). In its simplest form, this is a nested FOR loop: on each iteration you update the weights or parameters of the model until you reach some convergence threshold that minimizes the prediction error (usually the goal is to minimize a loss function, as in the popular least-squares technique). You could parallelize this loop with a brute-force divide-and-conquer approach: map a chunk of data to each node, compute a partial sum there, aggregate the results from each node into a global sum in a ‘reduce’ stage, and repeat the map-reduce cycle until convergence. You can look up distributed gradient descent, or check out Mahout or Spark MLlib for examples. Alternatively, you can use something like GraphLab.
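A minimal Scala/Spark sketch of the brute-force loop Alex describes, for a least-squares linear model; the toy data, learning rate, and tolerance are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object DistributedGradientDescent {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("distributed-gd").setMaster("local[*]"))

    // Hypothetical training set: (features, label) rows for a linear model.
    val data = sc.parallelize(Seq(
      (Array(1.0, 2.0), 5.0),
      (Array(2.0, 1.0), 4.0),
      (Array(3.0, 3.0), 9.0))).cache()

    var weights = Array(0.0, 0.0)
    val learningRate = 0.01
    val tolerance = 1e-6
    var change = Double.MaxValue
    var iter = 0

    // The "nested FOR loop", parallelized: each iteration computes partial
    // gradients on every partition (map) and sums them on the driver (reduce).
    while (change > tolerance && iter < 10000) {
      val w = weights // stable local copy captured by the closure
      val (gradSum, n) = data.map { case (x, y) =>
        val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
        (x.map(_ * err), 1L) // partial gradient for one example, plus a count
      }.reduce { case ((g1, n1), (g2, n2)) =>
        (g1.zip(g2).map { case (a, b) => a + b }, n1 + n2) // global sum
      }
      val updated = weights.zip(gradSum).map { case (wi, gi) =>
        wi - learningRate * gi / n
      }
      change = updated.zip(weights).map { case (a, b) => math.abs(a - b) }.max
      weights = updated
      iter += 1
    }

    println(s"converged after $iter iterations; weights = ${weights.mkString(", ")}")
    sc.stop()
  }
}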
> Cassandra can serve as the data store from which you load the training data, e.g., into Spark using this connector, and then train the model using MLlib or Mahout (it has Spark bindings, I believe). Once you have trained the model, you can save its parameters back into Cassandra. The next stage is using the model to classify new data, e.g., recommending similar items based on a log of new purchases; there you could once again use Spark, or Storm with something like this.
>
> Alex
>
> On Fri, Aug 29, 2014 at 10:24 PM, Adaryl "Bob" Wakefield, MBA <adaryl.wakefi...@hotmail.com> wrote:
>
> I’m planning to speak at a local meet-up and I need to know if what I have in my head is even possible.
>
> I want to give an example of working with data in Cassandra. I have data coming in through Kafka and Storm, and I’m saving it off to Cassandra (this is only on paper at this point). I then want to run an ML algorithm over the data. My problem is that while my data is distributed, I don’t know how to do the analysis in a distributed manner. I could certainly use R, but processing the data on a single machine would seem to defeat the purpose of all this scalability.
>
> What is my solution?
>
> B.
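To make the end-to-end flow concrete, a minimal sketch of loading training data from Cassandra into Spark with the DataStax connector and fitting a model with MLlib. The keyspace, table, and column layout are hypothetical, and the calls assume the 2014-era connector and MLlib 1.x APIs:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object TrainFromCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("train-from-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical seed node
    val sc = new SparkContext(conf)

    // Load rows from a hypothetical table:
    //   CREATE TABLE ml_keyspace.training_data (
    //     id uuid PRIMARY KEY, label double, f1 double, f2 double);
    val training = sc.cassandraTable("ml_keyspace", "training_data")
      .map(row => LabeledPoint(
        row.getDouble("label"),
        Vectors.dense(row.getDouble("f1"), row.getDouble("f2"))))
      .cache()

    // Fit a linear model; SGD runs as distributed passes over the RDD.
    val model = LinearRegressionWithSGD.train(training, numIterations = 100)

    // Save the learned parameters back to a hypothetical table:
    //   CREATE TABLE ml_keyspace.model_params (id text PRIMARY KEY, weights list<double>);
    sc.parallelize(Seq(("latest", model.weights.toArray.toSeq)))
      .saveToCassandra("ml_keyspace", "model_params", SomeColumns("id", "weights"))

    sc.stop()
  }
}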