If you want distributed machine learning, you can use either Mahout (which 
runs on Hadoop) or Spark (MLlib). If you go the Hadoop route, DataStax 
provides an HDFS-compatible filesystem (CFS) for working with data stored in 
Cassandra. Otherwise you can try the Cassandra InputFormat (not as simple, 
but plenty of people use it). 
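
For the InputFormat route, the setup looks roughly like the word_count 
example that ships with the Cassandra source. A minimal sketch (the 
keyspace/table names and addresses are made up, and the exact classes vary 
by Cassandra version):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job
    import org.apache.cassandra.hadoop.{ColumnFamilyInputFormat, ConfigHelper}

    // Configure a Hadoop job to read its input splits from Cassandra
    // instead of HDFS. "demo"/"events" are hypothetical names.
    val job = Job.getInstance(new Configuration(), "cassandra-input-example")
    job.setInputFormatClass(classOf[ColumnFamilyInputFormat])
    ConfigHelper.setInputInitialAddress(job.getConfiguration, "127.0.0.1")
    ConfigHelper.setInputRpcPort(job.getConfiguration, "9160")
    ConfigHelper.setInputPartitioner(job.getConfiguration, "Murmur3Partitioner")
    ConfigHelper.setInputColumnFamily(job.getConfiguration, "demo", "events")
    // ...then set your Mapper/Reducer classes as in any other Hadoop job.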

A quick search for “map reduce cassandra” on this list brings up a recent 
conversation: 
http://mail-archives.apache.org/mod_mbox/cassandra-user/201407.mbox/%3CCAAX2xq6UhsGfq_gtfjogOV7%3DMi8q%3D5SmRfNM1%2BKFEXXVk%2Bp8iw%40mail.gmail.com%3E
 

If you prefer to use Spark, you can try the DataStax Cassandra connector: 
https://github.com/datastax/spark-cassandra-connector. This should let you run 
Spark jobs that read data from and write data to Cassandra. 
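
For what it's worth, the basic read/write API looks something like this 
(a sketch with made-up keyspace and table names):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._  // adds cassandraTable/saveToCassandra

    val conf = new SparkConf()
      .setAppName("spark-cassandra-example")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read a table into an RDD, then write an RDD back out.
    val rows = sc.cassandraTable("demo", "events")
    println(rows.count())
    sc.parallelize(Seq(("k1", 42)))
      .saveToCassandra("demo", "counts", SomeColumns("key", "value"))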

Cheers, 
James

Web: http://ferry.opencore.io
Twitter: @open_core_io

On Aug 30, 2014, at 4:02 PM, Adaryl Bob Wakefield, MBA 
<adaryl.wakefi...@hotmail.com> wrote:

> Yes, I remember this conversation. That was when I was first stepping into 
> this stuff. My current understanding is:
> Storm = Stream and micro-batch
> Spark = Batch and micro-batch
>  
> Micro-batching is what gets you exactly-once processing semantics. I’m 
> clear on that. What I’m not clear on is how and where the processing takes 
> place.
>  
> I also get that Spark is a faster execution engine than MapReduce. But we 
> have Tez now… except, as far as I know, that’s not useful here because my 
> data isn’t in HDFS. People seem to be talking quite a bit about Mahout and 
> the Spark shell, but I’d really like to get this done with a minimum amount 
> of software: either Storm or Spark, but not both. 
>  
> Trident-ML isn’t distributed, which is fine because I’m not trying to do 
> learning on the stream. For now, I’m just trying to do learning in batch and 
> then update parameters as suggested earlier.
>  
> Let me simplify the question. How do I do distributed machine learning when 
> my data is in Cassandra and not HDFS? I haven’t fully explored Mahout yet, 
> but a lot of its algorithms run on MapReduce, which is fine for now. As I 
> understand it, though, MapReduce works on data in HDFS, correct?
>  
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics
> 913.938.6685
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>  
> From: Shahab Yunus
> Sent: Saturday, August 30, 2014 11:23 AM
> To: user@cassandra.apache.org
> Subject: Re: Machine Learning With Cassandra
>  
> Spark is not storage; rather, it is a distributed processing framework meant 
> to run on big data (a very high-level intro/definition). It provides a 
> batched version of in-memory map/reduce-style jobs. It is not purely 
> streaming like Storm; instead it collects tuples into micro-batches, and 
> thus you can run complex ML algorithms relatively fast. 
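>
> To make the micro-batching concrete, a Spark Streaming job groups incoming 
> tuples into small time-sliced RDDs. A sketch (the socket source, host, and 
> port are just for illustration):
>
>     import org.apache.spark.SparkConf
>     import org.apache.spark.streaming.{Seconds, StreamingContext}
>
>     // Each 5-second micro-batch becomes an RDD processed as a small job,
>     // unlike Storm's tuple-at-a-time model.
>     val conf = new SparkConf().setAppName("micro-batch-example")
>     val ssc = new StreamingContext(conf, Seconds(5))
>     val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
>     words.map((_, 1)).reduceByKey(_ + _).print()
>     ssc.start()
>     ssc.awaitTermination()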
>  
> I think we discussed this a short while ago, when you raised a similar 
> question (Storm vs. Spark, I think). Here is the link to that 
> discussion:
> http://markmail.org/message/lc4icuw4hobul6oh
>  
>  
> Regards,
> Shahab
> 
> 
> On Sat, Aug 30, 2014 at 12:16 PM, Adaryl "Bob" Wakefield, MBA 
> <adaryl.wakefi...@hotmail.com> wrote:
> Isn’t it a bit overkill to use both Storm and Spark in the architecture? You 
> say load it “into” Spark. Is Spark separate storage?
>  
> B.
>  
> From: Alex Kamil
> Sent: Friday, August 29, 2014 10:46 PM
> To: user@cassandra.apache.org
> Subject: Re: Machine Learning With Cassandra
>  
> Adaryl,
>  
> most ML algorithms are based on some form of numerical optimization, using 
> something like online gradient descent or conjugate gradient (e.g. in SVM 
> classifiers). In its simplest form it is a nested FOR loop where on each 
> iteration you update the weights or parameters of the model until reaching 
> some convergence threshold that minimizes the prediction error (usually the 
> goal is to minimize a loss function, as in the popular least-squares 
> technique). You could parallelize this loop using a brute-force 
> divide-and-conquer approach: map a chunk of data to each node, compute a 
> partial sum there, then aggregate the results from each node into a global 
> sum in a 'reduce' stage, and repeat this map-reduce cycle until convergence 
> (see the sketch below). You can look up distributed gradient descent or 
> check out Mahout or Spark MLlib for examples. Alternatively you can use 
> something like GraphLab.
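>
> As a rough sketch of that map-reduce cycle in Spark (batch gradient descent 
> for least squares, where the gradient decomposes into per-record terms 
> (w·x − y)·x so each node can sum its own chunk; all names are made up):
>
>     import org.apache.spark.rdd.RDD
>
>     // One iteration: "map" computes per-record gradient terms,
>     // "reduce" aggregates them into a global sum on the driver.
>     def step(data: RDD[(Array[Double], Double)],
>              w: Array[Double], lr: Double): Array[Double] = {
>       val n = data.count().toDouble
>       val grad = data
>         .map { case (x, y) =>
>           val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
>           x.map(_ * err)
>         }
>         .reduce { (a, b) => a.zip(b).map { case (ai, bi) => ai + bi } }
>       w.zip(grad).map { case (wi, gi) => wi - lr * gi / n }
>     }
>
>     // Repeat until convergence, e.g.:
>     //   var w = Array.fill(numFeatures)(0.0)
>     //   for (_ <- 1 to 100) w = step(trainingData, w, 0.1)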
>  
> Cassandra can serve as a data store from which you load the training data, 
> e.g. into Spark using this connector, and then train the model using MLlib or 
> Mahout (it has Spark bindings, I believe). Once you’ve trained the model, you 
> could save the parameters back in Cassandra. The next stage is using the 
> model to classify new data, e.g. recommending similar items based on a log of 
> new purchases; there you could once again use Spark, or Storm with something 
> like this.
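>
> End to end, that pipeline could look roughly like the sketch below, assuming 
> a hypothetical table with a "label" column and a list<double> "features" 
> column, with MLlib's SGD-based linear regression standing in for whatever 
> model you actually pick:
>
>     import com.datastax.spark.connector._
>     import org.apache.spark.mllib.linalg.Vectors
>     import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
>
>     // sc is an existing SparkContext configured with
>     // spark.cassandra.connection.host.
>     val training = sc.cassandraTable("demo", "training").map { row =>
>       LabeledPoint(row.getDouble("label"),
>                    Vectors.dense(row.getList[Double]("features").toArray))
>     }.cache()
>
>     // Train in parallel across the cluster.
>     val model = LinearRegressionWithSGD.train(training, 100)
>
>     // Persist the learned weights back into Cassandra for the scoring stage.
>     sc.parallelize(Seq(("model_v1", model.weights.toArray.toSeq)))
>       .saveToCassandra("demo", "models", SomeColumns("id", "weights"))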
>  
> Alex
>  
>  
> 
> 
> On Fri, Aug 29, 2014 at 10:24 PM, Adaryl "Bob" Wakefield, MBA 
> <adaryl.wakefi...@hotmail.com> wrote:
> I’m planning to speak at a local meet-up and I need to know if what I have in 
> my head is even possible.
>  
> I want to give an example of working with data in Cassandra. I have data 
> coming in through Kafka and Storm and I’m saving it off to Cassandra (this is 
> only on paper at this point). I then want to run an ML algorithm over the 
> data. My problem is that, while my data is distributed, I don’t know how to 
> do the analysis in a distributed manner. I could certainly use R, but 
> processing the data on a single machine would seem to defeat the purpose of 
> all this scalability.
>  
> What is my solution?
> B.
>  
>  
