Johnny,

Currently, probably the easiest (and most performant) way to integrate Spark and Cassandra is the spark-cassandra-connector [1].
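For concreteness, here is a minimal sketch of what a write through the connector looks like. The keyspace, table, and column names are made up, and the exact column-selector signature varies a little between connector versions, so treat this as illustrative rather than copy-paste ready:

```scala
// Sketch: saving an RDD to Cassandra via the spark-cassandra-connector.
// Assumes the connector jar is on the classpath and a Cassandra node is
// reachable; "mykeyspace"/"mytable" and the columns are hypothetical.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object SaveToCassandraExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-save-example")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed host

    val sc = new SparkContext(conf)

    // An RDD of (id, value) tuples; each tuple becomes one row.
    val rdd = sc.parallelize(Seq((1, "alpha"), (2, "beta")))

    // Writes rows into mykeyspace.mytable, mapping tuple fields onto the
    // named columns. The table must already exist in CQL, e.g.:
    //   CREATE TABLE mykeyspace.mytable (id int PRIMARY KEY, value text);
    rdd.saveToCassandra("mykeyspace", "mytable", SomeColumns("id", "value"))

    sc.stop()
  }
}
```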
Given an RDD, saving it to Cassandra is as easy as:

    rdd.saveToCassandra(keyspace, table, Seq(columns))

We tried many 'hand-crafted' options to interact with Cassandra before, and this connector is the way to go. It is currently in the Scala realm, which may be a reason heavy enough to tilt your balance towards Scala. It also sounds like your current Python-based architecture needs a review, so the migration could give you the opportunity for a fresh redesign.

-kr, Gerard.

[1] https://github.com/datastax/spark-cassandra-connector

On Thu, Sep 4, 2014 at 5:03 PM, Johnny Kelsey <jkkel...@semblent.com> wrote:

> Hi guys,
>
> We're testing out a Spark/Cassandra cluster, & we're very impressed with
> what we've seen so far. However, I'd very much like some advice from the
> shiny brains on the mailing list.
>
> We have a large collection of Python code that we're in the process of
> adapting to move into Spark/Cassandra, & I have some misgivings about
> using Python for any further development.
>
> As a concrete example, we have a Python class (part of a fairly large
> class library) which, as part of its constructor, also creates a record of
> itself in the Cassandra keyspace. So we get an initialised class & a row
> in a table on the cluster. My problem is this: should we even be doing this?
>
> By this I mean, we could be facing an increasing number of transactions,
> which we (naturally) would like to process as quickly as possible. The
> input transactions themselves may well be routed to a number of processes,
> e.g. starting an agent, writing to a log file, etc. So it seems wrong to be
> putting the 'INSERT ... INTO ...' code into the class instantiation: it
> would seem more sensible to split this into a bunch of different Spark
> processes, with an input handler, database insertion, creation of the new
> Python object, and log update all happening on the Spark cluster, & all
> written as atomically as possible.
>
> But I think my reservations here are more fundamental.
> Is Python the wrong choice for this sort of thing? Would it not be better
> to use Scala? Shouldn't we be dividing these tasks into atomic processes
> which execute as rapidly as possible? What about streaming events to the
> cluster: wouldn't Python be a bottleneck here, rather than Scala with its
> more robust support for multithreading? Is streaming even supported in
> Python?
>
> What do people think?
>
> Best regards,
>
> Johnny
>
> --
> Johnny Kelsey
> Chief Technology Officer
> *Semblent*
> *jkkel...@semblent.com <jkkel...@semblent.com>*