Johnny,

Currently, probably the easiest (and most performant) way to integrate
Spark and Cassandra is the spark-cassandra-connector [1].

Given an RDD, saving it to Cassandra is as easy as:

rdd.saveToCassandra(keyspace, table, Seq(columns))

We tried many hand-crafted options for interacting with Cassandra before,
and this connector is the way to go.
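To put that one-liner in context, here is a minimal sketch. It assumes the
connector is on the classpath, a reachable Cassandra node at 127.0.0.1, and a
hypothetical keyspace/table ("test"."words" with columns word and count) that
you would create beforehand; adjust all of those to your own setup:

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Brings saveToCassandra and SomeColumns into scope
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("cassandra-save-example")
  // Contact point for your Cassandra cluster (assumption: local node)
  .set("spark.cassandra.connection.host", "127.0.0.1")

val sc = new SparkContext(conf)

// Any RDD whose element shape matches the target columns will do
val rdd = sc.parallelize(Seq(("spark", 1), ("cassandra", 2)))

// Writes each tuple as a row into test.words (word text, count int)
rdd.saveToCassandra("test", "words", SomeColumns("word", "count"))
```

This won't run standalone since it needs a live Spark context and Cassandra
node, so treat it as a sketch of the shape of the call rather than a
copy-paste recipe.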

This is currently in the Scala realm, which may be reason enough to tip
your balance toward Scala.
It also sounds like your current Python-based architecture needs a review,
so the migration could give you the opportunity for a fresh redesign.

-kr, Gerard.

[1] https://github.com/datastax/spark-cassandra-connector

On Thu, Sep 4, 2014 at 5:03 PM, Johnny Kelsey <jkkel...@semblent.com> wrote:

> Hi guys,
>
> We're testing out a spark/cassandra cluster, & we're very impressed with
> what we've seen so far. However, I'd very much like some advice from the
> shiny brains on the mailing list.
>
> We have a large collection of python code that we're in the process of
> adapting to move into spark/cassandra, & I have some misgivings on using
> python for any further development.
>
> As a concrete example, we have a python class (part of a fairly large
> class library) which, as part of its constructor, also creates a record of
> itself in the cassandra key space. So we get an initialised class & a row
> in a table on the cluster. My problem is this: should we even be doing this?
>
> By this I mean, we could be facing an increasing number of transactions,
> which we (naturally) would like to process as quickly as possible. The
> input transactions themselves may well be routed to a number of processes,
> e.g. starting an agent, written to a log file, etc. So it seems wrong to be
> putting the 'INSERT ... INTO ...' code into the class instantiation: it
> would seem more sensible to split this into a bunch of different spark
> processes, with an input handler, database insertion, create new python
> object, update log file, all happening on the spark cluster, & all written
> as atomically as possible.
>
> But I think my reservations here are more fundamental. Is python the wrong
> choice for this sort of thing? Would it not be better to use scala?
> Shouldn't we be dividing these tasks into atomic processes which execute as
> rapidly as possible? What about streaming events to the cluster, wouldn't
> python be a bottleneck here rather than scala with its more robust support
> for multithreading?  Is streaming even supported in python?
>
> What do people think?
>
> Best regards,
>
> Johnny
>
> --
> Johnny Kelsey
> Chief Technology Officer
> *Semblent*
> *jkkel...@semblent.com <jkkel...@semblent.com>*
>
