Johnny,

Without knowing the problem domain, it is hard to choose a programming
language. I would suggest you ask yourself the following questions:

- What if your project depends on a lot of Python libraries that don't
  have Scala/Java counterparts? It is unlikely but possible.
- What if Python programmers are in good supply and Scala ones not as much?
- Would you need to rewrite a lot of code, and is that feasible?
- Is the rest of your team willing to learn Scala?
- If you are processing streams in a long-lived process, how does Python
  perform?
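On the constructor question in your message, one option that does not depend on the language choice is to keep object creation free of side effects and make the insert an explicit, separate step. Below is a minimal sketch of that idea, not a definitive design. Every name in it (Transaction, RowSink, InMemorySink, persist) is hypothetical, and the in-memory sink is only a stand-in for whatever actually writes to your Cassandra key space.

```python
from dataclasses import dataclass, asdict
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Transaction:
    """Hypothetical domain object: constructing it has no side effects."""
    tx_id: str
    amount: float

    def to_row(self) -> Dict[str, object]:
        # Plain dict that a storage layer could map onto table columns.
        return asdict(self)


class RowSink(Protocol):
    """Assumed interface: anything that can persist a batch of rows."""

    def write(self, rows: List[Dict[str, object]]) -> None: ...


class InMemorySink:
    """Stand-in for the database insert step; it only appends to a list."""

    def __init__(self) -> None:
        self.rows: List[Dict[str, object]] = []

    def write(self, rows: List[Dict[str, object]]) -> None:
        self.rows.extend(rows)


def persist(transactions: List[Transaction], sink: RowSink) -> int:
    """Explicit persistence step; returns the number of rows handed to the sink."""
    rows = [t.to_row() for t in transactions]
    sink.write(rows)
    return len(rows)


if __name__ == "__main__":
    sink = InMemorySink()
    batch = [Transaction("t1", 10.0), Transaction("t2", 25.5)]
    print(persist(batch, sink), "rows written")
```

With this split, the objects can be created, tested, and passed between processing steps without touching the cluster, and the sink can later be swapped for a real writer without changing the class library.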
Mohit.

P.S.: I end up choosing Scala more often than Python.

On Thu, Sep 4, 2014 at 8:03 AM, Johnny Kelsey <jkkel...@semblent.com> wrote:

> Hi guys,
>
> We're testing out a spark/cassandra cluster, & we're very impressed with
> what we've seen so far. However, I'd very much like some advice from the
> shiny brains on the mailing list.
>
> We have a large collection of python code that we're in the process of
> adapting to move into spark/cassandra, & I have some misgivings on using
> python for any further development.
>
> As a concrete example, we have a python class (part of a fairly large
> class library) which, as part of its constructor, also creates a record of
> itself in the cassandra key space. So we get an initialised class & a row
> in a table on the cluster. My problem is this: should we even be doing this?
>
> By this I mean, we could be facing an increasing number of transactions,
> which we (naturally) would like to process as quickly as possible. The
> input transactions themselves may well be routed to a number of processes,
> e.g. starting an agent, written to a log file, etc. So it seems wrong to be
> putting the 'INSERT ... INTO ...' code into the class instantiation: it
> would seem more sensible to split this into a bunch of different spark
> processes, with an input handler, database insertion, create new python
> object, update log file, all happening on the spark cluster, & all written
> as atomically as possible.
>
> But I think my reservations here are more fundamental. Is python the wrong
> choice for this sort of thing? Would it not be better to use scala?
> Shouldn't we be dividing these tasks into atomic processes which execute as
> rapidly as possible? What about streaming events to the cluster, wouldn't
> python be a bottleneck here rather than scala with its more robust support
> for multithreading? Is streaming even supported in python?
>
> What do people think?
>
> Best regards,
>
> Johnny
>
> --
> Johnny Kelsey
> Chief Technology Officer
> *Semblent*
> *jkkel...@semblent.com <jkkel...@semblent.com>*