Johnny,
Without knowing the domain of the problem, it is hard to choose a
programming language. I would suggest you ask yourself the following
questions:
- What if your project depends on a lot of Python libraries that don't have
Scala/Java counterparts? It is unlikely but possible.
- What if Python programmers are in good supply and Scala ones not as much?
- Do you need to rewrite a lot of code, and is that feasible?
- Is the rest of your team willing to learn Scala?
- If you are processing streams in a long-lived process, how well does
Python perform?

Mohit.
P.S.: I end up choosing Scala more often than Python.


On Thu, Sep 4, 2014 at 8:03 AM, Johnny Kelsey <jkkel...@semblent.com> wrote:

> Hi guys,
>
> We're testing out a spark/cassandra cluster, & we're very impressed with
> what we've seen so far. However, I'd very much like some advice from the
> shiny brains on the mailing list.
>
> We have a large collection of python code that we're in the process of
> adapting to move into spark/cassandra, & I have some misgivings on using
> python for any further development.
>
> As a concrete example, we have a python class (part of a fairly large
> class library) which, as part of its constructor, also creates a record of
> itself in the cassandra key space. So we get an initialised class & a row
> in a table on the cluster. My problem is this: should we even be doing this?
>
> By this I mean, we could be facing an increasing number of transactions,
> which we (naturally) would like to process as quickly as possible. The
> input transactions themselves may well be routed to a number of processes,
> e.g. starting an agent, written to a log file, etc. So it seems wrong to be
> putting the 'INSERT ... INTO ...' code into the class instantiation: it
> would seem more sensible to split this into a bunch of different spark
> processes, with an input handler, database insertion, create new python
> object, update log file, all happening on the spark cluster, & all written
> as atomically as possible.
>
> But I think my reservations here are more fundamental. Is python the wrong
> choice for this sort of thing? Would it not be better to use scala?
> Shouldn't we be dividing these tasks into atomic processes which execute as
> rapidly as possible? What about streaming events to the cluster, wouldn't
> python be a bottleneck here rather than scala with its more robust support
> for multithreading?  Is streaming even supported in python?
>
> What do people think?
>
> Best regards,
>
> Johnny
>
> --
> Johnny Kelsey
> Chief Technology Officer
> *Semblent*
> *jkkel...@semblent.com <jkkel...@semblent.com>*
>
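
To make the design question in the quoted message concrete, here is a minimal sketch of keeping the 'INSERT ... INTO ...' out of the constructor, so that creating the object and writing the row become separate steps. Everything named here is an assumption for illustration: the Transaction and TransactionStore classes, the transactions table and its columns, and the RecordingSession stand-in are hypothetical and not taken from Johnny's code base. The store only assumes it is handed an object with an execute(query, parameters) method, which is the shape of the session object in the DataStax Python driver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """Plain in-memory object: constructing it touches no database."""
    tx_id: str
    amount: float


class TransactionStore:
    """Owns the INSERT, so callers decide when (and whether) to persist."""

    # Hypothetical table and column names, for illustration only.
    INSERT_CQL = "INSERT INTO transactions (tx_id, amount) VALUES (%s, %s)"

    def __init__(self, session):
        # `session` is any object exposing execute(query, parameters).
        # It is injected so a real session or a test double can be used.
        self._session = session

    def save(self, tx):
        self._session.execute(self.INSERT_CQL, (tx.tx_id, tx.amount))


class RecordingSession:
    """Stand-in session for local runs: records statements, sends nothing."""

    def __init__(self):
        self.statements = []

    def execute(self, query, parameters=None):
        self.statements.append((query, parameters))


session = RecordingSession()
store = TransactionStore(session)

tx = Transaction(tx_id="t-1", amount=9.99)  # no database work happens here
store.save(tx)                              # persistence is an explicit step
print(len(session.statements))
```

With this split, the input handler, the database insertion and the logging can each be called (or skipped, batched or retried) independently, which is the separation the quoted message is asking about. Whether those steps then run as Spark jobs in Python or Scala is a separate decision.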
