Re: cassandra + spark / pyspark

Francisco Madrid-Salvador Fri, 12 Sep 2014 07:03:43 -0700

Hi Oleg,

Connectors don't deal with HA; they rely on Spark for that. So neither the Datastax connector, Stratio Deep, nor Calliope has anything to do with Spark's HA. You should already have configured Spark so that it meets your high availability needs. Furthermore, as I mentioned in a previous answer, Spark can be configured for high availability without the use of Mesos; see https://spark.apache.org/docs/latest/spark-standalone.html#high-availability for more information.

The three of them have similar features, so all of them seem good choices. One of the highlights of Stratio Deep is that it can connect to multiple databases, not just Cassandra (currently Cassandra and MongoDB, with more on the roadmap). Also take into account that Stratio Deep's integration with Cassandra was developed from the ground up, making no use of Hadoop at all.

On the other hand, Spark does in-memory computation, but this doesn't mean it can't process data that doesn't fit in memory. It will use disk if told to, and quoting the official Spark FAQ, "Spark can either spill it to disk or recompute the partitions that don't fit in RAM each time they are requested. By default, it uses recomputation, but you can set a dataset's storage level to MEMORY_AND_DISK to avoid this."
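
To make the storage-level point concrete, here is a minimal PySpark sketch. It is only an illustration under assumptions: the helper name `aggregate_twice`, the local master, the app name and the tiny stand-in dataset are all made up for the example, and it does not read from Cassandra (how you obtain the RDD depends on the connector you pick). The only part taken from the FAQ quote is persisting with `StorageLevel.MEMORY_AND_DISK` so that partitions that don't fit in RAM are kept on disk instead of being recomputed on each use.

```python
def aggregate_twice(rdd, storage_level):
    """Persist a (key, value) RDD, then run two aggregations over it.

    Because the RDD is used twice, its storage level decides whether
    partitions that don't fit in memory are recomputed or read back
    from disk for the second pass.
    """
    persisted = rdd.persist(storage_level)
    totals = persisted.reduceByKey(lambda a, b: a + b).collectAsMap()
    counts = dict(persisted.countByKey())
    return totals, counts


def main():
    # Imported here so the helper above does not need Spark to be imported.
    from pyspark import SparkContext, StorageLevel

    # "local[2]" and the app name are placeholders for this sketch.
    sc = SparkContext("local[2]", "memory-and-disk-sketch")
    try:
        # Stand-in data; in practice this would be an RDD built by a connector.
        pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
        totals, counts = aggregate_twice(pairs, StorageLevel.MEMORY_AND_DISK)
        print(totals, counts)
    finally:
        sc.stop()
```

Calling `main()` runs the sketch against a local Spark; on a real cluster only the way the `SparkContext` and the input RDD are created would change.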


On 11/09/14 at #4, Oleg Ruchovets wrote:

Ok.

DataStax and Stratio require Mesos, Hadoop YARN, or some other third party to get Spark cluster HA.


What about Calliope?

Is it sufficient to have Cassandra + Calliope + Spark to be able to process aggregations? In my case we have quite a lot of data, so doing the aggregation only in memory is impossible.


Does Calliope support a non-in-memory mode for Spark?

Thanks
Oleg.

On Thu, Sep 11, 2014 at 9:23 PM, abhinav chowdary <abhinav.chowd...@gmail.com> wrote:


    Adding to conversation...

    there are 3 great open source options available

    1. Calliope http://tuplejump.github.io/calliope/
        This was the first library out, some time late last year (as
    I recall), and I have been using it for a while. It is mostly
    very stable and uses the Hadoop I/O in Cassandra (note that it
    doesn't require Hadoop).

    2. Datastax spark cassandra connector
    https://github.com/datastax/spark-cassandra-connector
        The main difference is that this uses CQL3. Again a great
    library, but it has a few issues. It is very actively developed
    and still uses Thrift for minor stuff, but all the heavy lifting
    is done in CQL3.

    3. Stratio Deep https://github.com/Stratio/stratio-deep
        Has a lot more to offer if you use the whole Stratio stack:
    Deep is for Spark, Stratio Streaming is built on top of Spark
    Streaming, Stratio Meta is something similar to Shark or Spark
    SQL, and finally Stratio Cassandra is a fork of Cassandra with
    advanced Lucene-based indexing.
