Re: Cassandra & Spark

2018-08-24 Thread Affan Syed
… >>> which then acts as a layer between Cassandra and your applications, storing into Cassandra (a memory data grid, I think it is called). Basically, think of it as a big cache. It is an in-memory thingi ☺ …

Re: Cassandra & Spark

2018-08-24 Thread CharSyam
… >> Cassandra (a memory data grid, I think it is called). Basically, think of it as a big cache. It is an in-memory thingi ☺ And then you can run some super fast queries. -Tobias From:…

Re: Cassandra & Spark

2018-08-24 Thread Affan Syed
…-Tobias > From: DuyHai Doan > Date: Thursday, 8 June 2017 at 15:42 > To: Tobias Eriksson > Cc: 한 승호, "user@cassandra.apache.org" > Subject: Re: Cassandra & Spark > Interesting…

Re: cassandra spark-connector-sqlcontext too many tasks

2018-03-17 Thread Ben Slater
I think that is probably a question for the Spark Connector forum: https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user as it’s much more related to the function of the connector than to the functionality of Cassandra itself. Cheers Ben On Sat, 17 Mar 2018 at 21:18 onmsteste…

Re: Cassandra/Spark failing to process large table

2018-03-08 Thread kurt greaves
Note that blocking read repairs only occur at QUORUM/equivalent and higher; on anything less than QUORUM (ONE/LOCAL_ONE) there is only a 10% (default) chance of a read repair. This is configured at the table level through the dclocal_read_repair_chance and read_repair_chance settings (which are going away in 4.0). So if yo…
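A minimal sketch of tuning those two table options with the Python driver; the contact point, keyspace, and table names are hypothetical, and the options only exist up to Cassandra 3.x:

    from cassandra.cluster import Cluster

    # Hypothetical contact point and names; dclocal_read_repair_chance
    # and read_repair_chance are table options that were removed in 4.0.
    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect("my_ks")

    # Turn off probabilistic read repair for sub-QUORUM reads entirely:
    session.execute("""
        ALTER TABLE my_table
        WITH dclocal_read_repair_chance = 0.0
        AND read_repair_chance = 0.0
    """)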

Re: Cassandra/Spark failing to process large table

2018-03-08 Thread Faraz Mateen
Hi Ben, That makes sense. I also read about "read repairs": once an inconsistent record is read, Cassandra synchronizes its replicas on the other nodes as well. I ran the same Spark query again, this time with the default consistency level (LOCAL_ONE), and the result was correct. Thanks again for the…

Re: Cassandra/Spark failing to process large table

2018-03-06 Thread Ben Slater
Hi Faraz, Yes, it likely does mean there is inconsistency in the replicas. However, you shouldn’t be too freaked out about it - Cassandra is designed to allow for this inconsistency to occur, and the consistency levels allow you to achieve consistent results despite replicas not being consistent. To k…
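To make the overlap argument concrete: with RF=3, a QUORUM write lands on at least 2 replicas and a QUORUM read consults 2, so every read set intersects every write set and at least one consulted replica holds the latest value. A minimal sketch with the Python driver, using hypothetical host/keyspace/table names:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect("my_ks")

    # With QUORUM writes and RF=3, the 2 replicas consulted here always
    # overlap the 2 replicas written, so the count reflects every write.
    stmt = SimpleStatement("SELECT count(*) FROM my_table",
                           consistency_level=ConsistencyLevel.QUORUM)
    print(session.execute(stmt).one())

(A full-table count like this may need a raised client timeout on large tables.)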

Re: Cassandra/Spark failing to process large table

2018-03-06 Thread Faraz Mateen
Thanks a lot for the response. Setting consistency to ALL/TWO started giving me consistent count results on both cqlsh and Spark. As expected, my query time has increased 1.5x (before, it was taking ~1.6 hours; with consistency level ALL, the same query takes ~2.4 hours to complete). Does…

Re: Cassandra/Spark failing to process large table

2018-03-03 Thread Ben Slater
Both cqlsh and the Spark Cassandra connector query at consistency level ONE (LOCAL_ONE for the Spark connector) by default, so if there is any inconsistency in your replicas this can result in inconsistent query results. See http://cassandra.apache.org/doc/latest/tools/cqlsh.html and https://github.com/datasta…
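For illustration, a pyspark sketch of raising the connector's read consistency above that LOCAL_ONE default; host and table names are hypothetical, and it assumes a connector version that exposes both the DataFrame source and the spark.cassandra.input.consistency.level setting:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.cassandra.connection.host", "10.0.0.1")
             # Raise reads from the LOCAL_ONE default to QUORUM:
             .config("spark.cassandra.input.consistency.level", "QUORUM")
             .getOrCreate())

    df = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_ks", table="my_table")
          .load())
    print(df.count())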

Re: Cassandra/Spark failing to process large table

2018-03-03 Thread Kant Kodali
The fact that cqlsh itself gives different results tells me that this has nothing to do with Spark. Moreover, the Spark results are monotonically increasing, which seems more consistent than cqlsh. So I believe Spark can be taken out of the equation. Now, while you are running these queries, is th…

Re: Cassandra & Spark

2017-06-08 Thread Tobias Eriksson
…storing into Cassandra (a memory data grid, I think it is called). Basically, think of it as a big cache. It is an in-memory thingi ☺ And then you can run some super fast queries. -Tobias From: DuyHai Doan Date: Thursday, 8 June 2017 at 15:42 To: Tobias Eriksson Cc: 한 승호, "user@cassandra.apache.org"…

Re: Cassandra & Spark

2017-06-08 Thread DuyHai Doan
Interesting Tobias, when you said "Instead we transferred the data to Apache Kudu", did you transfer all Cassandra data into Kudu with a single migration and then tap into Kudu for aggregation, or did you run a data import every day/week/month from Cassandra into Kudu? From my point of view,…

Re: Cassandra & Spark

2017-06-08 Thread Tobias Eriksson
Hi, Something to consider before moving to Apache Spark and Cassandra: I have a background where we have tons of data in Cassandra, and we wanted to use Apache Spark to run various jobs. We loved what we could do with Spark, BUT… we soon realized that we wanted to run multiple jobs in parallel. Some…

Re: Cassandra & Spark

2017-06-08 Thread Kant Kodali
If you use containers like Docker, Plan A can work provided you do the resource and capacity planning. I tend to think that Plan B is more standard and easier, although you can wait to hear from others for a second opinion. Caution: data locality will make sense if the disk throughput is significant…

Re: Cassandra - Spark - Flume: best architecture for log analytics.

2015-07-23 Thread Edward Ribeiro
Disclaimer: I have worked for DataStax. Cassandra is fairly good for log analytics and has been used in many places for that (https://www.usenix.org/conference/lisa14/conference-program/presentation/josephsen). Of course, requirements vary from place to place, but it has been a good fit. Spark and…

Re: Cassandra - Spark - Flume: best architecture for log analytics.

2015-07-23 Thread Ipremyadav
Though DSE Cassandra comes with Hadoop integration, this is clearly a use case for Hadoop. Any reason why Cassandra is your first choice? > On 23 Jul 2015, at 6:12 a.m., Pierre Devops wrote: > Cassandra is not very good at massive read/bulk read if you need to retrieve and compute a la…

Re: Cassandra - Spark - Flume: best architecture for log analytics.

2015-07-22 Thread Pierre Devops
Cassandra is not very good at massive read/bulk read if you need to retrieve and compute a large amount of data on multiple machines using something like Spark or Hadoop (otherwise you'll need to process the SSTables directly, something which is not "natively" supported; you'll have to hack your w…

Re: cassandra + spark / pyspark

2014-09-12 Thread Francisco Madrid-Salvador
Hi Oleg, Connectors don't deal with HA; they rely on Spark for that, so neither the DataStax connector, Stratio Deep nor Calliope have anything to do with Spark's HA. You should have previously configured Spark so that it meets your high-availability needs. Furthermore, as I mentioned in a pr…

Re: cassandra + spark / pyspark

2014-09-11 Thread Oleg Ruchovets
Thank you Rohit. I sent the email to you. Thanks, Oleg. On Thu, Sep 11, 2014 at 10:51 PM, Rohit Rai wrote: > Hi Oleg, I am the creator of Calliope. Calliope doesn't force any deployment model... that means you can run it with Mesos or Hadoop or standalone. To be fair, I think the…

Re: cassandra + spark / pyspark

2014-09-11 Thread Rohit Rai
Hi Oleg, I am the creator of Calliope. Calliope doesn't force any deployment model... that means you can run it with Mesos or Hadoop or standalone. To be fair, I think the other libs mentioned here should work too. Spark cluster HA can be provided using ZooKeeper even in the standalone d…

Re: cassandra + spark / pyspark

2014-09-11 Thread Oleg Ruchovets
OK. DataStax and Stratio require Mesos, Hadoop YARN, or another third party to get Spark cluster HA. What about Calliope? Is it sufficient to have Cassandra + Calliope + Spark to be able to process aggregations? In my case we have quite a lot of data, so doing aggregation only in memory is impossi…

Re: cassandra + spark / pyspark

2014-09-11 Thread DuyHai Doan
2. "still uses thrift for minor stuff" --> I think that the only call using thrift is "describe_ring" to get an estimate of ratio of partition keys within the token range 3. Stratio has a talk today at the SF Summit, presenting Stratio META. For the folks not attending the conference, video should

Re: cassandra + spark / pyspark

2014-09-11 Thread abhinav chowdary
Adding to the conversation... there are 3 great open-source options available. 1. Calliope http://tuplejump.github.io/calliope/ This is the first library that came out, some time late last year (as I recall), and I have been using it for a while; mostly very stable, uses Hadoop I/O in Cassandra…

Re: cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
Typo. I am talking about Spark only. Thanks, Oleg. On Thursday, September 11, 2014, DuyHai Doan wrote: > Stupid question: do you really need both Storm & Spark? Can't you implement the Storm jobs in Spark? It will be operationally simpler to have fewer moving parts. I'm not saying that Stor…

Re: cassandra + spark / pyspark

2014-09-10 Thread Paco Madrid
Hi Oleg. Spark can be configured to have high availability without the need for Mesos (https://spark.apache.org/docs/latest/spark-standalone.html#high-availability), for instance using ZooKeeper and standby masters. If I'm not mistaken, Storm doesn't need Mesos to work, so I imagine you use it to mak…
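A sketch of the recovery settings that linked page describes; they belong in the master daemons' environment rather than in application code, and the ZooKeeper hosts below are hypothetical:

    # conf/spark-env.sh on each (standby) master - hostnames hypothetical
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

With that in place you start additional masters on other hosts; ZooKeeper elects a leader, and workers re-register with the new leader on failover.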

Re: cassandra + spark / pyspark

2014-09-10 Thread Paco Madrid
Good to know. Thanks, DuyHai! I'll take a look (but most probably tomorrow ;-)) Paco 2014-09-10 20:15 GMT+02:00 DuyHai Doan: > Source code check for the Java version: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector-java/src/main/java/com/datastax/sp…

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
Stupid question: do you really need both Storm & Spark? Can't you implement the Storm jobs in Spark? It will be operationally simpler to have fewer moving parts. I'm not saying that Storm is not the right fit; it may be totally suitable for some usages. But if you want to avoid the SPOF thing an…

Re: cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
Interesting things, actually: we have Hadoop in our ecosystem. It has a single point of failure, and I am not sure about inter-data-center replication. The plan is to use Cassandra - no single point of failure, and there is data center replication. For aggregation/transformation we would use Spark. BUT Storm r…

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
Source code check for the Java version: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector-java/src/main/java/com/datastax/spark/connector/RDDJavaFunctions.java#L26 It's using the RDDFunctions from the Scala code, so yes, it's the Java driver again. On Wed, Sep 10…

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
"As far as I know, the Datastax connector uses thrift to connect Spark with Cassandra although thrift is already deprecated, could someone confirm this point?" --> the Scala connector is using the latest Java driver, so no there is no Thrift there. For the Java version, I'm not sure, have not lo

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
"can you share please where can I read about mesos integration for HA and StandAlone mode execution?" --> You can find all the info in the Spark documentation, read this: http://spark.apache.org/docs/latest/cluster-overview.html Basically, you have 3 choices: 1) Stand alone mode: get your hands

Re: cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
Thanks for the info. Can you please share where I can read about Mesos integration for HA and standalone-mode execution? Thanks, Oleg. On Thu, Sep 11, 2014 at 12:13 AM, DuyHai Doan wrote: > Hello Oleg, Question 2: yes. The official Spark Cassandra connector can be found here: https://git…

Re: cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
Great stuff, Paco. Thanks for sharing. A couple of questions: Does it require an additional installation, like Apache Mesos, to be HA? Do you support PySpark? How stable / production-ready is it? Thanks, Oleg. On Thu, Sep 11, 2014 at 12:01 AM, Francisco Madrid-Salvador <pmad...@stratio.com> wrote: >…

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
Hello Oleg, Question 2: yes. The official Spark Cassandra connector can be found here: https://github.com/datastax/spark-cassandra-connector There are docs in the doc/ folder. You can read & write directly from/to Cassandra without EVER using HDFS. You still need a resource manager like Apache Meso…
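A minimal pyspark illustration of that no-HDFS read/write path; keyspace and table names are hypothetical, and it assumes a connector version that ships the DataFrame source (the 2014-era connector exposed a Scala RDD API):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.cassandra.connection.host", "10.0.0.1")
             .getOrCreate())

    # Read a table straight out of Cassandra - no HDFS involved:
    df = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_ks", table="events")
          .load())

    # ...aggregate df..., then write back (the target table must exist):
    (df.write.format("org.apache.spark.sql.cassandra")
       .options(keyspace="my_ks", table="events_by_day")
       .mode("append")
       .save())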