Re: cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
Typo. I am talking about spark only. Thanks Oleg. On Thursday, September 11, 2014, DuyHai Doan wrote: > Stupid question: do you really need both Storm & Spark ? Can't you > implement the Storm jobs in Spark ? It will be operationally simpler to > have less moving parts. I'm not saying that Stor

Re: Mutation Stage does not finish

2014-09-10 Thread Benedict Elliott Smith
Could you post the results of jstack on the process somewhere? On Thu, Sep 11, 2014 at 7:07 AM, Robert Coli wrote: > On Wed, Sep 10, 2014 at 1:53 PM, Eduardo Cusa < > eduardo.c...@usmediaconsulting.com> wrote: > >> No, is still running the Mutation Stage. >> > > If you're sure that it is not re

Re: Mutation Stage does not finish

2014-09-10 Thread Robert Coli
On Wed, Sep 10, 2014 at 1:53 PM, Eduardo Cusa < eduardo.c...@usmediaconsulting.com> wrote: > No, it is still running the Mutation Stage. > If you're sure that it is not receiving Hinted Handoff, then the only mutations in question can be from the replay of the commit log. The commit log should take

Re: Performance testing in Cassandra

2014-09-10 Thread Benedict Elliott Smith
With the official release of 2.1, I highly recommend using the new stress tool bundled with it - it is improved in many ways over the tool in 2.0, and is compatible with older clusters. It supports the same simple mode of operation as the old stress, with a better command line interface and more acc

Re: Mutation Stage does not finish

2014-09-10 Thread Eduardo Cusa
No, it is still running the Mutation Stage. On Wed, Sep 10, 2014 at 5:38 PM, Robert Coli wrote: > On Wed, Sep 10, 2014 at 12:16 PM, Eduardo Cusa < > eduardo.c...@usmediaconsulting.com> wrote: > >> Yes, I restarted the node because the write latency was 2500 ms, when >> it is usually 5 ms. >> > > And

Re: Mutation Stage does not finish

2014-09-10 Thread Robert Coli
On Wed, Sep 10, 2014 at 12:16 PM, Eduardo Cusa < eduardo.c...@usmediaconsulting.com> wrote: > Yes, I restarted the node because the write latency was 2500 ms, when > it is usually 5 ms. > And did that help? =Rob

Re: Storage: upsert vs. delete + insert

2014-09-10 Thread graham sanderson
agreed On Sep 10, 2014, at 3:27 PM, olek.stas...@gmail.com wrote: > You're right, there is no data in a tombstone, only a column name. So > there is only a small overhead in disk size after a delete. But I must > agree with the post above: it's pointless to delete prior to inserting. > Moreover, it needs

Re: Storage: upsert vs. delete + insert

2014-09-10 Thread olek.stas...@gmail.com
You're right, there is no data in a tombstone, only a column name. So there is only a small overhead in disk size after a delete. But I must agree with the post above: it's pointless to delete prior to inserting. Moreover, it needs one more op to compute the resulting row. cheers, Olek 2014-09-10 22:18 GMT+02

Re: Storage: upsert vs. delete + insert

2014-09-10 Thread graham sanderson
A delete inserts a tombstone, which is likely smaller than the original record (though it still (currently) has the overhead of the full key/column name). The data for the insert after a delete would be identical to the data if you just inserted/updated, so there is no real benefit I can think of for doing the dele
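
For concreteness, a minimal sketch of the two write patterns being compared, using the DataStax Java driver from Scala (contact point, keyspace, and table names are hypothetical). The end state of the row is the same either way, but the second variant also leaves a tombstone behind:

    import com.datastax.driver.core.Cluster

    object UpsertVsDeleteInsert {
      def main(args: Array[String]): Unit = {
        // Hypothetical contact point and schema, for illustration only.
        val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
        val session = cluster.connect()
        session.execute(
          "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = " +
            "{'class': 'SimpleStrategy', 'replication_factor': 1}")
        session.execute(
          "CREATE TABLE IF NOT EXISTS demo.a (key int, version int, c1 int, val text, " +
            "PRIMARY KEY ((key), version, c1))")

        // Option 1: plain upsert -- the second INSERT simply overwrites the cell.
        session.execute("INSERT INTO demo.a (key, version, c1, val) VALUES (1, 1, 4711, 'X1')")
        session.execute("INSERT INTO demo.a (key, version, c1, val) VALUES (1, 1, 4711, 'X2')")

        // Option 2: delete + insert -- same end state, but the DELETE also writes a tombstone.
        session.execute("DELETE FROM demo.a WHERE key = 1 AND version = 1 AND c1 = 4711")
        session.execute("INSERT INTO demo.a (key, version, c1, val) VALUES (1, 1, 4711, 'X2')")

        cluster.close()
      }
    }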

Node being rebuilt receives read requests

2014-09-10 Thread Tom van den Berge
I have a datacenter with a single node, and I want to start using vnodes. I have followed the instructions ( http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html), and set up a new node in a new datacenter (auto_bootstrap=false, seed=node in old dc,

Re: Storage: upsert vs. delete + insert

2014-09-10 Thread olek.stas...@gmail.com
I think so. This is how I see it: at the very beginning you have such a line in the data file: {key: [col_name, col_value, date_of_last_change]} //something similar, I don't remember now. After a delete you're adding the line: {key: [col_name, last_col_value, date_of_delete, 'd']} //this d indicates that the field i

Re: Mutation Stage does not finish

2014-09-10 Thread Eduardo Cusa
Yes, I restarted the node because the write latency was 2500 ms, when it is usually 5 ms. On Wed, Sep 10, 2014 at 4:05 PM, Robert Coli wrote: > On Wed, Sep 10, 2014 at 12:03 PM, Eduardo Cusa < > eduardo.c...@usmediaconsulting.com> wrote: > >> Yes, tpstats is printing. OpCenter shows the no

Re: Mutation Stage does not finish

2014-09-10 Thread Eduardo Cusa
Yes, tpstats is printing. OpCenter shows the node down. Also, OpCenter shows the following status: Status: Unresponsive - Starting, Gossip: Down, Thrift: Down, Native Transport: Down, Pending Tasks: 0 On Wed, Sep 10, 2014 at 3:54 PM, Robert Coli wrote: > On Wed, Sep 10, 2014 at 11:38 AM

Re: cassandra + spark / pyspark

2014-09-10 Thread Paco Madrid
Hi Oleg. Spark can be configured to have high availability without the need for Mesos ( https://spark.apache.org/docs/latest/spark-standalone.html#high-availability), for instance using Zookeeper and standby masters. If I'm not wrong, Storm doesn't need Mesos to work, so I imagine you use it to mak
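
A minimal sketch of what the application side of such a setup could look like, assuming ZooKeeper-based recovery is already enabled on the standby masters (hostnames and app name are hypothetical); the application simply lists all masters and fails over to whichever one is active:

    import org.apache.spark.{SparkConf, SparkContext}

    object StandaloneHaSketch {
      def main(args: Array[String]): Unit = {
        // Hypothetical master hostnames; the comma-separated list lets the app
        // register with every master and follow the one that is currently active.
        val conf = new SparkConf()
          .setAppName("standalone-ha-demo")
          .setMaster("spark://master1:7077,master2:7077")

        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 100).reduce(_ + _)) // trivial job to verify the connection
        sc.stop()
      }
    }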

Re: Mutation Stage does not finish

2014-09-10 Thread Robert Coli
On Wed, Sep 10, 2014 at 12:03 PM, Eduardo Cusa < eduardo.c...@usmediaconsulting.com> wrote: > Yes, tpstats is printing. OpCenter shows the node down. > Have you recently restarted it or anything? If not, try doing so? =Rob

Re: Mutation Stage does not finish

2014-09-10 Thread Robert Coli
On Wed, Sep 10, 2014 at 11:38 AM, Eduardo Cusa < eduardo.c...@usmediaconsulting.com> wrote: > Actually the node is *down*. > The node can't be that "down" if it's printing tpstats... https://issues.apache.org/jira/browse/CASSANDRA-4162 ? =Rob

Re: cassandra + spark / pyspark

2014-09-10 Thread Paco Madrid
Good to know. Thanks, DuyHai! I'll take a look (but most probably tomorrow ;-)) Paco 2014-09-10 20:15 GMT+02:00 DuyHai Doan : > Source code check for the Java version: > https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector-java/src/main/java/com/datastax/sp

Re: Atomic batch of counters in Cassandra 2.1

2014-09-10 Thread Eugene Voytitsky
On 10.09.14 02:09, Robert Coli wrote: On Tue, Sep 9, 2014 at 2:36 PM, Eugene Voytitsky wrote: As I understand, an atomic batch of counters can't work correctly (atomically) prior to 2.1 because of the counters implementation. [Link: http://www.datastax.com/de
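
For readers who have not used them, this is roughly what a counter batch looks like, issued here through the DataStax Java driver from Scala (contact point, keyspace, and table are hypothetical); whether such a batch behaves atomically before 2.1 is exactly the question raised above:

    import com.datastax.driver.core.Cluster

    object CounterBatchSketch {
      def main(args: Array[String]): Unit = {
        // Hypothetical contact point and counter table, for illustration only.
        val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
        val session = cluster.connect()
        session.execute(
          "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = " +
            "{'class': 'SimpleStrategy', 'replication_factor': 1}")
        session.execute(
          "CREATE TABLE IF NOT EXISTS demo.page_views (page text PRIMARY KEY, views counter)")

        // A counter batch groups several counter updates into a single mutation.
        session.execute(
          """BEGIN COUNTER BATCH
            |  UPDATE demo.page_views SET views = views + 1 WHERE page = '/home';
            |  UPDATE demo.page_views SET views = views + 1 WHERE page = '/about';
            |APPLY BATCH""".stripMargin)

        cluster.close()
      }
    }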

Mutation Stage does not finish

2014-09-10 Thread Eduardo Cusa
Hello, I have a node that has been in MutationStage for the last 5 hours. Actually the node is *down*. The pending tasks go from 776 to 110 and then to 964. Is there some way to finish this stage? The last heavy write workload was 5 days ago. Pool Name  Active  Pending  Com

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
Stupid question: do you really need both Storm & Spark ? Can't you implement the Storm jobs in Spark ? It will be operationally simpler to have less moving parts. I'm not saying that Storm is not the right fit, it may be totally suitable for some usages. But if you want to avoid the SPOF thing an

Re: cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
Interesting, actually: we have hadoop in our eco system. It has a single point of failure and I am not sure about inter-data-center replication. The plan is to use cassandra - no single point of failure, and there is data center replication - with SPARK for aggregation/transformation. BUT storm r

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
Source code check for the Java version: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector-java/src/main/java/com/datastax/spark/connector/RDDJavaFunctions.java#L26 It's using the RDDFunctions from the Scala code, so yes, it's the Java driver again. On Wed, Sep 10

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
"As far as I know, the Datastax connector uses thrift to connect Spark with Cassandra although thrift is already deprecated, could someone confirm this point?" --> the Scala connector is using the latest Java driver, so no there is no Thrift there. For the Java version, I'm not sure, have not lo

cassandra + spark / pyspark

2014-09-10 Thread Francisco Madrid-Salvador
Hi Oleg, Stratio Deep is just a library you must include in your Spark deployment so it doesn't guarantee any high availability at all. To achieve HA you must use Mesos or any other 3rd party resource manager. Stratio doesn't currently support PySpark, just Scala and Java. Perhaps in the fut

Cassandra - What really happens once the key cache gets filled

2014-09-10 Thread Job Thomas
Consider that I have configured 1 MB of key cache (consider it can hold 13000 keys). Then I wrote some records into a column family (say 2). Then I read them for the first time (all keys sequentially, in the same order used to write), and the keys started to be stored in the key cache. When the read reached @

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
"can you share please where can I read about mesos integration for HA and StandAlone mode execution?" --> You can find all the info in the Spark documentation, read this: http://spark.apache.org/docs/latest/cluster-overview.html Basically, you have 3 choices: 1) Stand alone mode: get your hands

Re: Storage: upsert vs. delete + insert

2014-09-10 Thread Michal Budzyn
Would the factor before compaction always be 2? On Wed, Sep 10, 2014 at 6:38 PM, olek.stas...@gmail.com < olek.stas...@gmail.com> wrote: > IMHO, delete then insert will take two times more disk space than a > single insert. But after compaction the difference will disappear. > This was true in ver

Re: cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
Thanks for the info. Can you please share where I can read about Mesos integration for HA and standalone mode execution? Thanks Oleg. On Thu, Sep 11, 2014 at 12:13 AM, DuyHai Doan wrote: > Hello Oleg > > Question 2: yes. The official spark cassandra connector can be found here: > https://git

Re: cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
Great stuff, Paco. Thanks for sharing. A couple of questions: Does it require an additional installation, like Apache Mesos, to be HA? Do you support PySpark? How stable/ready for production is it? Thanks Oleg. On Thu, Sep 11, 2014 at 12:01 AM, Francisco Madrid-Salvador < pmad...@stratio.com> wrote: >

Re: Storage: upsert vs. delete + insert

2014-09-10 Thread olek.stas...@gmail.com
IMHO, delete then insert will take two times more disk space than a single insert. But after compaction the difference will disappear. This was true in versions prior to 2.0, but it should still work this way. But maybe someone will correct me if I'm wrong. Cheers, Olek 2014-09-10 18:30 GMT+02:00 Mi

Re: Storage: upsert vs. delete + insert

2014-09-10 Thread Michal Budzyn
One insert would be much better, e.g. for performance and network latency. I wanted to know if there is a significant difference (apart from the additional commit log entry) in the used storage between these two use cases.

Re: multi datacenter replication

2014-09-10 Thread Jonathan Haddad
Multi-dc is available in every version of Cassandra. On Wed, Sep 10, 2014 at 9:21 AM, Oleg Ruchovets wrote: > Thank you very much for the links. > Just to be sure: is this capability available for the COMMUNITY EDITION? > > Thanks > Oleg. > > On Wed, Sep 10, 2014 at 11:49 PM, Alain RODRIGUEZ > wr

Re: multi datacenter replication

2014-09-10 Thread Oleg Ruchovets
Thank you very much for the links. Just to be sure: is this capability available for the COMMUNITY EDITION? Thanks Oleg. On Wed, Sep 10, 2014 at 11:49 PM, Alain RODRIGUEZ wrote: > Hi Oleg, > > Yes, cross-DC replication has been available for a long time already, > so it is assumed to be stabl

Re: cassandra + spark / pyspark

2014-09-10 Thread DuyHai Doan
Hello Oleg Question 2: yes. The official spark cassandra connector can be found here: https://github.com/datastax/spark-cassandra-connector There are docs in the doc/ folder. You can read & write directly from/to Cassandra without EVER using HDFS. You still need a resource manager like Apache Meso
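
A minimal sketch of that read/write path with the connector, from Scala (Cassandra host, keyspace, tables, and column names are hypothetical):

    import com.datastax.spark.connector._
    import org.apache.spark.SparkContext._
    import org.apache.spark.{SparkConf, SparkContext}

    object SparkCassandraSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("cassandra-rdd-demo")
          .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical node address
        val sc = new SparkContext(conf)

        // Read rows straight out of a Cassandra table -- no HDFS involved.
        val rows = sc.cassandraTable("demo", "page_views")

        // Aggregate and write the result back into another Cassandra table.
        rows.map(r => (r.getString("page"), 1L))
            .reduceByKey(_ + _)
            .saveToCassandra("demo", "page_counts", SomeColumns("page", "views"))

        sc.stop()
      }
    }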

Re: Storage: upsert vs. delete + insert

2014-09-10 Thread Shane Hansen
My understanding is that an update is the same as an insert. So I would think delete+insert is a bad idea. Also insert+delete would put 2 entries in the commit log. On Sep 10, 2014 9:49 AM, "Michal Budzyn" wrote: > Is there any serious difference in the used disk and memory storage > between upser

cassandra + spark / pyspark

2014-09-10 Thread Francisco Madrid-Salvador
Hi Oleg, If you want to use cassandra+spark without hadoop, perhaps Stratio Deep is your best choice (https://github.com/Stratio/stratio-deep). It's an open-source Spark + Cassandra connector that doesn't make any use of Hadoop or any Hadoop component. http://docs.openstratio.org/deep/0.3.3/abou

Re: multi datacenter replication

2014-09-10 Thread Alain RODRIGUEZ
Hi Oleg, Yes, cross-DC replication has been available for a long time already, so it is assumed to be stable. As discussed in this thread, the Cassandra documentation is often outdated or nonexistent; the alternative is the DataStax one. http://www.datastax.com/documentation/cassandra/2.0/cassandra/in
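
For reference, cross-DC replication is driven entirely by the keyspace definition; a minimal sketch issued through the DataStax Java driver from Scala (contact point, data center names, and replication factors are hypothetical):

    import com.datastax.driver.core.Cluster

    object MultiDcKeyspaceSketch {
      def main(args: Array[String]): Unit = {
        val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
        val session = cluster.connect()

        // NetworkTopologyStrategy takes a replica count per data center;
        // the DC names must match what the snitch reports.
        session.execute(
          "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = " +
            "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}")

        cluster.close()
      }
    }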

Storage: upsert vs. delete + insert

2014-09-10 Thread Michal Budzyn
Is there any serious difference in the used disk and memory storage between upsert and delete + insert? E.g. 2 vs. 2A + 2B. PK ((key), version, c1) 1. INSERT INTO A (key, version, c1, val) VALUES (1, 1, 4711, 'X1') ... 2. INSERT INTO A (key, version, c1, val) VALUES (1, 1, 4711, 'X2') Vs. 2A

cassandra + spark / pyspark

2014-09-10 Thread Oleg Ruchovets
Hi, I am trying to evaluate different options for spark + cassandra and I have a couple of questions. My aim is to use cassandra+spark without hadoop: 1) Is it possible to use only cassandra as the input/output for PySpark? 2) In case I use Spark (java, scala), is it possible to use only cass

multi datacenter replication

2014-09-10 Thread Oleg Ruchovets
Hi All. Is the multi datacenter replication capability available in the community edition? If yes, can someone share their experience of how stable it is, and where can I read about best practices for it? Thanks Oleg.

Re: Questions about cleaning up/purging Hinted Handoffs

2014-09-10 Thread Rahul Neelakantan
Will try... Thank you Rahul Neelakantan > On Sep 10, 2014, at 12:01 AM, Rahul Menon wrote: > > I use jmxterm. http://wiki.cyclopsgroup.org/jmxterm/ Attach it to your c* > process, then use the org.apache.cassandra.db:HintedHandoffManager bean > and run deleteHintsForEndpoint to drop hint
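
If jmxterm is not at hand, the same operation can be invoked over plain JMX; a minimal sketch from Scala, assuming the node's JMX port is the default 7199 and that the bean is registered as org.apache.cassandra.db:type=HintedHandoffManager (node address and target endpoint are hypothetical):

    import javax.management.ObjectName
    import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

    object DeleteHintsSketch {
      def main(args: Array[String]): Unit = {
        val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi")
        val connector = JMXConnectorFactory.connect(url)
        try {
          val mbeans = connector.getMBeanServerConnection
          val hhm = new ObjectName("org.apache.cassandra.db:type=HintedHandoffManager")
          // Drop all hints stored for the given endpoint.
          mbeans.invoke(hhm, "deleteHintsForEndpoint",
            Array[AnyRef]("10.0.0.1"), Array("java.lang.String"))
        } finally {
          connector.close()
        }
      }
    }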