Re: Replicating Cassandra data to HDFS

2016-08-09 Thread Jonathan Haddad
My understanding is that all CDC really is now is a stable commit log reader. For a given mutation on an RF=3 system, you'll end up with 3 readers that all *could* do some action. For now let's just say "put it in a Kafka topic" because that lets us do anything we want after that. I suppose the m
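
To make the RF=3 point concrete, here is a minimal sketch of what a consumer on the other side of that Kafka topic might do to drop the repeated copies. It is not taken from the thread or from CASSANDRA-8844: the `DedupWindow` class, the `(key, write_ts, payload)` event shape, and the in-memory window are all assumptions made for illustration.

```python
import hashlib
from collections import OrderedDict


class DedupWindow:
    """Hypothetical helper: remembers the most recent `capacity` mutation digests."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self._seen = OrderedDict()  # digest -> True, oldest entry first

    def first_time(self, key, write_ts, payload):
        """Return True only the first time this exact mutation is seen."""
        raw = repr((key, write_ts, payload)).encode("utf-8")
        digest = hashlib.sha256(raw).hexdigest()
        if digest in self._seen:
            self._seen.move_to_end(digest)  # refresh recency of a repeat
            return False
        self._seen[digest] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest digest
        return True


def dedupe(events, window):
    """Keep only events the window has not seen before."""
    return [e for e in events if window.first_time(*e)]


# Assumed event shape: (partition key, write timestamp, payload).
# With RF=3, each replica's reader emits its own copy of the same mutation.
events = [("user:1", 1470700000000, "name=ann")] * 3
events.append(("user:2", 1470700000001, "name=bob"))
unique = dedupe(events, DedupWindow())
```

The window is bounded and lives in one process, so a copy that arrives after its digest has been evicted, or after a consumer restart, would still get through; a real deployment would need durable, shared state and its own failure handling.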

Re: Incremental repairs leading to unrepaired data

2016-08-09 Thread Paulo Motta
Anticompaction throttling can be done by setting the usual compaction_throughput_mb_per_sec knob in cassandra.yaml or via nodetool setcompactionthroughput. Did you try lowering that and checking whether that reduces the dropped mutations? 2016-08-09 13:32 GMT-03:00 Stefano Ortolani : > Hi all, > > I

Re: Replicating Cassandra data to HDFS

2016-08-09 Thread Ryan Svihla
Jon, You know I've not actually spent the hour to read the ticket, so I was just guessing it didn't handle dedup. All the same semantics apply, though: you'd have to do a read before write and then allow some window of failure mode. Maybe if you used LWT for everything, but that sounds really slow...

Re: Replicating Cassandra data to HDFS

2016-08-09 Thread Jonathan Haddad
I'm having a hard time seeing how anyone would be able to work with CDC in its current implementation of not doing any dedupe. Unless you really want to write all your own logic for that, including failure handling + a distributed state machine, I wouldn't count on it as a solution. On Tue, Aug

Re: Replicating Cassandra data to HDFS

2016-08-09 Thread Ryan Svihla
You can follow the monster of a ticket https://issues.apache.org/jira/browse/CASSANDRA-8844 and see if it looks like the tradeoffs there are headed in the right direction for you. Even CDC, I think, would logically have the same issue of not deduping for you as triggers and dual write due to repl

Re: Replicating Cassandra data to HDFS

2016-08-09 Thread Ben Vogan
Thanks Ryan. I was hoping there was a change data capture framework. We have late arriving events, some of which can be very late. We would have to batch collect data for a large time period every so often to go back and collect those or accept that we are going to lose a small percentage of eve

Re: Replicating Cassandra data to HDFS

2016-08-09 Thread Ryan Svihla
The typical pattern I've seen in the field is Kafka + consumers for each destination (a variant of dual write, I know). This of course would not work for your goal of relying on C* for dedup. Triggers would also suffer the same problem, unfortunately, so you're really left with a batch job (most like

Incremental repairs leading to unrepaired data

2016-08-09 Thread Stefano Ortolani
Hi all, I am running incremental repairs on a weekly basis (can't do it every day as one single run takes 36 hours), and every time, I have at least one node dropping mutations as part of the process (this almost always happens during the anticompaction phase). Ironically this leads to a system where repa

Replicating Cassandra data to HDFS

2016-08-09 Thread Ben Vogan
Hi all, We are investigating using Cassandra in our data platform. We would like data to go into Cassandra first and to eventually be replicated into our data lake in HDFS for long term cold storage. Does anyone know of a good way of doing this? We would rather not have parallel writes to HDFS

Re: Stale value appears after consecutive TRUNCATE

2016-08-09 Thread Yuji Ito
Thanks Christian. > can you reproduce the behaviour with a single node? I tried my test with a single node, but I can't. > This behaviour seems to be CQL only, or at least has gotten worse with > CQL. I did not experience this with Thrift. I truncate tables with CQL. I've never tried with Thrift.

Re: OutOfMemoryError when initializing a secondary index

2016-08-09 Thread Carlos Alonso
If you're deleting all traces of the index you probably want to look at the commit log as they are probably being recreated from there. Hope it helps. Carlos Alonso | Software Engineer | @calonso On 5 August 2016 at 23:05, Charlie Moad wrote: > Running Cassandra 3

Re: Verify cassandra backup and restore in C * 2.1

2016-08-09 Thread Riccardo Ferrari
Hi Indranil, I think it really depends on what makes a backup "correct" for you. Do you have some test you can run on that data? When I want to test my data I usually restore it in a new cluster (e.g. on AWS) and use Spark to perform some cross-tests. This is a bit cumbersome but nevertheless does the