Hi Ryan,

As I said, saveToCassandra doesn't support DELETE. That is why I modified the spark-cassandra-connector code to allow DELETEs. What I changed is how an RDD row is bound into a batch of CQL PreparedStatements.
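
To make the idea concrete, here is a minimal sketch of the row-to-batch step. It is not the modified connector code: the `Row` shape, the `ks.tbl` table, the class name `RowBatcher` and the fixed batch size are all assumptions for illustration, and it avoids the driver so it runs on its own. With the real Java driver, each `BoundCql` would correspond to a `PreparedStatement.bind(...)` result and each inner list to one `BatchStatement` passed to `session.execute(...)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: picks INSERT or DELETE per row and groups rows into batches.
class RowBatcher {
    // Assumed row shape: a key, a value, and a flag marking the row for deletion.
    static final class Row {
        final String key;
        final String value;
        final boolean delete;

        Row(String key, String value, boolean delete) {
            this.key = key;
            this.value = value;
            this.delete = delete;
        }
    }

    // A CQL string plus the values that would be bound to its placeholders.
    static final class BoundCql {
        final String cql;
        final List<Object> values;

        BoundCql(String cql, List<Object> values) {
            this.cql = cql;
            this.values = values;
        }
    }

    // Assumed table and columns, for illustration only.
    static final String INSERT_CQL = "INSERT INTO ks.tbl (key, value) VALUES (?, ?)";
    static final String DELETE_CQL = "DELETE FROM ks.tbl WHERE key = ?";

    // Choose the statement for a row: DELETE binds only the key, INSERT binds key and value.
    static BoundCql bind(Row row) {
        if (row.delete) {
            return new BoundCql(DELETE_CQL, Arrays.<Object>asList(row.key));
        }
        return new BoundCql(INSERT_CQL, Arrays.<Object>asList(row.key, row.value));
    }

    // Group bound rows into batches of at most batchSize statements each.
    static List<List<BoundCql>> toBatches(List<Row> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<BoundCql>> batches = new ArrayList<>();
        List<BoundCql> current = new ArrayList<>();
        for (Row row : rows) {
            current.add(bind(row));
            if (current.size() == batchSize) {
                batches.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // flush the final, partially filled batch
        }
        return batches;
    }
}
```

Keeping the batch size as a parameter is deliberate in this sketch: smaller batches are one knob for reducing the write pressure that the daily load puts on the cluster.
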

On Fri, Sep 25, 2015 at 7:22 AM, Ryan Svihla <r...@foundev.pro> wrote:

> Why aren’t you using saveToCassandra (
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md)?
> They have a number of locality aware optimizations that will probably
> exceed your by hand bulk loading (especially if you’re not doing it inside
> something like foreach partition).
>
> Also you can easily tune up and down the size of those tasks and therefore
> batches to minimize harm on the prod system.
>
> On Sep 24, 2015, at 5:37 PM, Benyi Wang <bewang.t...@gmail.com> wrote:
>
> I use Spark and spark-cassandra-connector with a customized Cassandra
> writer (spark-cassandra-connector doesn’t support DELETE). Basically the
> writer works as follows:
>
>    - Bind a row in Spark RDD with either INSERT/Delete PreparedStatement
>    - Create a BatchStatement for multiple rows
>    - Write to Cassandra.
>
> I knew using CQLBulkOutputFormat would be better, but it doesn't supports
> DELETE.
>
>
> On Thu, Sep 24, 2015 at 1:27 PM, Gerard Maas <gerard.m...@gmail.com>
> wrote:
>
>> How are you loading the data? I mean, what insert method are you using?
>>
>> On Thu, Sep 24, 2015 at 9:58 PM, Benyi Wang <bewang.t...@gmail.com>
>> wrote:
>>
>>> I have a cassandra cluster provides data to a web service. And there is
>>> a daily batch load writing data into the cluster.
>>>
>>>    - Without the batch loading, the service’s Latency 99thPercentile is
>>>    3ms. But during the load, it jumps to 90ms.
>>>    - I checked cassandra keyspace’s ReadLatency.99thPercentile, which
>>>    jumps to 1ms from 600 microsec.
>>>    - The service’s cassandra java driver request 99thPercentile was
>>>    90ms during the load
>>>
>>> The java driver took the most time. I knew the Cassandra servers are
>>> busy in writing, but I want to know what kinds of metrics can identify
>>> where is the bottleneck so that I can tune it.
>>>
>>> I’m using Cassandra 2.1.8 and Cassandra Java Driver 2.1.5.
>>>
>>
>
> Regards,
>
> Ryan Svihla