If you're only doing this for Spark, you'll be much better off using Parquet on HDFS or S3. While you *can* do analytics with Cassandra, it's not all that great at it. (Two rough sketches of what that setup could look like follow the quoted message below.)

On Thu, Nov 17, 2016 at 6:05 AM Joe Olson <technol...@nododos.com> wrote:
> I received a grant to do some analysis on netflow data (Local IP address,
> Local Port, Remote IP address, Remote Port, time, # of packets, etc.) using
> Cassandra and Spark. The de-normalized data set is about 13TB out the door.
> I plan on using 9 Cassandra nodes (replication factor = 3) to store the
> data, with Spark doing the aggregation.
>
> The data set will be immutable once loaded, and I am using the replication
> factor = 3 to somewhat simulate the real world. Most of the analysis will
> be of the sort "Give me all the remote IP addresses for source IP 'X'
> between time t1 and t2."
>
> I built and tested a bulk loader following this example on GitHub:
> https://github.com/yukim/cassandra-bulkload-example to generate the
> SSTables, but I have not executed it on the entire data set yet.
>
> Any advice on how to execute the bulk load under this configuration? Can I
> generate the SSTables in parallel? Once generated, can I write the SSTables
> to all nodes simultaneously? Should I be doing any kind of sorting by the
> partition key?
>
> This is a lot of data, so I figured I'd ask before I pulled the trigger.
> Thanks in advance!
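For that access pattern, a date-partitioned Parquet layout gets you most of what a Cassandra partition key would, without the bulk-load machinery. Here's a rough Scala sketch of the one-time conversion. To be clear, the column names (src_ip, dst_ip, event_time, ...), the CSV input format, and the s3a:// paths are all assumptions for illustration, not anything from your loader:

    // One-time conversion of the raw netflow dump to partitioned Parquet.
    // ASSUMPTIONS: CSV input, these column names, and the s3a:// paths are
    // placeholders -- adjust to the real 13TB layout.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, to_date}
    import org.apache.spark.sql.types._

    object NetflowToParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("netflow-to-parquet").getOrCreate()

        // An explicit schema avoids a full inference pass over 13TB of input.
        val schema = StructType(Seq(
          StructField("src_ip",     StringType,    nullable = false),
          StructField("src_port",   IntegerType,   nullable = false),
          StructField("dst_ip",     StringType,    nullable = false),
          StructField("dst_port",   IntegerType,   nullable = false),
          StructField("event_time", TimestampType, nullable = false),
          StructField("packets",    LongType,      nullable = false)
        ))

        spark.read.schema(schema).csv("s3a://your-bucket/netflow/raw/")
          .withColumn("event_date", to_date(col("event_time")))
          // One directory per day: the t1..t2 predicate prunes whole
          // directories before Spark reads a byte.
          .repartition(col("event_date"))
          // Sorting each file by src_ip lets Parquet's row-group min/max
          // stats skip most data for the "source IP = X" predicate.
          .sortWithinPartitions(col("src_ip"), col("event_time"))
          .write
          .partitionBy("event_date")
          .parquet("s3a://your-bucket/netflow/parquet/")

        spark.stop()
      }
    }

This also answers the parallelism question for free: the conversion is just a Spark job, so it runs across all your executors with no SSTable streaming step afterward.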
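And the query itself against that layout, with the same assumed names (the IP and dates here are placeholders):

    // The canonical query: all remote IPs for one source IP in a time window.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object RemoteIpsForSource {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("remote-ips").getOrCreate()

        val flows = spark.read.parquet("s3a://your-bucket/netflow/parquet/")

        flows
          // Partition pruning: only matching event_date directories are read.
          .filter(col("event_date") >= "2016-11-01" && col("event_date") <= "2016-11-07")
          // Row-group skipping via Parquet stats on the sorted src_ip column.
          .filter(col("src_ip") === "192.0.2.17")
          .filter(col("event_time") >= "2016-11-01 00:00:00" &&
                  col("event_time") <  "2016-11-08 00:00:00")
          .select("dst_ip")
          .distinct()
          .show(100, truncate = false)

        spark.stop()
      }
    }

One side note: if simulating production durability is all the replication factor = 3 was buying you, S3 (or HDFS) already replicates the data for you, so you wouldn't need to carry 3x the storage yourself.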