Other people are commenting on the appropriateness of Cassandra – they may have a point you should consider, but I’m going to answer the question.
1) Yes, you can generate the SSTables in parallel.

2) If you use the SSTable bulk loader interface (sstableloader), it'll stream to all of the appropriate nodes. You can run sstableloader from multiple nodes at the same time as well.

3) Sorting by partition key probably won't hurt. If you run jobs in parallel, dividing them up by partition key seems like a good way to parallelize the task. We do something like this in certain parts of our workflow, and it works well. (There's a rough sketch of this approach after the quoted message below.)

From: Joe Olson <technol...@nododos.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, November 17, 2016 at 5:58 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Any Bulk Load on Large Data Set Advice?

I received a grant to do some analysis on netflow data (local IP address, local port, remote IP address, remote port, time, number of packets, etc.) using Cassandra and Spark. The de-normalized data set is about 13TB out the door. I plan on using 9 Cassandra nodes (replication factor = 3) to store the data, with Spark doing the aggregation. The data set will be immutable once loaded, and I am using the replication factor of 3 to somewhat simulate the real world. Most of the analysis will be of the sort "Give me all the remote IP addresses for source IP 'X' between time t1 and t2."

I built and tested a bulk loader to generate the SSTables, following this example on GitHub: https://github.com/yukim/cassandra-bulkload-example. However, I have not executed it on the entire data set yet.

Any advice on how to execute the bulk load under this configuration? Can I generate the SSTables in parallel? Once generated, can I write the SSTables to all nodes simultaneously? Should I be doing any kind of sorting by the partition key?

This is a lot of data, so I figured I'd ask before I pulled the trigger. Thanks in advance!
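
For reference, here's a minimal sketch of points 1 and 3 combined, using the same CQLSSTableWriter API as the linked bulkload example (it needs the cassandra-all jar on the classpath). The netflow.flows schema, the shard count, and the readRowsForShard() reader are hypothetical stand-ins for your actual data; the two things that matter are one writer per thread, since CQLSSTableWriter is not thread-safe, and one keyspace/table output directory per shard.

import org.apache.cassandra.io.sstable.CQLSSTableWriter;

import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelSSTableGen {

    // Hypothetical schema modeled on the query pattern in the question:
    // partition by local IP, cluster by time.
    static final String SCHEMA =
        "CREATE TABLE netflow.flows ("
      + "  local_ip text, event_time timestamp, remote_ip text,"
      + "  local_port int, remote_port int, packets bigint,"
      + "  PRIMARY KEY (local_ip, event_time))";

    static final String INSERT =
        "INSERT INTO netflow.flows "
      + "(local_ip, event_time, remote_ip, local_port, remote_port, packets) "
      + "VALUES (?, ?, ?, ?, ?, ?)";

    public static void main(String[] args) throws Exception {
        int shards = 8; // e.g. one shard per input file or partition-key range
        ExecutorService pool = Executors.newFixedThreadPool(shards);
        for (int i = 0; i < shards; i++) {
            final int shard = i;
            pool.submit(() -> writeShard(shard));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    // One CQLSSTableWriter per thread (the writer is not thread-safe) and
    // one output directory per shard, so each directory can later be handed
    // to sstableloader on its own.
    static void writeShard(int shard) {
        File dir = new File("sstables/shard-" + shard + "/netflow/flows");
        dir.mkdirs();
        try (CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory(dir)
                .forTable(SCHEMA)
                .using(INSERT)
                .build()) {
            // Values must match the CQL types in INSERT
            // (e.g. java.util.Date for the timestamp column).
            for (Object[] row : readRowsForShard(shard)) {
                writer.addRow(row);
            }
        } catch (Exception e) {
            throw new RuntimeException("shard " + shard + " failed", e);
        }
    }

    // Placeholder for reading this shard's slice of the input data.
    static List<Object[]> readRowsForShard(int shard) {
        return List.of();
    }
}

Each shard directory can then be streamed independently, e.g. "sstableloader -d node1,node2,node3 sstables/shard-0/netflow/flows", and as noted in point 2 you can run several of those from different machines at once.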