Other people are commenting on the appropriateness of Cassandra – they may have a point you should consider, but I’m going to answer the question.
1) Yes, you can generate the SSTables in parallel.

2) If you use the SSTable bulk loader interface (sstableloader), it'll stream to all of the appropriate nodes. You can run sstableloader from multiple nodes at the same time as well.

3) Sorting by partition key probably won't hurt. If you run jobs in parallel, dividing them up by partition key seems like a good way to parallelize the task. We do something like this in certain parts of our workflow, and it works well. (There's a rough sketch of this approach after the quoted message below.)

From: Joe Olson <technol...@nododos.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, November 17, 2016 at 5:58 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Any Bulk Load on Large Data Set Advice?

I received a grant to do some analysis on netflow data (local IP address, local port, remote IP address, remote port, time, number of packets, etc.) using Cassandra and Spark. The de-normalized data set is about 13TB out the door. I plan on using 9 Cassandra nodes (replication factor = 3) to store the data, with Spark doing the aggregation. The data set will be immutable once loaded, and I am using the replication factor of 3 to somewhat simulate the real world. Most of the analysis will be of the sort "Give me all the remote IP addresses for source IP 'X' between time t1 and t2."

I built and tested a bulk loader to generate the SSTables, following this example on GitHub: https://github.com/yukim/cassandra-bulkload-example. However, I have not executed it on the entire data set yet.

Any advice on how to execute the bulk load under this configuration? Can I generate the SSTables in parallel? Once generated, can I write the SSTables to all nodes simultaneously? Should I be doing any kind of sorting by the partition key?

This is a lot of data, so I figured I'd ask before I pulled the trigger. Thanks in advance!
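
For reference, here's a minimal sketch of points 1 and 3 combined, using the same CQLSSTableWriter API as the linked bulkload example (it needs the cassandra-all jar on the classpath). The netflow.flows schema, the shard count, and the readRowsForShard() reader are hypothetical stand-ins for your actual data; the two things that matter are one writer per thread, since CQLSSTableWriter is not thread-safe, and one keyspace/table output directory per shard.

import org.apache.cassandra.io.sstable.CQLSSTableWriter;

import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelSSTableGen {

    // Hypothetical schema modeled on the query pattern in the question:
    // partition by local IP, cluster by time.
    static final String SCHEMA =
        "CREATE TABLE netflow.flows ("
      + "  local_ip text, event_time timestamp, remote_ip text,"
      + "  local_port int, remote_port int, packets bigint,"
      + "  PRIMARY KEY (local_ip, event_time))";

    static final String INSERT =
        "INSERT INTO netflow.flows "
      + "(local_ip, event_time, remote_ip, local_port, remote_port, packets) "
      + "VALUES (?, ?, ?, ?, ?, ?)";

    public static void main(String[] args) throws Exception {
        int shards = 8; // e.g. one shard per input file or partition-key range
        ExecutorService pool = Executors.newFixedThreadPool(shards);
        for (int i = 0; i < shards; i++) {
            final int shard = i;
            pool.submit(() -> writeShard(shard));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    // One CQLSSTableWriter per thread (the writer is not thread-safe) and
    // one output directory per shard, so each directory can later be handed
    // to sstableloader on its own.
    static void writeShard(int shard) {
        File dir = new File("sstables/shard-" + shard + "/netflow/flows");
        dir.mkdirs();
        try (CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory(dir)
                .forTable(SCHEMA)
                .using(INSERT)
                .build()) {
            // Values must match the CQL types in INSERT
            // (e.g. java.util.Date for the timestamp column).
            for (Object[] row : readRowsForShard(shard)) {
                writer.addRow(row);
            }
        } catch (Exception e) {
            throw new RuntimeException("shard " + shard + " failed", e);
        }
    }

    // Placeholder for reading this shard's slice of the input data.
    static List<Object[]> readRowsForShard(int shard) {
        return List.of();
    }
}

Each shard directory can then be streamed independently, e.g. "sstableloader -d node1,node2,node3 sstables/shard-0/netflow/flows", and as noted in point 2 you can run several of those from different machines at once.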