If you're only doing this for Spark, you'll be much better off using Parquet on HDFS or S3. While you *can* do analytics with Cassandra, it's not all that great at it. (Two rough sketches of what that setup could look like follow the quoted message below.)

On Thu, Nov 17, 2016 at 6:05 AM Joe Olson <technol...@nododos.com> wrote:
> I received a grant to do some analysis on netflow data (Local IP address,
> Local Port, Remote IP address, Remote Port, time, # of packets, etc.) using
> Cassandra and Spark. The de-normalized data set is about 13TB out the door.
> I plan on using 9 Cassandra nodes (replication factor = 3) to store the
> data, with Spark doing the aggregation.
>
> The data set will be immutable once loaded, and I am using the replication
> factor = 3 to somewhat simulate the real world. Most of the analysis will
> be of the sort "Give me all the remote IP addresses for source IP 'X'
> between time t1 and t2."
>
> I built and tested a bulk loader following this example on GitHub:
> https://github.com/yukim/cassandra-bulkload-example to generate the
> SSTables, but I have not executed it on the entire data set yet.
>
> Any advice on how to execute the bulk load under this configuration? Can I
> generate the SSTables in parallel? Once generated, can I write the SSTables
> to all nodes simultaneously? Should I be doing any kind of sorting by the
> partition key?
>
> This is a lot of data, so I figured I'd ask before I pulled the trigger.
> Thanks in advance!
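For that access pattern, a date-partitioned Parquet layout gets you most of what a Cassandra partition key would, without the bulk-load machinery. Here's a rough Scala sketch of the one-time conversion. To be clear, the column names (src_ip, dst_ip, event_time, ...), the CSV input format, and the s3a:// paths are all assumptions for illustration, not anything from your loader:

    // One-time conversion of the raw netflow dump to partitioned Parquet.
    // ASSUMPTIONS: CSV input, these column names, and the s3a:// paths are
    // placeholders -- adjust to the real 13TB layout.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, to_date}
    import org.apache.spark.sql.types._

    object NetflowToParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("netflow-to-parquet").getOrCreate()

        // An explicit schema avoids a full inference pass over 13TB of input.
        val schema = StructType(Seq(
          StructField("src_ip",     StringType,    nullable = false),
          StructField("src_port",   IntegerType,   nullable = false),
          StructField("dst_ip",     StringType,    nullable = false),
          StructField("dst_port",   IntegerType,   nullable = false),
          StructField("event_time", TimestampType, nullable = false),
          StructField("packets",    LongType,      nullable = false)
        ))

        spark.read.schema(schema).csv("s3a://your-bucket/netflow/raw/")
          .withColumn("event_date", to_date(col("event_time")))
          // One directory per day: the t1..t2 predicate prunes whole
          // directories before Spark reads a byte.
          .repartition(col("event_date"))
          // Sorting each file by src_ip lets Parquet's row-group min/max
          // stats skip most data for the "source IP = X" predicate.
          .sortWithinPartitions(col("src_ip"), col("event_time"))
          .write
          .partitionBy("event_date")
          .parquet("s3a://your-bucket/netflow/parquet/")

        spark.stop()
      }
    }

This also answers the parallelism question for free: the conversion is just a Spark job, so it runs across all your executors with no SSTable streaming step afterward.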
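And the query itself against that layout, with the same assumed names (the IP and dates here are placeholders):

    // The canonical query: all remote IPs for one source IP in a time window.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object RemoteIpsForSource {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("remote-ips").getOrCreate()

        val flows = spark.read.parquet("s3a://your-bucket/netflow/parquet/")

        flows
          // Partition pruning: only matching event_date directories are read.
          .filter(col("event_date") >= "2016-11-01" && col("event_date") <= "2016-11-07")
          // Row-group skipping via Parquet stats on the sorted src_ip column.
          .filter(col("src_ip") === "192.0.2.17")
          .filter(col("event_time") >= "2016-11-01 00:00:00" &&
                  col("event_time") <  "2016-11-08 00:00:00")
          .select("dst_ip")
          .distinct()
          .show(100, truncate = false)

        spark.stop()
      }
    }

One side note: if simulating production durability is all the replication factor = 3 was buying you, S3 (or HDFS) already replicates the data for you, so you wouldn't need to carry 3x the storage yourself.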