On Mon, Dec 1, 2014 at 12:10 PM, Dong Dai <daidon...@gmail.com> wrote:
> I guess you mean that BulkLoader is done by streaming whole SSTables to
> remote servers, so it is faster?

Well, it's not exactly the "whole SSTable", but yes, that's the sort of
statement I'm making. [1]

> The documentation says that all the rows in the SSTable will be inserted
> into the new cluster conforming to the replication strategy of that
> cluster. This gives me a feeling that the BulkLoader was done by calling
> insertion after being transmitted to coordinators.

A good slide deck from pgorla, here:

http://www.slideshare.net/DataStax/bulk-loading-data-into-cassandra

General background:

http://www.palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra

But briefly, no. It uses the streaming interface, not the client
interface. The streaming interface avoids the whole commitlog/memtable
write path.

> I have this question because I tried batch insertion. It is so fast that
> it makes me think that BulkLoader cannot beat it.

Turn off writes to the commitlog with durable_writes:false and you can
simulate how much faster a client write would be without the double-write
to the commitlog. That said, while the double-write to the commitlog is
one of the most significant overheads of a write from the client, it is
far from the only one.

=Rob

[1] http://www.datastax.com/dev/blog/streaming-in-cassandra-2-0
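
P.S. In case it helps anyone reproduce the comparison, here's a minimal
sketch of both knobs discussed above. The keyspace/table names, path, and
host list are made up; adjust for your cluster. Disabling durable_writes
is for benchmarking only, since you give up the commitlog's crash safety.

    -- CQL: create a keyspace whose writes skip the commitlog
    CREATE KEYSPACE loadtest
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
      AND durable_writes = false;

    -- or flip the setting on an existing keyspace:
    ALTER KEYSPACE loadtest WITH durable_writes = false;

    # shell: stream SSTables directly to the cluster over the streaming
    # interface; the last two path components name the keyspace and table
    sstableloader -d 10.0.0.1,10.0.0.2 /path/to/sstables/loadtest/mytable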