So there is a bit of a misunderstanding about the role of the coordinator
in all this. If you use an UNLOGGED BATCH and all of those writes share
the same partition key, then yes, it's a savings and the batch acts as one
mutation. If they don't, however, you're asking the coordinator node to do
work the client could do, and you're potentially adding an extra network
hop for several of those writes whenever the coordinator does not happen
to own the partition key (assuming your client driver is token aware, as
recent versions of the DataStax Java Driver are). This says nothing of
heap pressure: the measurable effect of large batches on node performance
is, in practice, a problem in production clusters.
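
To make the same-partition case concrete, here is a minimal sketch with
the DataStax Java Driver 2.x (the table, column, and type names are
illustrative assumptions, not from this thread):

    import java.util.Date;
    import java.util.List;
    import com.datastax.driver.core.BatchStatement;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    // Sketch: an UNLOGGED BATCH where every statement targets the SAME
    // partition key (sensor_id), so the coordinator can apply it as a
    // single mutation. The Reading type is an assumed placeholder.
    class Reading { Date ts; double value; }

    void writeOnePartition(Session session, String sensorId,
                           List<Reading> readings) {
        PreparedStatement ps = session.prepare(
            "INSERT INTO events (sensor_id, ts, reading) VALUES (?, ?, ?)");
        BatchStatement batch =
            new BatchStatement(BatchStatement.Type.UNLOGGED);
        for (Reading r : readings) {
            batch.add(ps.bind(sensorId, r.ts, r.value)); // same key each time
        }
        session.execute(batch); // one round trip, one partition's mutation
    }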

I frequently have had to switch people off using BATCH for bulk-loading
style processes, and in _every_ single case it's been faster to use
executeAsync, not to mention the cluster was healthier as a result.
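
The pattern I move people to looks roughly like the sketch below (against
the Java Driver 2.x; the cap of 128 in-flight writes and the Object[] row
shape are illustrative assumptions to tune for your cluster):

    import java.util.List;
    import java.util.concurrent.Semaphore;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;

    // Sketch: individual async INSERTs, with a semaphore bounding the
    // number of in-flight requests so the client doesn't overrun the
    // cluster. Token awareness routes each write to an owning replica.
    void load(Session session, PreparedStatement ps, List<Object[]> rows)
            throws InterruptedException {
        final Semaphore inFlight = new Semaphore(128); // illustrative cap
        for (Object[] row : rows) {
            inFlight.acquire(); // block while 128 writes are outstanding
            ResultSetFuture f = session.executeAsync(ps.bind(row));
            Futures.addCallback(f, new FutureCallback<ResultSet>() {
                public void onSuccess(ResultSet rs) { inFlight.release(); }
                public void onFailure(Throwable t) {
                    inFlight.release(); // log and retry in real code
                }
            });
        }
    }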

As for the sstableloader options: since they all use the streaming
protocol, and as of today the streaming protocol streams one copy to each
remote node, they tend to be slower than even executeAsync in
multi-data-center scenarios (though in a single data center they're the
faster option; that said, the executeAsync approach is often fast enough).
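
For completeness, generating SSTables for the loader, as Dong did in his
test, looks roughly like this minimal sketch using Cassandra's
CQLSSTableWriter (the keyspace, table, and output directory are
illustrative; some versions also require setting a partitioner on the
builder):

    import java.util.Date;
    import org.apache.cassandra.io.sstable.CQLSSTableWriter;

    // Sketch: write rows to SSTable files client-side, then stream them
    // in with org.apache.cassandra.tools.BulkLoader (bin/sstableloader).
    // The directory must exist and follow the keyspace/table layout.
    void writeSSTables() throws Exception {
        String schema = "CREATE TABLE ks.events (sensor_id text, "
                      + "ts timestamp, reading double, "
                      + "PRIMARY KEY (sensor_id, ts))";
        String insert = "INSERT INTO ks.events (sensor_id, ts, reading) "
                      + "VALUES (?, ?, ?)";
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/ks/events")
                .forTable(schema)
                .using(insert)
                .build();
        writer.addRow("sensor-1", new Date(), 42.0);
        writer.close();
    }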

This is all covered in a blog post
https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-keyword-40f00e35e23e
and the DataStax CQL docs also note that BATCH is not a performance
optimization:
http://www.datastax.com/documentation/cql/3.1/cql/cql_using/useBatch.html

In summary, the only way an UNLOGGED BATCH is a performance improvement
over using async with the driver is if the batches stay within a
reasonable size and all of their writes go to the same partition.

On Mon, Dec 1, 2014 at 9:43 AM, Dong Dai <daidon...@gmail.com> wrote:

> Thanks a lot for the reply, Raj,
>
> I understand they are different. But if we define a BATCH as UNLOGGED,
> it will not guarantee an atomic transaction and becomes more like a data
> import tool. To my knowledge, a BATCH statement packs several mutations
> into one RPC to save time. Similarly, the Bulk Loader packs all the
> mutations into an SSTable file and (I think) may be able to save a lot
> of time too.
>
> I am curious: on the coordinator server, are Batch Insert and Bulk
> Loader similar things? I mean, are they implemented in a similar way?
>
> P.S. I tried randomly inserting 1000 rows into a simple table on my
> laptop as a test. Sync insert takes almost 2s to finish, but a sync
> batch insert only takes about 900ms. That is a huge performance
> improvement; I wonder, is this expected?
>
> Also, I used CQLSSTableWriter to put these 1000 insertions into a single
> SSTable file; it took around 2s to finish on my laptop. That seems
> pretty slow.
>
> thanks!
> - Dong
>
> > On Dec 1, 2014, at 2:33 AM, Rajanarayanan Thottuvaikkatumana <
> rnambood...@gmail.com> wrote:
> >
> > The BATCH statement and Bulk Load are totally different things. The
> > BATCH statement belongs in the atomic transaction space and provides a
> > way to combine more than one statement into an atomic unit, while the
> > bulk loader provides the ability to bulk load external data into a
> > cluster. The two are totally different things and cannot be compared.
> >
> > Thanks
> > -Raj
> >
> > On 01-Dec-2014, at 4:32 am, Dong Dai <daidon...@gmail.com> wrote:
> >
> >> Hi, all,
> >>
> >> I have a performance question about the batch insert and bulk load.
> >>
> >> According to the documentation, to import a large volume of data into
> >> Cassandra, Batch Insert and Bulk Load can both be options. Using batch
> >> insert is pretty straightforward, but there has not been an ‘official’
> >> way to use Bulk Load to import the data (in this case, I mean data
> >> that was generated online).
> >>
> >> So, I am thinking clients first use CQLSSTableWriter to create the
> >> SSTable files, then use “org.apache.cassandra.tools.BulkLoader” to
> >> import these SSTables into Cassandra directly.
> >>
> >> The question is: can I expect better performance using the BulkLoader
> >> this way compared with using Batch insert?
> >>
> >> I am not so familiar with the implementation of Bulk Load, but I do
> >> see a huge performance improvement using Batch Insert. I really want
> >> to know the upper limits of the write performance. Any comment will
> >> be helpful. Thanks!
> >>
> >> - Dong
> >>
> >
>
>


-- 


Ryan Svihla

Solution Architect



