Guys, please move this discussion to the users mailing list. This one is for Cassandra committers and other contributors to discuss the development of Cassandra itself.
-- AY

> On Dec 2, 2014, at 16:17, Ryan Svihla <rsvi...@datastax.com> wrote:
>
> I misspoke:
>
> "That's all correct but what you're not accounting for is if you use a
> token aware client then the coordinator will likely not own all the data
> in a batch"
>
> should just be
>
> "That's all correct but what you're not accounting for is the coordinator
> will likely not own all the data in a batch"
>
> Token awareness has no effect on that fact.
>
>> On Tue, Dec 2, 2014 at 9:13 AM, Ryan Svihla <rsvi...@datastax.com> wrote:
>>
>>> On Mon, Dec 1, 2014 at 1:52 PM, Dong Dai <daidon...@gmail.com> wrote:
>>>
>>> Thanks Ryan, and also thanks for your great blog post.
>>>
>>> However, this makes me more confused, mainly about the coordinators.
>>> Based on my understanding, whether it is a batch insert, an ordinary
>>> sync insert, or an async insert, the coordinator is selected only once
>>> for the whole session, by calling cluster.connect(), and after that all
>>> inserts go through that coordinator.
>>
>> That's all correct, but what you're not accounting for is that the
>> coordinator will likely not own all the data in a batch, ESPECIALLY as
>> you scale up to more nodes. If you are using executeAsync and a single
>> row, then the coordinator node will always be an owner of the data,
>> thereby minimizing network hops. Some people stop me at this point and
>> say "but the client is making those hops!", and that's when I point out
>> "what do you think the coordinator has to do": only you've introduced
>> something in the middle and prevented token awareness from doing its
>> job. The savings in latency are particularly huge if you write at more
>> than consistency level ONE.
>>
>>> If this is not the case, and the clients do more work, like
>>> distributing each insert to a different coordinator based on its
>>> partition key, then it is understandable that a large volume of
>>> UNLOGGED BATCH would cause a bottleneck on the coordinator server.
>>> However, this should not be hard to solve by distributing the inserts
>>> in one batch to different coordinators based on their partition keys.
>>> I am curious why this is not supported.
>>
>> The coordinator node does this today, of course, but that is the very
>> bottleneck to which you refer. To make what you're proposing work, you'd
>> have to enhance the CLIENT to make sure that all the objects in that
>> batch were actually owned by the coordinator itself, and if you're
>> talking about parsing a CQL BATCH on the client and splitting it out to
>> the appropriate nodes in some sort of hyper token awareness, then you're
>> taking a server-side responsibility (CQL parsing) and moving it to the
>> client. Worse, you're asking for a number of bugs to occur by moving CQL
>> parsing to the client: do all clients handle this the same way? What
>> happens to older Thrift clients with batches? Etc.
>>
>> Final point: every time you do a batch, you're adding extra load on the
>> heap of the coordinator node that could instead be on the client. This
>> cannot be stated strongly enough. In production, doing large batches
>> (say over 5k statements) is a wonderful way to make your node spend a
>> lot of its time handling batches and the overhead of that process.
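[Editor's note: to make the executeAsync pattern discussed above concrete, here is a minimal sketch using the DataStax Java Driver 2.x with a token-aware load balancing policy. The contact point, "demo" keyspace, and users table are hypothetical stand-ins, not anything from this thread.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class AsyncInsertExample {
    public static void main(String[] args) {
        // Token awareness routes each request to a replica that owns the
        // partition key, so most writes skip the extra coordinator hop.
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withLoadBalancingPolicy(
                        new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
                .build();
        Session session = cluster.connect("demo");

        PreparedStatement insert = session.prepare(
                "INSERT INTO users (id, name) VALUES (?, ?)");

        // Fire the writes asynchronously and keep the futures; forgetting
        // to store them is the classic mistake that makes async look
        // slower than batch (see Ryan's PS below).
        List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
        for (int i = 0; i < 1000; i++) {
            BoundStatement bound = insert.bind(UUID.randomUUID(), "user-" + i);
            futures.add(session.executeAsync(bound));
        }

        // Wait for everything to complete and surface any errors.
        for (ResultSetFuture f : futures) {
            f.getUninterruptibly();
        }

        cluster.close();
    }
}
```

One caveat: this sketch buffers an unbounded list of futures; real loading code would typically cap the number of in-flight requests, for example with a java.util.concurrent.Semaphore.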
>>> P.S. I have tested asynchronous inserts; probably because my dataset
>>> is small, batch inserts always perform much better than async inserts.
>>> Do you have a general idea of how large the dataset should be to
>>> reverse this performance comparison?
>>
>> You could be in a situation where the node owns all the data and so can
>> respond quickly, so it's hard to say. As the cluster scales, however,
>> there is no way a given node will own everything in the batch unless
>> you've designed it to be that way, either by some token-aware batch
>> generation in the client or by only batching on the same partition key
>> (a strategy covered in that blog post).
>>
>> PS: Every time I've had a customer tell me batch is faster than async,
>> it's been a code problem, such as not storing futures for later, or in
>> Python not using libev; in all cases I've gotten at least a 2x speedup
>> and often far more.
>>
>>> - Dong
>>>
>>>> On Dec 1, 2014, at 9:57 AM, Ryan Svihla <rsvi...@datastax.com> wrote:
>>>>
>>>> So there is a bit of a misunderstanding about the role of the
>>>> coordinator in all this. If you use an UNLOGGED BATCH and all of
>>>> those writes are in the same partition key, then yes, it's a savings
>>>> and acts as one mutation. If they're not, however, you're asking the
>>>> coordinator node to do work the client could do, and you're
>>>> potentially adding an extra network hop on several of those
>>>> transactions if the coordinator node does not happen to own that
>>>> partition key (assuming your client driver is using token awareness,
>>>> as recent versions of the DataStax Java Driver do). This also says
>>>> nothing of heap pressure; the measurable effect of large batches on
>>>> node performance is in practice a problem in production clusters.
>>>>
>>>> I frequently have had to switch people off using BATCH for bulk
>>>> loading style processes, and in _every_ single case it's been faster
>>>> to use executeAsync, not to mention the cluster was healthier as a
>>>> result.
>>>>
>>>> As for the sstableloader options: since they all use the streaming
>>>> protocol, and as of today the streaming protocol streams one copy to
>>>> each remote node, they tend to be slower than even executeAsync in
>>>> multi-data-center scenarios (though in a single data center they are
>>>> the faster option; that said, the executeAsync approach is often fast
>>>> enough).
>>>>
>>>> This is all covered in a blog post
>>>> https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-keyword-40f00e35e23e
>>>> and the DataStax CQL docs also note that BATCH is not a performance
>>>> optimization:
>>>> http://www.datastax.com/documentation/cql/3.1/cql/cql_using/useBatch.html
>>>>
>>>> In summary, the only way an UNLOGGED BATCH is a performance
>>>> improvement over using async with the driver is if it is within a
>>>> certain reasonable size and everything in it goes to the same
>>>> partition.
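[Editor's note: as a sketch of the one case where Ryan says UNLOGGED BATCH does help, the following assumes a hypothetical time-series table demo.events with sensor_id as the partition key; every statement in the batch targets the same partition, so the batch collapses into a single mutation on one replica set.]

```java
import java.util.Date;

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class SamePartitionBatchExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .build();
        Session session = cluster.connect("demo");

        PreparedStatement insert = session.prepare(
                "INSERT INTO events (sensor_id, ts, value) VALUES (?, ?, ?)");

        // Every statement shares the partition key 'sensor-42', so the
        // whole batch is applied as one mutation; this is the case where
        // UNLOGGED BATCH is actually a savings rather than coordinator load.
        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        for (int i = 0; i < 100; i++) {
            batch.add(insert.bind("sensor-42", new Date(), (double) i));
        }
        session.execute(batch);

        cluster.close();
    }
}
```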
>>>>> On Mon, Dec 1, 2014 at 9:43 AM, Dong Dai <daidon...@gmail.com> wrote:
>>>>>
>>>>> Thanks a lot for the reply, Raj.
>>>>>
>>>>> I understand they are different. But if we define a batch as
>>>>> UNLOGGED, it no longer guarantees an atomic transaction and becomes
>>>>> more like a data import tool. To my knowledge, a BATCH statement
>>>>> packs several mutations into one RPC to save time. Similarly, the
>>>>> bulk loader packs all the mutations into an SSTable file and (I
>>>>> think) may be able to save a lot of time too.
>>>>>
>>>>> I am interested in whether, on the coordinator server, batch insert
>>>>> and bulk load are similar things. I mean, are they implemented in a
>>>>> similar way?
>>>>>
>>>>> P.S. I tried randomly inserting 1000 rows into a simple table on my
>>>>> laptop as a test. Sync inserts take almost 2s to finish, but a sync
>>>>> batch insert takes only about 900ms. That is a huge performance
>>>>> improvement; I wonder, is this expected?
>>>>>
>>>>> Also, I used CQLSSTableWriter to put these 1000 inserts into a
>>>>> single SSTable file; it took around 2s on my laptop, which seems
>>>>> pretty slow.
>>>>>
>>>>> thanks!
>>>>> - Dong
>>>>>
>>>>>> On Dec 1, 2014, at 2:33 AM, Rajanarayanan Thottuvaikkatumana
>>>>>> <rnambood...@gmail.com> wrote:
>>>>>>
>>>>>> The BATCH statement and bulk load are totally different things. The
>>>>>> BATCH statement belongs in the atomic transaction space: it
>>>>>> provides a way to make more than one statement into an atomic unit,
>>>>>> while the bulk loader provides the ability to load external data
>>>>>> into a cluster. The two are totally different and cannot be
>>>>>> compared.
>>>>>>
>>>>>> Thanks
>>>>>> -Raj
>>>>>>
>>>>>>> On 01-Dec-2014, at 4:32 am, Dong Dai <daidon...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi, all,
>>>>>>>
>>>>>>> I have a performance question about batch insert and bulk load.
>>>>>>>
>>>>>>> According to the documentation, to import a large volume of data
>>>>>>> into Cassandra, batch insert and bulk load can both be an option.
>>>>>>> Using batch insert is pretty straightforward, but there has not
>>>>>>> been an 'official' way to use bulk load to import data (in this
>>>>>>> case, I mean data that was generated online).
>>>>>>>
>>>>>>> So I am thinking that clients first use CQLSSTableWriter to create
>>>>>>> the SSTable files, then use "org.apache.cassandra.tools.BulkLoader"
>>>>>>> to import these SSTables into Cassandra directly.
>>>>>>>
>>>>>>> The question is: can I expect better performance using the
>>>>>>> BulkLoader this way compared with using batch inserts?
>>>>>>>
>>>>>>> I am not so familiar with the implementation of bulk load, but I
>>>>>>> do see a huge performance improvement using batch insert, and I
>>>>>>> really want to know the upper limits of write performance. Any
>>>>>>> comment will be helpful. Thanks!
>>>>>>>
>>>>>>> - Dong
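[Editor's note: a minimal sketch of the CQLSSTableWriter approach Dong describes, assuming Cassandra 2.x with the default Murmur3 partitioner; the demo.users schema and output directory are hypothetical. The output directory must already exist before the writer is built.]

```java
import java.util.UUID;

import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public class SSTableWriterExample {
    public static void main(String[] args) throws Exception {
        // The schema must be a fully qualified CREATE TABLE statement.
        String schema = "CREATE TABLE demo.users (id uuid PRIMARY KEY, name text)";
        String insert = "INSERT INTO demo.users (id, name) VALUES (?, ?)";

        // sstableloader expects the <keyspace>/<table> directory layout,
        // so the SSTables are written under /tmp/demo/users here.
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/demo/users")
                .forTable(schema)
                .using(insert)
                .build();

        for (int i = 0; i < 1000; i++) {
            writer.addRow(UUID.randomUUID(), "user-" + i);
        }
        writer.close();
    }
}
```

The resulting files could then be streamed into a running cluster with the bulk loader CLI, e.g. `sstableloader -d 127.0.0.1 /tmp/demo/users`, which infers the keyspace and table from the last two path components.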