Re: batch_size_warn_threshold_in_kb

Eric Stevens Sat, 13 Dec 2014 09:42:39 -0800

Here is my test code; this was written as disposable code, so it's not
especially well documented, and it includes some chunks copied from
elsewhere in our stack, but hopefully it's readable.
https://gist.github.com/MightyE/1c98912fca104f6138fc


Here's some test runs after I reduced RF to 1, to introduce the effects of
proxying.  In the interests of total run time, I'm only running 10,000
records per run this time (but still 25 runs).  This is actually a bigger
percentage difference between single vs batch than the results I got
yesterday (500% difference between strategies with RF=3 , 800% difference
between strategies with RF=1).


==== Execution Results for 25 runs of 10000 records =============
25 runs of 10,000 records (3 protos, 5 agents, ~15 per bucket) as *single
statements*
Total Run Time
        futures test1 ((aid, bckt), proto, end) reverse order         =
8,336,113,981
        scatter test5 ((aid, bckt, end))                              =
8,434,901,305
        scatter test2 ((aid, bckt), end)                              =
8,464,319,637
        futures test3 ((aid, bckt), end, proto) reverse order         =
8,638,385,133
        futures test2 ((aid, bckt), end)                              =
8,684,263,854
        scatter test4 ((aid, bckt), proto, end) no explicit ordering  =
8,708,544,870
        futures test4 ((aid, bckt), proto, end) no explicit ordering  =
8,861,009,724
        futures test5 ((aid, bckt, end))                              =
8,868,274,674
        scatter test3 ((aid, bckt), end, proto) reverse order         =
8,956,848,500
        scatter test1 ((aid, bckt), proto, end) reverse order         =
9,124,160,168
        parallel test2 ((aid, bckt), end)                             =
123,400,905,337
        parallel test4 ((aid, bckt), proto, end) no explicit ordering =
127,282,210,321
        parallel test1 ((aid, bckt), proto, end) reverse order        =
128,039,113,464
        parallel test5 ((aid, bckt, end))                             =
130,788,491,325
        parallel test3 ((aid, bckt), end, proto) reverse order        =
130,795,365,099
Fastest Run
        futures test1 ((aid, bckt), proto, end) reverse order         =
249,455,814
        futures test3 ((aid, bckt), end, proto) reverse order         =
252,083,763
        scatter test2 ((aid, bckt), end)                              =
267,123,409
        scatter test3 ((aid, bckt), end, proto) reverse order         =
268,881,765
        futures test2 ((aid, bckt), end)                              =
269,801,288
        scatter test1 ((aid, bckt), proto, end) reverse order         =
271,331,470
        futures test4 ((aid, bckt), proto, end) no explicit ordering  =
277,473,727
        scatter test5 ((aid, bckt, end))                              =
279,758,721
        scatter test4 ((aid, bckt), proto, end) no explicit ordering  =
289,355,345
        futures test5 ((aid, bckt, end))                              =
292,192,770
        parallel test4 ((aid, bckt), proto, end) no explicit ordering =
4,457,096,575
        parallel test2 ((aid, bckt), end)                             =
4,481,373,458
        parallel test5 ((aid, bckt, end))                             =
4,488,146,940
        parallel test3 ((aid, bckt), end, proto) reverse order        =
4,514,067,019
        parallel test1 ((aid, bckt), proto, end) reverse order        =
4,559,735,697
Slowest Run
        scatter test2 ((aid, bckt), end)                              =
448,123,797
        futures test2 ((aid, bckt), end)                              =
451,813,924
        futures test1 ((aid, bckt), proto, end) reverse order         =
470,783,312
        scatter test4 ((aid, bckt), proto, end) no explicit ordering  =
475,451,703
        futures test4 ((aid, bckt), proto, end) no explicit ordering  =
475,549,535
        scatter test5 ((aid, bckt, end))                              =
488,438,970
        futures test5 ((aid, bckt, end))                              =
512,183,494
        futures test3 ((aid, bckt), end, proto) reverse order         =
550,359,667
        scatter test3 ((aid, bckt), end, proto) reverse order         =
552,363,684
        scatter test1 ((aid, bckt), proto, end) reverse order         =
891,138,017
        parallel test2 ((aid, bckt), end)                             =
5,934,352,510
        parallel test4 ((aid, bckt), proto, end) no explicit ordering =
5,956,012,555
        parallel test3 ((aid, bckt), end, proto) reverse order        =
6,425,883,376
        parallel test1 ((aid, bckt), proto, end) reverse order        =
7,171,289,389
        parallel test5 ((aid, bckt, end))                             =
7,435,430,002

==== Execution Results for 25 runs of 10000 records =============
25 runs of 10,000 records (3 protos, 5 agents, ~15 per bucket) in *batches
of 100*
Total Run Time
        futures test1 ((aid, bckt), proto, end) reverse order         =
1,052,817,240
        futures test3 ((aid, bckt), end, proto) reverse order         =
1,172,170,041
        scatter test3 ((aid, bckt), end, proto) reverse order         =
1,288,773,642
        futures test2 ((aid, bckt), end)                              =
1,349,688,669
        scatter test4 ((aid, bckt), proto, end) no explicit ordering  =
1,540,364,041
        scatter test5 ((aid, bckt, end))                              =
1,726,854,978
        futures test5 ((aid, bckt, end))                              =
1,937,696,565
        scatter test2 ((aid, bckt), end)                              =
1,977,232,999
        futures test4 ((aid, bckt), proto, end) no explicit ordering  =
2,025,161,088
        scatter test1 ((aid, bckt), proto, end) reverse order         =
2,219,169,805
        parallel test2 ((aid, bckt), end)                             =
4,046,119,649
        parallel test3 ((aid, bckt), end, proto) reverse order        =
4,125,620,234
        parallel test4 ((aid, bckt), proto, end) no explicit ordering =
4,216,984,290
        parallel test1 ((aid, bckt), proto, end) reverse order        =
5,168,760,200
        parallel test5 ((aid, bckt, end))                             =
6,523,329,362
Fastest Run
        scatter test4 ((aid, bckt), proto, end) no explicit ordering  =
28,186,746
        scatter test1 ((aid, bckt), proto, end) reverse order         =
28,669,691
        scatter test3 ((aid, bckt), end, proto) reverse order         =
28,759,003
        futures test1 ((aid, bckt), proto, end) reverse order         =
28,779,728
        futures test3 ((aid, bckt), end, proto) reverse order         =
28,965,161
        scatter test2 ((aid, bckt), end)                              =
29,538,146
        futures test2 ((aid, bckt), end)                              =
29,698,726
        futures test4 ((aid, bckt), proto, end) no explicit ordering  =
29,985,012
        futures test5 ((aid, bckt, end))                              =
45,124,539
        scatter test5 ((aid, bckt, end))                              =
45,457,721
        parallel test1 ((aid, bckt), proto, end) reverse order        =
117,402,203
        parallel test3 ((aid, bckt), end, proto) reverse order        =
118,801,014
        parallel test4 ((aid, bckt), proto, end) no explicit ordering =
120,824,212
        parallel test2 ((aid, bckt), end)                             =
123,429,148
        parallel test5 ((aid, bckt, end))                             =
150,506,937
Slowest Run
        futures test1 ((aid, bckt), proto, end) reverse order         =
123,977,889
        scatter test5 ((aid, bckt, end))                              =
132,461,653
        scatter test3 ((aid, bckt), end, proto) reverse order         =
162,124,982
        futures test5 ((aid, bckt, end))                              =
169,519,567
        scatter test4 ((aid, bckt), proto, end) no explicit ordering  =
179,664,998
        futures test3 ((aid, bckt), end, proto) reverse order         =
194,697,734
        futures test2 ((aid, bckt), end)                              =
196,393,591
        parallel test3 ((aid, bckt), end, proto) reverse order        =
283,316,486
        parallel test4 ((aid, bckt), proto, end) no explicit ordering =
297,212,641
        parallel test2 ((aid, bckt), end)                             =
313,276,348
        futures test4 ((aid, bckt), proto, end) no explicit ordering  =
575,294,961
        parallel test1 ((aid, bckt), proto, end) reverse order        =
630,158,596
        scatter test1 ((aid, bckt), proto, end) reverse order         =
694,183,510
        parallel test5 ((aid, bckt, end))                             =
740,323,929
        scatter test2 ((aid, bckt), end)                              =
833,222,720



On Sat, Dec 13, 2014 at 9:44 AM, Eric Stevens <migh...@gmail.com> wrote:

> You can seen what the partition key strategies are for each of the tables,
> test5 shows the least improvement.  The set (aid, end) should be unique,
> and bckt is derived from end.  Some of these layouts result in clustering
> on the same partition keys, that's actually tunable with the "~15 per
> bucket" reported (exact number of entries per bucket will vary but should
> have a mean of 15 in that run - it's an input parameter to my tests).
>  "test5" obviously ends up being exclusively unique partitions for each
> record.
>
> Your points about:
> 1) Failed batches having a higher cost than failed single statements
> 2) In my test, every node was a replica for all data.
>
> These are both very good points.
>
> For #1, since the worst case scenario is nearly twice fast in batches as
> its single statement equivalent, in terms of impact on the client, you'd
> have to be retrying half your batches before you broke even there (but of
> course those retries are not free to the cluster, so you probably make the
> performance tipping point approach a lot faster).  This alone may be cause
> to justify avoiding batches, or at least severely limiting their size (hey,
> that's what this discussion is about!).
>
> For #2, that's certainly a good point, for this test cluster, I should at
> least re-run with RF=1 so that proxying times start to matter.  If you're
> not using a token aware client or not using a token aware policy for
> whatever reason, this should even out though, no?  Each node will end up
> coordinating 1/(nodecount-rf+1) mutations, regardless of whether they are
> batched or single statements.  The DS driver is very careful to caution
> that the topology map it maintains makes no guarantees on freshness, so you
> may see a significant performance penalty in your client when the topology
> changes if you're depending on token aware routing as part of your
> performance requirements.
>
>
> I'm curious what your thoughts are on grouping statements by primary
> replica according to the routing policy, and executing unlogged batches
> that way (so that for token aware routing, all statements are executed on a
> replica, for others it'd make no difference).  Retries are still more
> expensive, but token aware proxying avoidance is still had.  It's pretty
> easy to do in Scala:
>
>   def groupByFirstReplica(statements: Iterable[Statement])(implicit
> session: Session): Map[Host, Seq[Statement]] = {
>     val meta = session.getCluster.getMetadata
>     statements.groupBy { st =>
>       meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
>     }
>   }
>   val result =
> Future.traverse(groupByFirstReplica(statements).values).map(st =>
> newBatch(st).executeAsync())
>
>
> Let me get together my test code, it depends on some existing utilities we
> use elsewhere, such as implicit conversions between Google and Scala native
> futures.  I'll try to put this together in a format that's runnable for you
> in a Scala REPL console without having to resolve our internal
> dependencies.  This may not be today though.
>
> Also, @Ryan, I don't think that shuffling would make a difference for my
> above tests since as Jon observed, all my nodes were already replicas there.
>
>
> On Sat, Dec 13, 2014 at 7:37 AM, Ryan Svihla <rsvi...@datastax.com> wrote:
>
>> Also..what happens when you turn on shuffle with token aware?
>> http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/policies/TokenAwarePolicy.html
>>
>> On Sat, Dec 13, 2014 at 8:21 AM, Jonathan Haddad <j...@jonhaddad.com>
>> wrote:
>>>
>>> To add to Ryan's (extremely valid!) point, your test works because the
>>> coordinator is always a replica.  Try again using 20 (or 50) nodes.
>>> Batching works great at RF=N=3 because it always gets to write to local and
>>> talk to exactly 2 other servers on every request.  Consider what happens
>>> when the coordinator needs to talk to 100 servers.  It's unnecessary
>>> overhead on the server side.
>>>
>>> To save network overhead, Cassandra 2.1 added support for response
>>> grouping (see
>>> http://www.datastax.com/dev/blog/cassandra-2-1-now-over-50-faster)
>>> which massively helps performance.  It provides the benefit of batches but
>>> without the coordinator overhead.
>>>
>>> Can you post your benchmark code?
>>>
>>> On Sat Dec 13 2014 at 6:10:36 AM Jonathan Haddad <j...@jonhaddad.com>
>>> wrote:
>>>
>>>> There are cases where it can.  For instance, if you batch multiple
>>>> mutations to the same partition (and talk to a replica for that partition)
>>>> they can reduce network overhead because they're effectively a single
>>>> mutation in the eye of the cluster.  However, if you're not doing that (and
>>>> most people aren't!) you end up putting additional pressure on the
>>>> coordinator because now it has to talk to several other servers.  If you
>>>> have 100 servers, and perform a mutation on 100 partitions, you could have
>>>> a coordinator that's
>>>>
>>>> 1) talking to every machine in the cluster and
>>>> b) waiting on a response from a significant portion of them
>>>>
>>>> before it can respond success or fail.  Any delay, from GC to a bad
>>>> disk, can affect the performance of the entire batch.
>>>>
>>>>
>>>> On Sat Dec 13 2014 at 4:17:33 AM Jack Krupansky <
>>>> j...@basetechnology.com> wrote:
>>>>
>>>>>   Jonathan and Ryan,
>>>>>
>>>>> Jonathan says “It is absolutely not going to help you if you're trying
>>>>> to lump queries together to reduce network & server overhead - in fact
>>>>> it'll do the opposite”, but I would note that the CQL3 spec says “The
>>>>> BATCH statement ... serves several purposes: 1. It saves network
>>>>> round-trips between the client and the server (and sometimes between the
>>>>> server coordinator and the replicas) when batching multiple updates.” Is
>>>>> the spec inaccurate? I mean, it seems in conflict with your statement.
>>>>>
>>>>> See:
>>>>> https://cassandra.apache.org/doc/cql3/CQL.html
>>>>>
>>>>> I see the spec as gospel – if it’s not accurate, let’s propose a
>>>>> change to make it accurate.
>>>>>
>>>>> The DataStax CQL doc is more nuanced: “Batching multiple statements
>>>>> can save network exchanges between the client/server and server
>>>>> coordinator/replicas. However, because of the distributed nature of
>>>>> Cassandra, spread requests across nearby nodes as much as possible to
>>>>> optimize performance. Using batches to optimize performance is usually not
>>>>> successful, as described in Using and misusing batches section. For
>>>>> information about the fastest way to load data, see "Cassandra: Batch
>>>>> loading without the Batch keyword."”
>>>>>
>>>>> Maybe what we really need is a “client/driver-side batch”, which is
>>>>> simply a way to collect “batches” of operations in the client/driver and
>>>>> then let the driver determine what degree of batching and asynchronous
>>>>> operation is appropriate.
>>>>>
>>>>> It might also be nice to have an inquiry for the cluster as to what
>>>>> batch size is most optimal for the cluster, like number of mutations in a
>>>>> batch and number of simultaneous connections, and to have that be dynamic
>>>>> based on overall cluster load.
>>>>>
>>>>> I would also note that the example in the spec has multiple inserts
>>>>> with different partition key values, which flies in the face of the
>>>>> admonition to to refrain from using server-side distribution of requests.
>>>>>
>>>>> At a minimum the CQL spec should make a more clear statement of intent
>>>>> and non-intent for BATCH.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>>  *From:* Jonathan Haddad <j...@jonhaddad.com>
>>>>> *Sent:* Friday, December 12, 2014 12:58 PM
>>>>> *To:* user@cassandra.apache.org ; Ryan Svihla <rsvi...@datastax.com>
>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>
>>>>> The really important thing to really take away from Ryan's original
>>>>> post is that batches are not there for performance.  The only case I
>>>>> consider batches to be useful for is when you absolutely need to know that
>>>>> several tables all get a mutation (via logged batches).  The use case for
>>>>> this is when you've got multiple tables that are serving as different 
>>>>> views
>>>>> for data.  It is absolutely not going to help you if you're trying to lump
>>>>> queries together to reduce network & server overhead - in fact it'll do 
>>>>> the
>>>>> opposite.  If you're trying to do that, instead perform many async
>>>>> queries.  The overhead of batches in cassandra is significant and you're
>>>>> going to hit a lot of problems if you use them excessively (timeouts /
>>>>> failures).
>>>>>
>>>>> tl;dr: you probably don't want batch, you most likely want many async
>>>>> calls
>>>>>
>>>>> On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <
>>>>> moham...@glassbeam.com> wrote:
>>>>>
>>>>>>  Ryan,
>>>>>>
>>>>>> Thanks for the quick response.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I did see that jira before posting my question on this list. However,
>>>>>> I didn’t see any information about why 5kb+ data will cause instability.
>>>>>> 5kb or even 50kb seems too small. For example, if each mutation is 1000+
>>>>>> bytes, then with just 5 mutations, you will hit that threshold.
>>>>>>
>>>>>>
>>>>>>
>>>>>> In addition, Patrick is saying that he does not recommend more than
>>>>>> 100 mutations per batch. So why not warn users just on the # of mutations
>>>>>> in a batch?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Mohammed
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Ryan Svihla [mailto:rsvi...@datastax.com]
>>>>>> *Sent:* Thursday, December 11, 2014 12:56 PM
>>>>>> *To:* user@cassandra.apache.org
>>>>>> *Subject:* Re: batch_size_warn_threshold_in_kb
>>>>>>
>>>>>>
>>>>>>
>>>>>> Nothing magic, just put in there based on experience. You can find
>>>>>> the story behind the original recommendation here
>>>>>>
>>>>>>
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-6487
>>>>>>
>>>>>>
>>>>>>
>>>>>> Key reasoning for the desire comes from Patrick McFadden:
>>>>>>
>>>>>>
>>>>>> "Yes that was in bytes. Just in my own experience, I don't recommend
>>>>>> more than ~100 mutations per batch. Doing some quick math I came up with 
>>>>>> 5k
>>>>>> as 100 x 50 byte mutations.
>>>>>>
>>>>>> Totally up for debate."
>>>>>>
>>>>>>
>>>>>>
>>>>>> It's totally changeable, however, it's there in no small part because
>>>>>> so many people confuse the BATCH keyword as a performance optimization,
>>>>>> this helps flag those cases of misuse.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <
>>>>>> moham...@glassbeam.com> wrote:
>>>>>>
>>>>>> Hi –
>>>>>>
>>>>>> The cassandra.yaml file has property called 
>>>>>> *batch_size_warn_threshold_in_kb.
>>>>>> *
>>>>>>
>>>>>> The default size is 5kb and according to the comments in the yaml
>>>>>> file, it is used to log WARN on any batch size exceeding this value in
>>>>>> kilobytes. It says caution should be taken on increasing the size of this
>>>>>> threshold as it can lead to node instability.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Does anybody know the significance of this magic number 5kb? Why
>>>>>> would a higher number (say 10kb) lead to node instability?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Mohammed
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> [image: datastax_logo.png] <http://www.datastax.com/>
>>>>>>
>>>>>> Ryan Svihla
>>>>>>
>>>>>> Solution Architect
>>>>>>
>>>>>>
>>>>>> [image: twitter.png] <https://twitter.com/foundev>[image:
>>>>>> linkedin.png] <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>>>
>>>>>>
>>>>>>
>>>>>> DataStax is the fastest, most scalable distributed database
>>>>>> technology, delivering Apache Cassandra to the world’s most innovative
>>>>>> enterprises. Datastax is built to be agile, always-on, and predictably
>>>>>> scalable to any size. With more than 500 customers in 45 countries, 
>>>>>> DataStax
>>>>>> is the database technology and transactional backbone of choice for the
>>>>>> worlds most innovative companies such as Netflix, Adobe, Intuit, and 
>>>>>> eBay.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>
>> --
>>
>> [image: datastax_logo.png] <http://www.datastax.com/>
>>
>> Ryan Svihla
>>
>> Solution Architect
>>
>> [image: twitter.png] <https://twitter.com/foundev> [image: linkedin.png]
>> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>
>> DataStax is the fastest, most scalable distributed database technology,
>> delivering Apache Cassandra to the world’s most innovative enterprises.
>> Datastax is built to be agile, always-on, and predictably scalable to any
>> size. With more than 500 customers in 45 countries, DataStax is the
>> database technology and transactional backbone of choice for the worlds
>> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>
>>
>

Re: batch_size_warn_threshold_in_kb

Reply via email to