Thanks for the analysis.  It should be pretty straightforward to
assign hints to one of 10 random partitions; handoff can then just
scan each partition key.  That's a small penalty on handoff in
exchange for mitigating your problem.
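
A rough sketch of that idea, assuming a hypothetical shard column (the column
names and CQL below are illustrative, not the real system.hints schema or an
actual patch):

    import java.util.UUID;
    import java.util.concurrent.ThreadLocalRandom;

    public class HintShardingSketch
    {
        static final int HINT_SHARDS = 10;

        // Write path: store the hint under one of HINT_SHARDS random partitions
        // for the target node instead of a single partition.
        static String insertHintCql(UUID targetId)
        {
            int shard = ThreadLocalRandom.current().nextInt(HINT_SHARDS);
            return String.format(
                "INSERT INTO system.hints (target_id, shard, hint_id, mutation) " +
                "VALUES (%s, %d, now(), ?)", targetId, shard);
        }

        // Handoff: scan each shard partition for the target node in turn.
        static void replayHints(UUID targetId)
        {
            for (int shard = 0; shard < HINT_SHARDS; shard++)
                System.out.println(String.format(
                    "SELECT hint_id, mutation FROM system.hints " +
                    "WHERE target_id = %s AND shard = %d", targetId, shard));
        }

        public static void main(String[] args)
        {
            UUID target = UUID.randomUUID();
            System.out.println(insertHintCql(target));
            replayHints(target);
        }
    }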

On Sun, Jul 13, 2014 at 4:25 PM, graham sanderson <gra...@vast.com> wrote:
> Here are some numbers from latest 2.1 (on the production level h/w)
>
> BTree is better off than SnapTreeMap was (note also that I allocated my
> cells on heap; however, all of the extra allocation under race conditions is
> really the Object[] itself, so it is going to be on heap anyway):
> - Generally the data structure overhead is much smaller.
> - The contention cost drops off much more quickly with the number of partitions
> (i.e. 16 is actually pretty fine, though if you're going with >1 then 256 is not
> much harder than 16, unless you want to use a single IN clause).
>
> However, it still suffers when a huge amount of almost-sorted data is
> inserted very fast, as in the hints case (note that I tried with random
> timestamps out of curiosity; it was somewhat better, as expected, but not
> significantly so).
>
> So the moral of the story seems to be that we probably shouldn't be writing
> hints for a node to a single partition. (Ironically, lots of small inserts
> means the write rate is faster, hence more concurrent hinting, and a mere 100M
> of inserted data ends up allocating tens of gigabytes of RAM on the node during
> the hinting process.) Any thoughts appreciated, but I'll likely move this
> thread to a ticket shortly.
>
> Note, finally, that with SnapTreeMap a lot of the extra, rapidly allocated data
> stayed live long enough to spill into the old gen, eventually causing promotion
> failure of memtables on the hinting nodes too; that may not be the case with
> BTree on 2.1, and of course off-heap memtables will also help with this issue.
>
>     [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1
>     [junit]   Duration = 1129ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 36 ms for 2 collections
>     [junit]   Approx allocation = 249MB vs 38MB; ratio to raw data size =
> 6.5352668
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 64) partitions = 16
>     [junit]   Duration = 349ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 11 ms for 1 collections
>     [junit]   Approx allocation = 197MB vs 38MB; ratio to raw data size =
> 5.1714
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 64) partitions = 256
>     [junit]   Duration = 835ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 19 ms for 1 collections
>     [junit]   Approx allocation = 147MB vs 38MB; ratio to raw data size =
> 3.8699954
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1024
>     [junit]   Duration = 635ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 23 ms for 1 collections
>     [junit]   Approx allocation = 118MB vs 38MB; ratio to raw data size =
> 3.1053228
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1
>     [junit]   Duration = 2079ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 130 ms for 24 collections
>     [junit]   Approx allocation = 6484MB vs 38MB; ratio to raw data size =
> 169.9863816
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 16
>     [junit]   Duration = 363ms maxConcurrency = 100
>     [junit]   Approx allocation = 238MB vs 38MB; ratio to raw data size =
> 6.2529482
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 256
>     [junit]   Duration = 370ms maxConcurrency = 96
>     [junit]   Approx allocation = 154MB vs 38MB; ratio to raw data size =
> 4.055179
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1024
>     [junit]   Duration = 353ms maxConcurrency = 100
>     [junit]   Approx allocation = 124MB vs 38MB; ratio to raw data size =
> 3.2727604
>     [junit]
>     [junit] --------------------------------------------------
>     [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1
>     [junit]   Duration = 1055ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 19 ms for 1 collections
>     [junit]   Approx allocation = 237MB vs 83MB; ratio to raw data size =
> 2.831478363636364
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 256) partitions = 16
>     [junit]   Duration = 667ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 16 ms for 1 collections
>     [junit]   Approx allocation = 235MB vs 83MB; ratio to raw data size =
> 2.802512909090909
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 256) partitions = 256
>     [junit]   Duration = 606ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 19 ms for 1 collections
>     [junit]   Approx allocation = 175MB vs 83MB; ratio to raw data size =
> 2.091785727272727
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1024
>     [junit]   Duration = 268ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 14 ms for 1 collections
>     [junit]   Approx allocation = 149MB vs 83MB; ratio to raw data size =
> 1.7858322727272726
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1
>     [junit]   Duration = 2120ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 143 ms for 23 collections
>     [junit]   Approx allocation = 6248MB vs 83MB; ratio to raw data size =
> 74.454192
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 256) partitions = 16
>     [junit]   Duration = 310ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 33 ms for 1 collections
>     [junit]   Approx allocation = 236MB vs 83MB; ratio to raw data size =
> 2.817435
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 256) partitions = 256
>     [junit]   Duration = 383ms maxConcurrency = 99
>     [junit]   Approx allocation = 199MB vs 83MB; ratio to raw data size =
> 2.3777502727272726
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1024
>     [junit]   Duration = 387ms maxConcurrency = 98
>     [junit]   GC for PS Scavenge: 25 ms for 1 collections
>     [junit]   Approx allocation = 176MB vs 83MB; ratio to raw data size =
> 2.1082142727272726
>     [junit]
>     [junit] --------------------------------------------------
>     [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1
>     [junit]   Duration = 1230ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 64 ms for 3 collections
>     [junit]   Approx allocation = 457MB vs 267MB; ratio to raw data size =
> 1.711579
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 16
>     [junit]   Duration = 898ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 58 ms for 3 collections
>     [junit]   Approx allocation = 427MB vs 267MB; ratio to raw data size =
> 1.6025206571428572
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 256
>     [junit]   Duration = 944ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 72 ms for 3 collections
>     [junit]   Approx allocation = 377MB vs 267MB; ratio to raw data size =
> 1.4138783714285714
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1024
>     [junit]   Duration = 877ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 78 ms for 3 collections
>     [junit]   Approx allocation = 347MB vs 267MB; ratio to raw data size =
> 1.3008453714285715
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1
>     [junit]   Duration = 2073ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 166 ms for 28 collections
>     [junit]   Approx allocation = 7052MB vs 267MB; ratio to raw data size =
> 26.410422285714287
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 16
>     [junit]   Duration = 398ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 45 ms for 2 collections
>     [junit]   Approx allocation = 375MB vs 267MB; ratio to raw data size =
> 1.4073654
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 256
>     [junit]   Duration = 448ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 71 ms for 3 collections
>     [junit]   Approx allocation = 369MB vs 267MB; ratio to raw data size =
> 1.3841326
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1024
>     [junit]   Duration = 448ms maxConcurrency = 99
>     [junit]   GC for PS Scavenge: 77 ms for 3 collections
>     [junit]   Approx allocation = 354MB vs 267MB; ratio to raw data size =
> 1.3269438571428571
>     [junit]
>     [junit] --------------------------------------------------
>     [junit] ==================================================
>
> On Jul 13, 2014, at 11:34 AM, graham sanderson <gra...@vast.com> wrote:
>
> OK, some numbers from my test:
>
> Basically the test mimics the 2.0.x write path for a hint:
> - there are P partitions (each has an AtomicReference to a SnapTreeMap sorted
> by a timeuuid comparator)
> - there are N threads
> - “elements” is the total number of timeuuid/value pairs (aka columns)
> - each thread inserts the same number of elements (“elements”/N)
> - inserting an element is done by getting the atomic reference, cloning the
> SnapTreeMap, adding the element, and trying to CAS the updated reference back
> - the value of each element is a byte[] of the indicated size, allocated by the
> inserting thread
>
> The test counts, approximately, the amount of eden allocation; the ratio shown
> is how much was allocated versus the rough size of the raw data to be inserted.
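>
> (The thread doesn't show how the test gathers the "Approx allocation" numbers;
> a minimal sketch of one way to measure per-thread allocation on a HotSpot JVM,
> using the hypothetical AllocationProbe class below, would be:
>
>     import java.lang.management.ManagementFactory;
>
>     public class AllocationProbe
>     {
>         // HotSpot-specific MXBean that exposes bytes allocated per thread.
>         private static final com.sun.management.ThreadMXBean THREADS =
>                 (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
>
>         static long allocatedBytes(Thread t)
>         {
>             return THREADS.getThreadAllocatedBytes(t.getId());
>         }
>
>         public static void main(String[] args)
>         {
>             long before = allocatedBytes(Thread.currentThread());
>             byte[][] junk = new byte[1024][1024];   // allocate roughly 1MB
>             long after = allocatedBytes(Thread.currentThread());
>             System.out.printf("approx allocation = %d KB (%d arrays)%n",
>                               (after - before) / 1024, junk.length);
>         }
>     }
>
> Summing that measurement across the inserting threads before and after the run
> would give a figure comparable to the ones reported here.)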
>
> This test is a worst-case scenario insofar as there is no sleeping (though on
> my laptop, as you can see from maxConcurrency - the maximum number of threads
> inside the element loop at any point - I fail to hit it very hard, whereas one
> of our production-class machines can sustain 100-thread concurrency).
>
> Full results are at the bottom (note I ran with 1 thread and 100 threads,
> element sizes of 64, 256, and 1024 bytes, and 1, 16, 256, and 1024 partitions).
>
> Here are some worst-case examples (obviously the ratio is higher even for
> high partition counts here, because the size of even a single tree structure
> becomes significant compared to the size of the data stored):
>
>
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1
>     [junit]   Duration = 3922ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 101 ms for 28 collections
>     [junit]   Approx allocation = 8691MB vs 38MB; ratio to raw data size =
> 227.833744
>     [junit]
>
> vs
>
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1024
>     [junit]   Duration = 399ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 21 ms for 1 collections
>     [junit]   Approx allocation = 320MB vs 38MB; ratio to raw data size =
> 8.4010366
>
> Note, I’ll try the same test on 2.1 against the new write path.
>
> =====
>
> FYI, insertion of one element looks like this:
>
>     // Pick a random partition's map reference, clone the whole SnapTreeMap,
>     // add the element, and CAS the reference back, retrying if another thread
>     // won the race (the 2.0.x pattern this test mimics).
>     AtomicReference<SnapTreeMap<ByteBuffer, byte[]>> ref =
>             maps.get(random.nextInt(partitionCount));
>     SnapTreeMap<ByteBuffer, byte[]> current, modified;
>     do
>     {
>         current = ref.get();
>         modified = current.clone();              // lazy copy-on-write clone
>         if (null != modified.putIfAbsent(key, value))
>         {
>             modified.replace(key, value);        // key already present: overwrite
>         }
>     } while (!ref.compareAndSet(current, modified));
>
>
> Full results from my laptop
>
>     [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1
>     [junit]   Duration = 499ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 29 ms for 4 collections
>     [junit]   Approx allocation = 539MB vs 38MB; ratio to raw data size =
> 14.1319278
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 64) partitions = 16
>     [junit]   Duration = 445ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 26 ms for 3 collections
>     [junit]   Approx allocation = 446MB vs 38MB; ratio to raw data size =
> 11.7168844
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 64) partitions = 256
>     [junit]   Duration = 395ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 20 ms for 2 collections
>     [junit]   Approx allocation = 370MB vs 38MB; ratio to raw data size =
> 9.7106606
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1024
>     [junit]   Duration = 399ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 24 ms for 2 collections
>     [junit]   Approx allocation = 311MB vs 38MB; ratio to raw data size =
> 8.1768922
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1
>     [junit]   Duration = 1143ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 148 ms for 18 collections
>     [junit]   Approx allocation = 3853MB vs 38MB; ratio to raw data size =
> 101.0072052
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 16
>     [junit]   Duration = 171ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 22 ms for 2 collections
>     [junit]   Approx allocation = 605MB vs 38MB; ratio to raw data size =
> 15.8683486
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 256
>     [junit]   Duration = 8ms maxConcurrency = 12
>     [junit]   GC for PS Scavenge: 14 ms for 1 collections
>     [junit]   Approx allocation = 370MB vs 38MB; ratio to raw data size =
> 9.713281
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1024
>     [junit]   Duration = 3ms maxConcurrency = 9
>     [junit]   GC for PS Scavenge: 20 ms for 1 collections
>     [junit]   Approx allocation = 327MB vs 38MB; ratio to raw data size =
> 8.5870928
>     [junit]
>     [junit] --------------------------------------------------
>     [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1
>     [junit]   Duration = 492ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 25 ms for 2 collections
>     [junit]   Approx allocation = 557MB vs 83MB; ratio to raw data size =
> 6.638187636363637
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 256) partitions = 16
>     [junit]   Duration = 432ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 26 ms for 2 collections
>     [junit]   Approx allocation = 470MB vs 83MB; ratio to raw data size =
> 5.603587909090909
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 256) partitions = 256
>     [junit]   Duration = 430ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 38 ms for 3 collections
>     [junit]   Approx allocation = 406MB vs 83MB; ratio to raw data size =
> 4.845429545454546
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1024
>     [junit]   Duration = 433ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 41 ms for 3 collections
>     [junit]   Approx allocation = 355MB vs 83MB; ratio to raw data size =
> 4.231936636363637
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1
>     [junit]   Duration = 1154ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 163 ms for 25 collections
>     [junit]   Approx allocation = 4344MB vs 83MB; ratio to raw data size =
> 51.769667727272726
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 256) partitions = 16
>     [junit]   Duration = 279ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 61 ms for 4 collections
>     [junit]   Approx allocation = 506MB vs 83MB; ratio to raw data size =
> 6.037446818181818
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 256) partitions = 256
>     [junit]   Duration = 6ms maxConcurrency = 11
>     [junit]   GC for PS Scavenge: 36 ms for 3 collections
>     [junit]   Approx allocation = 402MB vs 83MB; ratio to raw data size =
> 4.795717636363636
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1024
>     [junit]   Duration = 5ms maxConcurrency = 38
>     [junit]   GC for PS Scavenge: 46 ms for 3 collections
>     [junit]   Approx allocation = 362MB vs 83MB; ratio to raw data size =
> 4.3233903636363635
>     [junit]
>     [junit] --------------------------------------------------
>     [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1
>     [junit]   Duration = 595ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 109 ms for 6 collections
>     [junit]   Approx allocation = 766MB vs 267MB; ratio to raw data size =
> 2.868619457142857
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 16
>     [junit]   Duration = 503ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 66 ms for 5 collections
>     [junit]   Approx allocation = 676MB vs 267MB; ratio to raw data size =
> 2.531949057142857
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 256
>     [junit]   Duration = 498ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 78 ms for 5 collections
>     [junit]   Approx allocation = 583MB vs 267MB; ratio to raw data size =
> 2.1853203714285714
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1024
>     [junit]   Duration = 473ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 86 ms for 4 collections
>     [junit]   Approx allocation = 537MB vs 267MB; ratio to raw data size =
> 2.012229685714286
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1
>     [junit]   Duration = 1338ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 197 ms for 30 collections
>     [junit]   Approx allocation = 3883MB vs 267MB; ratio to raw data size =
> 14.5419178
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 16
>     [junit]   Duration = 278ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 128 ms for 6 collections
>     [junit]   Approx allocation = 752MB vs 267MB; ratio to raw data size =
> 2.817227914285714
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 256
>     [junit]   Duration = 80ms maxConcurrency = 42
>     [junit]   GC for PS Scavenge: 99 ms for 5 collections
>     [junit]   Approx allocation = 596MB vs 267MB; ratio to raw data size =
> 2.2341639714285715
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1024
>     [junit]   Duration = 4ms maxConcurrency = 26
>     [junit]   GC for PS Scavenge: 76 ms for 4 collections
>     [junit]   Approx allocation = 547MB vs 267MB; ratio to raw data size =
> 2.0490342285714287
>     [junit]
>     [junit] --------------------------------------------------
>     [junit] ==================================================
>
> Full results from a 16 core (32 thread) machine
>
>     [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1
>     [junit]   Duration = 1431ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 52 ms for 4 collections
>     [junit]   Approx allocation = 572MB vs 38MB; ratio to raw data size =
> 15.0195638
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 64) partitions = 16
>     [junit]   Duration = 1216ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 46 ms for 3 collections
>     [junit]   Approx allocation = 491MB vs 38MB; ratio to raw data size =
> 12.8950428
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 64) partitions = 256
>     [junit]   Duration = 899ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 22 ms for 2 collections
>     [junit]   Approx allocation = 458MB vs 38MB; ratio to raw data size =
> 12.0219362
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 64) partitions = 1024
>     [junit]   Duration = 546ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 10 ms for 1 collections
>     [junit]   Approx allocation = 309MB vs 38MB; ratio to raw data size =
> 8.1213414
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1
>     [junit]   Duration = 3922ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 101 ms for 28 collections
>     [junit]   Approx allocation = 8691MB vs 38MB; ratio to raw data size =
> 227.833744
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 16
>     [junit]   Duration = 255ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 18 ms for 2 collections
>     [junit]   Approx allocation = 654MB vs 38MB; ratio to raw data size =
> 17.1485322
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 256
>     [junit]   Duration = 225ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 13 ms for 1 collections
>     [junit]   Approx allocation = 368MB vs 38MB; ratio to raw data size =
> 9.6528628
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 64) partitions = 1024
>     [junit]   Duration = 399ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 21 ms for 1 collections
>     [junit]   Approx allocation = 320MB vs 38MB; ratio to raw data size =
> 8.4010366
>     [junit]
>     [junit] --------------------------------------------------
>     [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1
>     [junit]   Duration = 1350ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 31 ms for 2 collections
>     [junit]   Approx allocation = 462MB vs 83MB; ratio to raw data size =
> 5.509877818181818
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 256) partitions = 16
>     [junit]   Duration = 861ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 28 ms for 2 collections
>     [junit]   Approx allocation = 491MB vs 83MB; ratio to raw data size =
> 5.854580272727273
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 256) partitions = 256
>     [junit]   Duration = 445ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 10 ms for 1 collections
>     [junit]   Approx allocation = 399MB vs 83MB; ratio to raw data size =
> 4.764038727272728
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 256) partitions = 1024
>     [junit]   Duration = 403ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 10 ms for 1 collections
>     [junit]   Approx allocation = 356MB vs 83MB; ratio to raw data size =
> 4.242789363636364
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1
>     [junit]   Duration = 3530ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 104 ms for 30 collections
>     [junit]   Approx allocation = 9261MB vs 83MB; ratio to raw data size =
> 110.35432245454545
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 256) partitions = 16
>     [junit]   Duration = 231ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 17 ms for 2 collections
>     [junit]   Approx allocation = 722MB vs 83MB; ratio to raw data size =
> 8.607574
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 256) partitions = 256
>     [junit]   Duration = 205ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 11 ms for 1 collections
>     [junit]   Approx allocation = 411MB vs 83MB; ratio to raw data size =
> 4.901291545454545
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 256) partitions = 1024
>     [junit]   Duration = 343ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 14 ms for 1 collections
>     [junit]   Approx allocation = 362MB vs 83MB; ratio to raw data size =
> 4.321081454545454
>     [junit]
>     [junit] --------------------------------------------------
>     [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1
>     [junit]   Duration = 1039ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 63 ms for 5 collections
>     [junit]   Approx allocation = 597MB vs 267MB; ratio to raw data size =
> 2.238462485714286
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 16
>     [junit]   Duration = 1285ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 50 ms for 4 collections
>     [junit]   Approx allocation = 687MB vs 267MB; ratio to raw data size =
> 2.5740599428571427
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 256
>     [junit]   Duration = 871ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 50 ms for 4 collections
>     [junit]   Approx allocation = 576MB vs 267MB; ratio to raw data size =
> 2.158096
>     [junit]
>     [junit] Threads = 1 elements = 250000 (of size 1024) partitions = 1024
>     [junit]   Duration = 624ms maxConcurrency = 1
>     [junit]   GC for PS Scavenge: 47 ms for 4 collections
>     [junit]   Approx allocation = 531MB vs 267MB; ratio to raw data size =
> 1.9889662
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1
>     [junit]   Duration = 3204ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 121 ms for 35 collections
>     [junit]   Approx allocation = 10597MB vs 267MB; ratio to raw data size =
> 39.68819474285714
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 16
>     [junit]   Duration = 265ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 30 ms for 3 collections
>     [junit]   Approx allocation = 508MB vs 267MB; ratio to raw data size =
> 1.9032105142857143
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 256
>     [junit]   Duration = 323ms maxConcurrency = 100
>     [junit]   GC for PS Scavenge: 46 ms for 4 collections
>     [junit]   Approx allocation = 582MB vs 267MB; ratio to raw data size =
> 2.180992342857143
>     [junit]
>     [junit] Threads = 100 elements = 250000 (of size 1024) partitions = 1024
>     [junit]   Duration = 434ms maxConcurrency = 99
>     [junit]   GC for PS Scavenge: 66 ms for 4 collections
>     [junit]   Approx allocation = 544MB vs 267MB; ratio to raw data size =
> 2.039445542857143
>     [junit]
>     [junit] --------------------------------------------------
>     [junit] ==================================================
>
> On Jul 12, 2014, at 9:48 PM, graham sanderson <gra...@vast.com> wrote:
>
> Running on 2.0.5 we’ve noticed occasional MASSIVE spikes in memory
> allocation that happen under write load, but way out of proportion to the
> amount of data actually being written.
>
> My suspicion is that the problem is related to hinted handoff, and basically
> follows this pattern: some sort of non-trivial GC pause on one node (probably
> caused by promotion failure of a memtable due to slow/eventual fragmentation of
> the tenured gen) causes other nodes to start writing hints for that node, and
> the hinting turns out to be much worse memory-wise than the writes themselves.
>
> Obviously there is the JIRA issue in 3.0 to not store hints in a table, but in
> the meantime I thought I'd take a quick peek.
>
> I’m in the process of writing a test to confirm what I think is happening,
> but I thought I’d throw this out there anyway for discussion in the context
> of 2.1 changes.
>
> ====
>
> In this case, we are adding a single column (of small delta data) to a lot of
> partitions at once (let's say a hundred thousand partitions for each node after
> replication) pretty fast, with not a lot of data for each (probably a few
> hundred bytes max), so this really puts very little strain on things at all.
> However, if we are hinting, then ALL these updates for one target node end up
> going to the same partition in system.hints. My assumption is that this,
> combined with SnapTree, is a huge problem. While the SnapTree clone itself may
> be quick, we are essentially appending each time, so there is probably a huge
> amount of rebalancing going on, along with the lazy copy-on-write and the
> inherent concurrent race waste. I am guesstimating that we are actually
> allocating several orders of magnitude more memory than the regular inserts
> themselves do. The fact that our H/W can handle a very, very high concurrent
> write volume probably makes the problem much worse.
>
> So my first thought was: fine, why not just change hinted handoff to shard (an
> extra partition key column alongside the existing node uuid) based on, say, a
> fixed-length 8-bit hash of the original row mutation's partition key. This
> would spread the load some and probably solve the problem, but would require a
> schema change to system.hints, which is annoying. That said, appending (sorted)
> data concurrently to the same partition is the sort of thing many people might
> do, though probably not as fast as in this hinting case, so it probably needs
> to be better anyway.
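>
> (A minimal sketch of that 8-bit shard idea; the hintShard() helper and the
> hash function are illustrative assumptions, not a proposal for the actual
> implementation:
>
>     import java.nio.ByteBuffer;
>
>     public class HintShardHash
>     {
>         // Derive a 0-255 shard from the original mutation's partition key, so
>         // hints for one target node fan out over up to 256 hint partitions.
>         static int hintShard(ByteBuffer partitionKey)
>         {
>             int h = 0;
>             for (int i = partitionKey.position(); i < partitionKey.limit(); i++)
>                 h = 31 * h + partitionKey.get(i);
>             return h & 0xFF;   // fixed-length 8-bit shard
>         }
>
>         public static void main(String[] args)
>         {
>             System.out.println("shard = " +
>                 hintShard(ByteBuffer.wrap("some-row-key".getBytes())));
>         }
>     }
>
> The shard value would then be written as the extra partition key column
> alongside the target node uuid.)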
>
> So, searching, I came across
> https://issues.apache.org/jira/browse/CASSANDRA-6271 for 2.1, which replaces
> SnapTree. So my final question is (and I will try to test this on 2.1 soon): do
> we expect that this change will make the hint problem moot anyway, i.e. that
> heavy concurrent writes to the same partition key are OK, or does it make sense
> to shard hints until 3.x makes everything groovy?
>
> Thanks,
>
> Graham.
>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced
