Hi Dave,

Just to follow up as I believe we've resolved this issue. 

TL;DR: Ever-growing maps, plus convoluted logic inside the lock that 
guards them, lead to poor response times.

Your response prompted us not to write off what we had assumed was an 
insignificant increase in heap size (~100mb over 24 hours): it's steady 
and ever-increasing, which is bad regardless of the rate.

That said, I was a little surprised that the difference in GC STW time 
between a 950mb and an 850mb heap could account for a 5.5ms increase in 
our mean response time, even at 35k QPS. That still didn't feel like 
"working as designed", so we decided to throw that correlation out as a 
red herring, at least temporarily.

Once we had freed ourselves of that correlation and focused on the 
increase in heap size that we had at first written off, we found the 
source of the issue. We had 5 maps with a single lock around their access, 
and one of them, keyed by strings, was ever-growing (don't judge, it was 
before my time!). As that map grew, the lock was held progressively longer 
before being released. There was a lot going on inside the lock, and I 
didn't benchmark to find out exactly which part was responsible for the 
growing hold time.

Ugh.
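
In case it's useful to anyone who finds this thread later, the problem 
looked roughly like the sketch below (names invented and heavily 
simplified; not our actual code): several maps behind one mutex, one of 
them never shrinking, and non-trivial work done while the lock is held.

package main

import "sync"

type session struct{ user string }

type registry struct {
    mu       sync.Mutex
    sessions map[string]*session // ever-growing: nothing ever deleted entries
    byUser   map[string][]string // plus several more maps behind the same mu
}

func (r *registry) record(id string, s *session) {
    r.mu.Lock()
    defer r.mu.Unlock()

    r.sessions[id] = s
    r.byUser[s.user] = append(r.byUser[s.user], id)

    // "Lots of things going on inside the lock", e.g. bookkeeping that
    // walks the whole map, so the lock is held longer as the map grows
    // and every other caller queues up behind it.
    for range r.sessions {
    }
}

func main() {
    r := &registry{
        sessions: make(map[string]*session),
        byUser:   make(map[string][]string),
    }
    r.record("abc", &session{user: "evan"})
}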

Fortunately, upon investigation we determined that the ever-growing map 
didn't need to grow forever: there is a clear and logical time and place 
to delete elements. We refactored a bit to remove all but 2 of those maps, 
ensure neither of them grows very large, and keep them local to a single 
goroutine, which simplified the logic all around. Signals to add, remove, 
and report on entries come in on buffered channels, so no locks are 
needed.
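
Roughly, again with invented names and heavily simplified, the new shape 
is a single goroutine that owns the map outright (so no mutex is needed) 
and drains buffered channels of add/remove/report signals:

package main

import "fmt"

// addReq, reportReq and the channel sizes below are made-up placeholders,
// just to show the shape.
type addReq struct {
    id   string
    user string
}

type reportReq struct {
    reply chan int // replies with the current number of entries
}

func owner(adds <-chan addReq, removes <-chan string, reports <-chan reportReq) {
    sessions := make(map[string]string) // local to this goroutine only

    for {
        select {
        case a := <-adds:
            sessions[a.id] = a.user
        case id := <-removes:
            delete(sessions, id) // the "clear and logical" deletion point
        case r := <-reports:
            r.reply <- len(sessions)
        }
    }
}

func main() {
    adds := make(chan addReq, 1024) // buffered so senders rarely block
    removes := make(chan string, 1024)
    reports := make(chan reportReq)

    go owner(adds, removes, reports)

    adds <- addReq{id: "abc", user: "evan"}

    // Signals are asynchronous, so a report reflects whatever the owner
    // has processed so far.
    reply := make(chan int)
    reports <- reportReq{reply: reply}
    fmt.Println("entries so far:", <-reply)
}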

This change has been running for over 48 hours now. The heap/GC increase 
is gone, and so is our progressively increasing response time.

Sweet.

Thanks again!

Evan

On Monday, 12 December 2016 14:36:14 UTC-8, Evan Digby wrote:
>
> Hi Dave,
>
> Thanks for the insight. I'll look further into the heap creep. I'll also 
> try to get together a log of a run past 8 hours over the next day or so on 
> a machine we're not regularly restarting the go process on to see if it 
> shows anything different. 
>
> Thanks again,
>
> Evan
>
> On Monday, 12 December 2016 14:30:45 UTC-8, Dave Cheney wrote:
>>
>> Thanks for posting this. Once your application gets to about 1600-1900mb 
>> goal the GC pauses seem to stabilise at 3.3+3.2. It looks like the amount 
>> of memory in use is creeping up slowly, from around 850 to 950 over the 
>> span of the trace. GC sweep time is proportional to the size of the heap, 
>> so this is expected.
>>
>> On Tuesday, 13 December 2016 09:20:07 UTC+11, Evan Digby wrote:
>>>
>>> Hi Dave,
>>>
>>> Thanks for the reply. Attached is the full GCTRACE for ~8 hours of run. 
>>> I believe it shows the creep, but I'm not terribly familiar with the 
>>> gctrace output. Something that should probably change over the next day or 
>>> so! (my familiarity)
>>>
>>> Starts with:
>>>
>>> gc 1 @0.116s 0%: 0.079+0.69+0.23 ms clock, 0.55+0.33/1.5/0.63+1.6 ms cpu, 17->18->15 MB, 18 MB goal, 40 P
>>> gc 2 @0.117s 0%: 0.10+1.3+0.52 ms clock, 0.80+0.33/5.5/1.0+4.1 ms cpu, 16->21->20 MB, 31 MB goal, 40 P
>>> gc 3 @0.126s 1%: 0.083+5.3+5.5 ms clock, 1.2+0/5.9/6.3+82 ms cpu, 36->49->35 MB, 40 MB goal, 40 P
>>> gc 4 @0.230s 1%: 0.090+2.0+0.25 ms clock, 1.5+0.077/7.5/1.4+4.2 ms cpu, 39->40->24 MB, 70 MB goal, 40 P
>>> gc 5 @0.271s 1%: 0.065+2.9+0.24 ms clock, 1.1+0.050/20/20+4.4 ms cpu, 38->39->31 MB, 49 MB goal, 40 P
>>> gc 6 @0.348s 1%: 0.082+9.9+0.27 ms clock, 2.4+0.10/57/103+8.3 ms cpu, 55->56->43 MB, 62 MB goal, 40 P
>>> gc 7 @0.477s 1%: 0.034+17+0.26 ms clock, 1.1+0.057/85/133+8.6 ms cpu, 80->83->64 MB, 86 MB goal, 40 P
>>> gc 8 @0.684s 1%: 0.047+41+0.27 ms clock, 1.5+0/123/2.5+8.8 ms cpu, 122->123->79 MB, 129 MB goal, 40 P
>>> gc 9 @0.975s 4%: 0.004+0+37 ms clock, 0.14+0/123/2.5+1188 ms cpu, 126->126->107 MB, 126 MB goal, 40 P (forced)
>>> gc 10 @1.071s 4%: 0.035+61+0.26 ms clock, 1.1+17/231/563+8.4 ms cpu, 295->295->201 MB, 296 MB goal, 40 P
>>>
>>> Ends with:
>>>
>>> gc 17212 @28778.765s 3%: 3.0+117+3.7 ms clock, 96+1582/1042/546+118 ms cpu, 1792->1802->924 MB, 1837 MB goal, 40 P
>>> gc 17213 @28780.258s 3%: 2.6+119+4.1 ms clock, 83+1544/996/539+133 ms cpu, 1803->1815->929 MB, 1848 MB goal, 40 P
>>> gc 17214 @28781.711s 3%: 2.9+142+3.7 ms clock, 94+1588/1088/370+121 ms cpu, 1812->1825->936 MB, 1858 MB goal, 40 P
>>> gc 17215 @28783.094s 3%: 3.2+107+3.3 ms clock, 104+1723/1061/452+108 ms cpu, 1827->1838->945 MB, 1873 MB goal, 40 P
>>> gc 17216 @28784.500s 3%: 3.0+98+2.9 ms clock, 96+1652/943/463+95 ms cpu, 1844->1854->947 MB, 1891 MB goal, 40 P
>>> gc 17217 @28785.928s 3%: 2.7+118+3.4 ms clock, 87+1558/991/451+110 ms cpu, 1848->1861->948 MB, 1895 MB goal, 40 P
>>> gc 17218 @28787.284s 3%: 2.8+115+4.2 ms clock, 91+1588/932/461+135 ms cpu, 1850->1862->960 MB, 1897 MB goal, 40 P
>>> gc 17219 @28788.667s 3%: 3.1+116+3.3 ms clock, 100+1421/1101/466+107 ms cpu, 1873->1888->961 MB, 1921 MB goal, 40 P
>>> gc 17220 @28790.059s 3%: 3.0+112+3.4 ms clock, 96+1663/1065/505+109 ms cpu, 1876->1888->967 MB, 1923 MB goal, 40 P
>>> gc 17221 @28791.435s 3%: 3.3+99+3.2 ms clock, 105+1685/986/420+104 ms cpu, 1886->1897->968 MB, 1934 MB goal, 40 P
>>>
>>>
>>> Thank you!
>>>
>>> Evan
>>>
>>> On Monday, 12 December 2016 13:42:32 UTC-8, Dave Cheney wrote:
>>>>
>>>> Can you try turning on GODEBUG=gctrace=1 and confirming that the pause 
>>>> times are increasing? If you could post the entire output, that would be 
>>>> most helpful. 
>>>>
>>>> Thanks
>>>>
>>>> Dave
>>>>
>>>>
