Hey Jesper,

Thanks for such an insightful reply. The scenario I am trying to implement is basically chat: with huge chat rooms (thousands of people), all the listener goroutines run into exactly the situation you describe, which is what I am benchmarking here. To elaborate a little more, right now each user in a chat room is a goroutine waiting for a message on a channel (shameless plug: the project is available at http://raspchat.com/). This setup runs on a memory-constrained Raspberry Pi with 512MB of RAM and a few cores. What I noticed when channels were flooded by people from 4chan was excessive CPU and memory usage, which led me to believe that the channels are probably the bottleneck. So yes, I am running on a shared-memory system for now, and the Disruptor pattern sounds like an interesting approach I would love to try.
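For context, the delivery path today is essentially one goroutine and one buffered channel per connected user, with the room looping over every subscriber channel for each incoming message. A stripped-down sketch of that shape (identifiers here are illustrative, not the actual raspchat code):

    package main

    import "fmt"

    // Room fans a message out to one channel per subscriber. This is a
    // simplified illustration of the current design, not the real code.
    type Room struct {
        subscribers map[string]chan string
    }

    // Publish does one channel send per subscriber, so a flooded room
    // with thousands of users pays O(subscribers) per message.
    func (r *Room) Publish(msg string) {
        for _, ch := range r.subscribers {
            select {
            case ch <- msg: // deliver if the reader's buffer has room
            default: // never block the room on a slow reader
            }
        }
    }

    func main() {
        r := &Room{subscribers: map[string]chan string{
            "alice": make(chan string, 8),
            "bob":   make(chan string, 8),
        }}
        r.Publish("hello")
        fmt.Println(<-r.subscribers["alice"], <-r.subscribers["bob"])
    }

So every publish costs a channel send per subscriber, which matches the CPU spike I see when a big room gets flooded.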
Thanks for the idea, and I will follow up with my findings here.

On Monday, July 17, 2017 at 10:23:04 AM UTC-7, Jesper Louis Andersen wrote:
>
> Your benchmarks are not really *doing* something once they get data. This
> means that eventual conflicts on (internal or external) locks are going to
> be hit all the time. In turn, you are likely to measure the performance of
> your hardware in the event of excessive lock contention (this hypothesis
> can be verified: block profiling).
>
> Were you to process something before going back to the barrier or channel,
> you would be less likely to hit the problem. Though you should be aware that
> sometimes your cores will coordinate (because the work they do will tend to
> bring them back to the lock at the same point in time, a problem similar
> to TCP incast).
>
> As you go from a single core to multiple cores, an important event happens
> in the system. In a single-core system, you don't need the hardware to
> coordinate. This is much faster than the case where the hardware might need
> to tell other physical cores about what is going on, especially under
> contention. In short, there is a gap between the single-core system and a
> multicore system due to communication overhead. In high-performance
> computing a common mistake is to measure the *same* implementation in the
> serial and parallel case, even though the serial case can be made to avoid
> taking locks and so on. A fair benchmark would compare the fastest
> single-core implementation with the fastest multi-core ditto.
>
> The upshot of having access to multiple cores is that you can get them all
> to do work, and this can lead to a substantial speedup. But there has to be
> work to do. Your benchmark has almost no work to do, so most of the cores
> are just competing against each other.
>
> For a PubSub implementation you need to think about the effort involved in
> the implementation. Up to a point, I think a channel-based solution or a
> barrier-based solution is adequate. For systems where you need low latency,
> go for a pattern like the Disruptor pattern [0], which should be doable in
> Go. This pattern works with a ring buffer of events and two types of "clock
> hands". One hand is the point of insertion of a new event; the other
> hand(s) track where the readers currently are in their processing. By
> tracking the hands through atomics, you can get messaging overhead down by
> a lot. The trick is that each thread writes to its own hand and other
> threads can read an eventually consistent snapshot. As long as the buffer
> doesn't fill up completely, the eventually consistent view of an
> ever-increasing counter is enough. Getting an older value is not a big
> deal, as long as the value isn't undefined.
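To check my understanding of the two kinds of "clock hands" before trying this on real traffic, here is a rough, untested sketch of a single-writer, multi-reader ring buffer in Go. All names and details are mine, not from the paper, and a production version would park or back off instead of busy-spinning:

    package main

    import (
        "fmt"
        "runtime"
        "sync"
        "sync/atomic"
    )

    const size = 8 // power of two, so slot index is seq & (size - 1)

    // ring is a single-writer, multi-reader Disruptor-style buffer.
    // Every reader sees every published value, unlike a channel.
    type ring struct {
        buf     [size]int
        writer  int64   // next sequence the writer will publish
        readers []int64 // per-reader cursor: next sequence it will read
    }

    // slowest returns the smallest reader cursor; the writer must not
    // overwrite a slot the slowest reader has not consumed yet.
    func (r *ring) slowest() int64 {
        min := atomic.LoadInt64(&r.readers[0])
        for i := 1; i < len(r.readers); i++ {
            if c := atomic.LoadInt64(&r.readers[i]); c < min {
                min = c
            }
        }
        return min
    }

    // publish waits for a free slot, stores the value, then advances
    // the write cursor. Only the single writer goroutine may call it.
    func (r *ring) publish(v int) {
        seq := atomic.LoadInt64(&r.writer)
        for seq-r.slowest() >= size {
            runtime.Gosched() // buffer full: let readers catch up
        }
        r.buf[seq&(size-1)] = v
        atomic.StoreInt64(&r.writer, seq+1) // release slot to readers
    }

    // consume returns the next value for reader id, waiting until the
    // writer has published it. Each reader writes only its own cursor.
    func (r *ring) consume(id int) int {
        seq := atomic.LoadInt64(&r.readers[id])
        for seq >= atomic.LoadInt64(&r.writer) {
            runtime.Gosched() // nothing new published yet
        }
        v := r.buf[seq&(size-1)]
        atomic.StoreInt64(&r.readers[id], seq+1)
        return v
    }

    func main() {
        r := &ring{readers: make([]int64, 2)}
        var wg sync.WaitGroup
        for id := 0; id < 2; id++ {
            wg.Add(1)
            go func(id int) { // each reader receives all four values
                defer wg.Done()
                for n := 0; n < 4; n++ {
                    fmt.Println("reader", id, "got", r.consume(id))
                }
            }(id)
        }
        for n := 1; n <= 4; n++ {
            r.publish(n)
        }
        wg.Wait()
    }

If I read the paper right, the win is that each goroutine only ever writes its own cursor, so the hot path is a couple of atomic loads and one atomic store per event instead of one channel send per subscriber.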
> The problem with channels is that they are, inherently, one-shot. If a
> message is consumed, it is gone from the channel. In a PubSub solution, you
> really want multiple readers to process the same (immutable) data and flag
> it when it is all consumed. Message brokers and SMTP servers often optimize
> delivery of this kind by keeping the payload separate from the header
> (envelope): better not to pay for the contents more than once. But you
> still have to send a message to each subscriber, and that gets rather
> expensive. The Disruptor pattern solves this.
>
> On the other hand, the Disruptor only works if you happen to have shared
> memory. In large distributed systems, the pattern tends to break down, and
> other methods must be used, for instance a cluster of disruptors. But then
> keeping consistency in check becomes a challenge of the fun kind :)
>
> [0] http://lmax-exchange.github.io/disruptor/files/Disruptor-1.0.pdf -
> Thompson, Farley, Barker, Gee, Stewart, 2011
>
> On Mon, Jul 17, 2017 at 7:46 AM Zohaib Sibte Hassan <zohaib...@gmail.com> wrote:
>
>> Awesome :) I got similar results. So channels are slower for pubsub. I am
>> surprised so far, with the efficiency going down as the cores go up (heck,
>> reminds me of that Node.js single-core meme). Please tell me I am wrong
>> here; is there any other, more efficient approach?
>>
>> On Sunday, July 16, 2017 at 8:06:44 PM UTC-7, peterGo wrote:
>>>
>>> Your latest benchmarks are invalid.
>>>
>>> In your benchmarks, replace b.StartTimer() with b.ResetTimer().
>>>
>>> Simply run benchmarks. Don't run race detectors. Don't run profilers.
>>> For example:
>>>
>>> $ go version
>>> go version devel +504deee Sun Jul 16 03:57:11 2017 +0000 linux/amd64
>>> $ go test -run=! -bench=. -benchmem -cpu=1,2,4,8 pubsub_test.go
>>> goos: linux
>>> goarch: amd64
>>> BenchmarkPubSubPrimitiveChannelsMultiple      20000     86328 ns/op    61 B/op    2 allocs/op
>>> BenchmarkPubSubPrimitiveChannelsMultiple-2    30000     50844 ns/op    54 B/op    2 allocs/op
>>> BenchmarkPubSubPrimitiveChannelsMultiple-4    10000    112833 ns/op    83 B/op    2 allocs/op
>>> BenchmarkPubSubPrimitiveChannelsMultiple-8    10000    160011 ns/op    88 B/op    2 allocs/op
>>> BenchmarkPubSubWaitGroupMultiple             100000     21231 ns/op    40 B/op    2 allocs/op
>>> BenchmarkPubSubWaitGroupMultiple-2            10000    107165 ns/op    46 B/op    2 allocs/op
>>> BenchmarkPubSubWaitGroupMultiple-4            20000     73235 ns/op    43 B/op    2 allocs/op
>>> BenchmarkPubSubWaitGroupMultiple-8            20000     82917 ns/op    42 B/op    2 allocs/op
>>> PASS
>>> ok      command-line-arguments  15.481s
>>> $
>>>
>>> Peter
>>>
>>> On Sunday, July 16, 2017 at 9:51:38 PM UTC-4, Zohaib Sibte Hassan wrote:
>>>>
>>>> Thanks for pointing the issues out. I updated my code to get rid of the
>>>> race conditions (nothing critical; it was always a reader-writer race).
>>>> Anyhow, I updated my code at
>>>> https://gist.github.com/maxpert/f3c405c516ba2d4c8aa8b0695e0e054e.
>>>> It still doesn't explain the new results:
>>>>
>>>> $ go test -race -run=! -bench=. -benchmem -cpu=1,2,4,8 -cpuprofile=cpu.out -memprofile=mem.out pubsub_test.go
>>>> BenchmarkPubSubPrimitiveChannelsMultiple       50    21121694 ns/op    8515 B/op    39 allocs/op
>>>> BenchmarkPubSubPrimitiveChannelsMultiple-2    100    19302372 ns/op    4277 B/op    20 allocs/op
>>>> BenchmarkPubSubPrimitiveChannelsMultiple-4     50    22674769 ns/op    8182 B/op    35 allocs/op
>>>> BenchmarkPubSubPrimitiveChannelsMultiple-8     50    21201533 ns/op    8469 B/op    38 allocs/op
>>>> BenchmarkPubSubWaitGroupMultiple             3000      501804 ns/op      63 B/op     2 allocs/op
>>>> BenchmarkPubSubWaitGroupMultiple-2            200    15417944 ns/op     407 B/op     6 allocs/op
>>>> BenchmarkPubSubWaitGroupMultiple-4            300     5010273 ns/op     231 B/op     4 allocs/op
>>>> BenchmarkPubSubWaitGroupMultiple-8            200     5444634 ns/op     334 B/op     5 allocs/op
>>>> PASS
>>>> ok      command-line-arguments  21.775s
>>>>
>>>> So far my testing shows channels are slower for the pubsub scenario. I
>>>> tried looking into pprof dumps of memory and CPU and it's not making
>>>> sense to me. What am I missing here?
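(Side note for anyone skimming this thread later: the b.ResetTimer() change Peter suggests above looks roughly like the following. This is a minimal illustration with stand-in types, not the actual test from the gist.)

    package pubsub_test

    import "testing"

    // pubSub and newPubSub are stand-ins for whatever implementation the
    // benchmark measures; pretend the constructor does expensive setup.
    type pubSub struct{}

    func newPubSub() *pubSub         { return &pubSub{} }
    func (p *pubSub) Publish(string) {}

    func BenchmarkPubSub(b *testing.B) {
        ps := newPubSub() // expensive setup, before the timed loop
        b.ResetTimer()    // zero the clock so setup is not measured;
        // b.StartTimer() alone would not, since the timer is already
        // running when the benchmark function is entered
        for i := 0; i < b.N; i++ {
            ps.Publish("hello")
        }
    }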
>>>> On Sunday, July 16, 2017 at 10:27:04 AM UTC-7, peterGo wrote:
>>>>>
>>>>> When you have data races the results are undefined.
>>>>>
>>>>> $ go version
>>>>> go version devel +dd81c37 Sat Jul 15 05:43:45 2017 +0000 linux/amd64
>>>>> $ go test -race -run=! -bench=. -benchmem -cpu=1,2,4,8 pubsub_test.go
>>>>> ==================
>>>>> WARNING: DATA RACE
>>>>> Read at 0x00c4200140c0 by goroutine 18:
>>>>>   command-line-arguments.BenchmarkPubSubPrimitiveChannelsMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:59 +0x51d
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>>
>>>>> Previous write at 0x00c4200140c0 by goroutine 57:
>>>>>   [failed to restore the stack]
>>>>>
>>>>> Goroutine 18 (running) created at:
>>>>>   testing.(*B).run1()
>>>>>       /home/peter/go/src/testing/benchmark.go:207 +0x8c
>>>>>   testing.(*B).Run()
>>>>>       /home/peter/go/src/testing/benchmark.go:513 +0x482
>>>>>   testing.runBenchmarks.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:417 +0xa7
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.runBenchmarks()
>>>>>       /home/peter/go/src/testing/benchmark.go:423 +0x86d
>>>>>   testing.(*M).Run()
>>>>>       /home/peter/go/src/testing/testing.go:928 +0x51e
>>>>>   main.main()
>>>>>       command-line-arguments/_test/_testmain.go:46 +0x1d3
>>>>>
>>>>> Goroutine 57 (finished) created at:
>>>>>   command-line-arguments.BenchmarkPubSubPrimitiveChannelsMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:40 +0x290
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>> ==================
>>>>> --- FAIL: BenchmarkPubSubPrimitiveChannelsMultiple
>>>>>     benchmark.go:147: race detected during execution of benchmark
>>>>> ==================
>>>>> WARNING: DATA RACE
>>>>> Read at 0x00c42000c030 by goroutine 1079:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple.func1()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:76 +0x9e
>>>>>
>>>>> Previous write at 0x00c42000c030 by goroutine 7:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:101 +0x475
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>>
>>>>> Goroutine 1079 (running) created at:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:93 +0x2e6
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>>
>>>>> Goroutine 7 (running) created at:
>>>>>   testing.(*B).run1()
>>>>>       /home/peter/go/src/testing/benchmark.go:207 +0x8c
>>>>>   testing.(*B).Run()
>>>>>       /home/peter/go/src/testing/benchmark.go:513 +0x482
>>>>>   testing.runBenchmarks.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:417 +0xa7
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.runBenchmarks()
>>>>>       /home/peter/go/src/testing/benchmark.go:423 +0x86d
>>>>>   testing.(*M).Run()
>>>>>       /home/peter/go/src/testing/testing.go:928 +0x51e
>>>>>   main.main()
>>>>>       command-line-arguments/_test/_testmain.go:46 +0x1d3
>>>>> ==================
>>>>> ==================
>>>>> WARNING: DATA RACE
>>>>> Write at 0x00c42000c030 by goroutine 7:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:101 +0x475
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>>
>>>>> Previous read at 0x00c42000c030 by goroutine 1078:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple.func1()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:76 +0x9e
>>>>>
>>>>> Goroutine 7 (running) created at:
>>>>>   testing.(*B).run1()
>>>>>       /home/peter/go/src/testing/benchmark.go:207 +0x8c
>>>>>   testing.(*B).Run()
>>>>>       /home/peter/go/src/testing/benchmark.go:513 +0x482
>>>>>   testing.runBenchmarks.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:417 +0xa7
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.runBenchmarks()
>>>>>       /home/peter/go/src/testing/benchmark.go:423 +0x86d
>>>>>   testing.(*M).Run()
>>>>>       /home/peter/go/src/testing/testing.go:928 +0x51e
>>>>>   main.main()
>>>>>       command-line-arguments/_test/_testmain.go:46 +0x1d3
>>>>>
>>>>> Goroutine 1078 (running) created at:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:93 +0x2e6
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>> ==================
>>>>> ==================
>>>>> WARNING: DATA RACE
>>>>> Read at 0x00c4200140c8 by goroutine 7:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:109 +0x51d
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>>
>>>>> Previous write at 0x00c4200140c8 by goroutine 175:
>>>>>   sync/atomic.AddInt64()
>>>>>       /home/peter/go/src/runtime/race_amd64.s:276 +0xb
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple.func1()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:88 +0x19a
>>>>>
>>>>> Goroutine 7 (running) created at:
>>>>>   testing.(*B).run1()
>>>>>       /home/peter/go/src/testing/benchmark.go:207 +0x8c
>>>>>   testing.(*B).Run()
>>>>>       /home/peter/go/src/testing/benchmark.go:513 +0x482
>>>>>   testing.runBenchmarks.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:417 +0xa7
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.runBenchmarks()
>>>>>       /home/peter/go/src/testing/benchmark.go:423 +0x86d
>>>>>   testing.(*M).Run()
>>>>>       /home/peter/go/src/testing/testing.go:928 +0x51e
>>>>>   main.main()
>>>>>       command-line-arguments/_test/_testmain.go:46 +0x1d3
>>>>>
>>>>> Goroutine 175 (finished) created at:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:93 +0x2e6
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>> ==================
>>>>> --- FAIL: BenchmarkPubSubWaitGroupMultiple
>>>>>     benchmark.go:147: race detected during execution of benchmark
>>>>> FAIL
>>>>> exit status 1
>>>>> FAIL    command-line-arguments  0.726s
>>>>> $
>>>>>
>>>>> Peter
>>>>>
>>>>> On Sunday, July 16, 2017 at 10:20:21 AM UTC-4, Zohaib Sibte Hassan wrote:
>>>>>>
>>>>>> I have been spending my day implementing an efficient PubSub system. I
>>>>>> had implemented one before using channels, and I wanted to benchmark
>>>>>> that against sync.Cond.
>>>>>> Here is the quick and dirty test that I put together:
>>>>>> https://gist.github.com/maxpert/f3c405c516ba2d4c8aa8b0695e0e054e.
>>>>>> Now my confusion starts when I change GOMAXPROCS to test how it would
>>>>>> perform on my age-old Raspberry Pi. Here are the results:
>>>>>>
>>>>>> mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=8 go test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out -benchmem
>>>>>> BenchmarkPubSubPrimitiveChannelsMultiple-8     10000    165419 ns/op    92 B/op    2 allocs/op
>>>>>> BenchmarkPubSubWaitGroupMultiple-8             10000    204685 ns/op    53 B/op    2 allocs/op
>>>>>> PASS
>>>>>> ok      sibte.so/rascore    3.749s
>>>>>> mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=4 go test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out -benchmem
>>>>>> BenchmarkPubSubPrimitiveChannelsMultiple-4     20000    101704 ns/op    60 B/op    2 allocs/op
>>>>>> BenchmarkPubSubWaitGroupMultiple-4             10000    204039 ns/op    52 B/op    2 allocs/op
>>>>>> PASS
>>>>>> ok      sibte.so/rascore    5.087s
>>>>>> mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=2 go test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out -benchmem
>>>>>> BenchmarkPubSubPrimitiveChannelsMultiple-2     30000     51255 ns/op    54 B/op    2 allocs/op
>>>>>> BenchmarkPubSubWaitGroupMultiple-2             20000     60871 ns/op    43 B/op    2 allocs/op
>>>>>> PASS
>>>>>> ok      sibte.so/rascore    4.022s
>>>>>> mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=1 go test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out -benchmem
>>>>>> BenchmarkPubSubPrimitiveChannelsMultiple       20000     79534 ns/op    61 B/op    2 allocs/op
>>>>>> BenchmarkPubSubWaitGroupMultiple              100000     19066 ns/op    40 B/op    2 allocs/op
>>>>>> PASS
>>>>>> ok      sibte.so/rascore    4.502s
>>>>>>
>>>>>> I tried multiple times and the results are consistent. I am using Go
>>>>>> 1.8, Linux x64, 8GB RAM. I have multiple questions:
>>>>>>
>>>>>> - Why do channels perform worse than sync.Cond in the single-core
>>>>>> results? Context switching is the same; if anything, it should perform
>>>>>> worse.
>>>>>> - As I increase the max procs, the sync.Cond results going down might
>>>>>> be explainable, but what is up with channels? 20k to 30k to 20k to
>>>>>> 10k :( I have an i5 with 4 cores, so it should have peaked at 4 procs
>>>>>> (psst: I tried 3 as well; it's consistent).
>>>>>>
>>>>>> I still suspect I am making some kind of mistake in the code. Any
>>>>>> ideas?
>>>>>>
>>>>>> - Thanks