Your benchmarks are not really *doing* anything once they receive data. This
means that contention on (internal or external) locks is going to be hit all
the time. In turn, you are likely measuring how your hardware performs under
excessive lock contention. This hypothesis can be verified with block
profiling.

Were you to process something before going back to the barrier or channel,
you would be less likely to hit the problem. Be aware, though, that sometimes
your cores will still coordinate, because the work they do tends to make them
arrive back at the lock at the same point in time, a problem similar to TCP
incast.

As you go from a single core to multiple cores, an important change happens
in the system. On a single core, the hardware doesn't need to coordinate
across cores. This is much faster than the case where the hardware must tell
other physical cores what is going on, especially under contention. In short,
there is a gap between a single-core system and a multicore system due to
communication overhead. In high-performance computing, a common mistake is to
measure the *same* implementation in both the serial and the parallel case,
even though the serial case can be made to avoid taking locks and so on. A
fair benchmark would compare the fastest single-core implementation with the
fastest multi-core one.

The upshot of having access to multiple cores is that you can get them all
to do work and this can lead to a substantial speedup. But there has to be
work to do. Your benchmark has almost no work to do, so most of the cores
are just competing against each other.

For a PubSub implementation you need to think about the effort involved in
the implementation. Up to a point, I think a channel-based or barrier-based
solution is adequate. For systems where you need low latency, go for a
pattern like the Disruptor pattern [0], which should be doable in Go. This
pattern works by a ring buffer of events and two kinds of "clock hands". One
hand is the point of insertion of new events; the other hand(s) track where
the readers currently are in their processing. By tracking the hands through
atomics, you can cut messaging overhead down by a lot. The trick is that each
thread writes only to its own hand, and other threads read an eventually
consistent snapshot of it. As long as the buffer doesn't fill up completely,
an eventually consistent view of an ever-increasing counter is enough.
Reading an older value is not a big deal, as long as the value is never
undefined.

The problem with channels is that they are inherently one-shot. Once a
message is consumed, it is gone from the channel. In a PubSub solution, you
really want multiple readers to process the same (immutable) data and flag it
when everyone has consumed it. Message brokers and SMTP servers often
optimize this kind of delivery by keeping the payload separate from the
header (envelope): better not to pay for the contents more than once. But you
still have to send a message to each subscriber, and this gets rather
expensive. The Disruptor pattern solves this.

On the other hand, the Disruptor only works if you have shared memory. In
large distributed systems the pattern tends to break down and other methods
must be used, for instance a cluster of Disruptors. But then keeping
consistency in check becomes a challenge of the fun kind :)

[0] http://lmax-exchange.github.io/disruptor/files/Disruptor-1.0.pdf -
Thompson, Farley, Barker, Gee, Stewart, 2011

On Mon, Jul 17, 2017 at 7:46 AM Zohaib Sibte Hassan <zohaib.has...@gmail.com>
wrote:

> Awesome :) got similar results. So channels are slower for pubsub. I am
> surprised so far, with the efficiency going down as the cores go up (heck
> reminds me of that Node.js single core meme). Please tell me I am wrong
> here, is there any other more efficient approach?
>
> On Sunday, July 16, 2017 at 8:06:44 PM UTC-7, peterGo wrote:
>>
>> Your latest benchmarks are invalid.
>>
>> In your benchmarks, replace b.StartTimer() with b.ResetTimer().
>>
>> Simply run benchmarks. Don't run race detectors. Don't run profilers. For
>> example,
>>
>> $ go version
>> go version devel +504deee Sun Jul 16 03:57:11 2017 +0000 linux/amd64
>> $ go test -run=! -bench=. -benchmem -cpu=1,2,4,8 pubsub_test.go
>> goos: linux
>> goarch: amd64
>> BenchmarkPubSubPrimitiveChannelsMultiple            20000         86328
>> ns/op          61 B/op           2 allocs/op
>> BenchmarkPubSubPrimitiveChannelsMultiple-2          30000         50844
>> ns/op          54 B/op           2 allocs/op
>> BenchmarkPubSubPrimitiveChannelsMultiple-4          10000        112833
>> ns/op          83 B/op           2 allocs/op
>> BenchmarkPubSubPrimitiveChannelsMultiple-8          10000        160011
>> ns/op          88 B/op           2 allocs/op
>> BenchmarkPubSubWaitGroupMultiple                   100000         21231
>> ns/op          40 B/op           2 allocs/op
>> BenchmarkPubSubWaitGroupMultiple-2                  10000        107165
>> ns/op          46 B/op           2 allocs/op
>> BenchmarkPubSubWaitGroupMultiple-4                  20000         73235
>> ns/op          43 B/op           2 allocs/op
>> BenchmarkPubSubWaitGroupMultiple-8                  20000         82917
>> ns/op          42 B/op           2 allocs/op
>> PASS
>> ok      command-line-arguments    15.481s
>> $
>>
>> Peter
>>
>> On Sunday, July 16, 2017 at 9:51:38 PM UTC-4, Zohaib Sibte Hassan wrote:
>>>
>>> Thanks for pointing issues out I updated my code to get rid of race
>>> conditions (nothing critical I was always doing reader-writer race). Anyhow
>>> I updated my code on
>>> https://gist.github.com/maxpert/f3c405c516ba2d4c8aa8b0695e0e054e. Still
>>> doesn't explain the new results:
>>>
>>> $> go test -race -run=! -bench=. -benchmem -cpu=1,2,4,8
>>> -cpuprofile=cpu.out -memprofile=mem.out pubsub_test.go
>>> BenchmarkPubSubPrimitiveChannelsMultiple          50  21121694 ns/op
>>>  8515 B/op      39 allocs/op
>>> BenchmarkPubSubPrimitiveChannelsMultiple-2       100  19302372 ns/op
>>>  4277 B/op      20 allocs/op
>>> BenchmarkPubSubPrimitiveChannelsMultiple-4        50  22674769 ns/op
>>>  8182 B/op      35 allocs/op
>>> BenchmarkPubSubPrimitiveChannelsMultiple-8        50  21201533 ns/op
>>>  8469 B/op      38 allocs/op
>>> BenchmarkPubSubWaitGroupMultiple                3000    501804 ns/op
>>>    63 B/op       2 allocs/op
>>> BenchmarkPubSubWaitGroupMultiple-2               200  15417944 ns/op
>>>   407 B/op       6 allocs/op
>>> BenchmarkPubSubWaitGroupMultiple-4               300   5010273 ns/op
>>>   231 B/op       4 allocs/op
>>> BenchmarkPubSubWaitGroupMultiple-8               200   5444634 ns/op
>>>   334 B/op       5 allocs/op
>>> PASS
>>> ok   command-line-arguments 21.775s
>>>
>>> So far my testing shows channels are slower for pubsub scenario. I tried
>>> looking into pprof dumps of memory and CPU and it's not making sense to me.
>>> What am I missing here?
>>>
>>> On Sunday, July 16, 2017 at 10:27:04 AM UTC-7, peterGo wrote:
>>>>
>>>> When you have data races the results are undefined.
>>>>
>>>> $ go version
>>>> go version devel +dd81c37 Sat Jul 15 05:43:45 2017 +0000 linux/amd64
>>>> $ go test -race -run=! -bench=. -benchmem -cpu=1,2,4,8 pubsub_test.go
>>>> ==================
>>>> WARNING: DATA RACE
>>>> Read at 0x00c4200140c0 by goroutine 18:
>>>>   command-line-arguments.BenchmarkPubSubPrimitiveChannelsMultiple()
>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:59 +0x51d
>>>>   testing.(*B).runN()
>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>   testing.(*B).run1.func1()
>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>
>>>> Previous write at 0x00c4200140c0 by goroutine 57:
>>>>   [failed to restore the stack]
>>>>
>>>> Goroutine 18 (running) created at:
>>>>   testing.(*B).run1()
>>>>       /home/peter/go/src/testing/benchmark.go:207 +0x8c
>>>>   testing.(*B).Run()
>>>>       /home/peter/go/src/testing/benchmark.go:513 +0x482
>>>>   testing.runBenchmarks.func1()
>>>>       /home/peter/go/src/testing/benchmark.go:417 +0xa7
>>>>   testing.(*B).runN()
>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>   testing.runBenchmarks()
>>>>       /home/peter/go/src/testing/benchmark.go:423 +0x86d
>>>>   testing.(*M).Run()
>>>>       /home/peter/go/src/testing/testing.go:928 +0x51e
>>>>   main.main()
>>>>       command-line-arguments/_test/_testmain.go:46 +0x1d3
>>>>
>>>> Goroutine 57 (finished) created at:
>>>>   command-line-arguments.BenchmarkPubSubPrimitiveChannelsMultiple()
>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:40 +0x290
>>>>   testing.(*B).runN()
>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>   testing.(*B).run1.func1()
>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>> ==================
>>>> --- FAIL: BenchmarkPubSubPrimitiveChannelsMultiple
>>>>     benchmark.go:147: race detected during execution of benchmark
>>>> ==================
>>>> WARNING: DATA RACE
>>>> Read at 0x00c42000c030 by goroutine 1079:
>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple.func1()
>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:76 +0x9e
>>>>
>>>> Previous write at 0x00c42000c030 by goroutine 7:
>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:101 +0x475
>>>>   testing.(*B).runN()
>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>   testing.(*B).run1.func1()
>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>
>>>> Goroutine 1079 (running) created at:
>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:93 +0x2e6
>>>>   testing.(*B).runN()
>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>   testing.(*B).run1.func1()
>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>
>>>> Goroutine 7 (running) created at:
>>>>   testing.(*B).run1()
>>>>       /home/peter/go/src/testing/benchmark.go:207 +0x8c
>>>>   testing.(*B).Run()
>>>>       /home/peter/go/src/testing/benchmark.go:513 +0x482
>>>>   testing.runBenchmarks.func1()
>>>>       /home/peter/go/src/testing/benchmark.go:417 +0xa7
>>>>   testing.(*B).runN()
>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>   testing.runBenchmarks()
>>>>       /home/peter/go/src/testing/benchmark.go:423 +0x86d
>>>>   testing.(*M).Run()
>>>>       /home/peter/go/src/testing/testing.go:928 +0x51e
>>>>   main.main()
>>>>       command-line-arguments/_test/_testmain.go:46 +0x1d3
>>>> ==================
>>>> ==================
>>>> WARNING: DATA RACE
>>>> Write at 0x00c42000c030 by goroutine 7:
>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:101 +0x475
>>>>   testing.(*B).runN()
>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>   testing.(*B).run1.func1()
>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>
>>>> Previous read at 0x00c42000c030 by goroutine 1078:
>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple.func1()
>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:76 +0x9e
>>>>
>>>> Goroutine 7 (running) created at:
>>>>   testing.(*B).run1()
>>>>       /home/peter/go/src/testing/benchmark.go:207 +0x8c
>>>>   testing.(*B).Run()
>>>>       /home/peter/go/src/testing/benchmark.go:513 +0x482
>>>>   testing.runBenchmarks.func1()
>>>>       /home/peter/go/src/testing/benchmark.go:417 +0xa7
>>>>   testing.(*B).runN()
>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>   testing.runBenchmarks()
>>>>       /home/peter/go/src/testing/benchmark.go:423 +0x86d
>>>>   testing.(*M).Run()
>>>>       /home/peter/go/src/testing/testing.go:928 +0x51e
>>>>   main.main()
>>>>       command-line-arguments/_test/_testmain.go:46 +0x1d3
>>>>
>>>> Goroutine 1078 (running) created at:
>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:93 +0x2e6
>>>>   testing.(*B).runN()
>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>   testing.(*B).run1.func1()
>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>> ==================
>>>> ==================
>>>> WARNING: DATA RACE
>>>> Read at 0x00c4200140c8 by goroutine 7:
>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:109 +0x51d
>>>>   testing.(*B).runN()
>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>   testing.(*B).run1.func1()
>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>
>>>> Previous write at 0x00c4200140c8 by goroutine 175:
>>>>   sync/atomic.AddInt64()
>>>>       /home/peter/go/src/runtime/race_amd64.s:276 +0xb
>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple.func1()
>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:88 +0x19a
>>>>
>>>> Goroutine 7 (running) created at:
>>>>   testing.(*B).run1()
>>>>       /home/peter/go/src/testing/benchmark.go:207 +0x8c
>>>>   testing.(*B).Run()
>>>>       /home/peter/go/src/testing/benchmark.go:513 +0x482
>>>>   testing.runBenchmarks.func1()
>>>>       /home/peter/go/src/testing/benchmark.go:417 +0xa7
>>>>   testing.(*B).runN()
>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>   testing.runBenchmarks()
>>>>       /home/peter/go/src/testing/benchmark.go:423 +0x86d
>>>>   testing.(*M).Run()
>>>>       /home/peter/go/src/testing/testing.go:928 +0x51e
>>>>   main.main()
>>>>       command-line-arguments/_test/_testmain.go:46 +0x1d3
>>>>
>>>> Goroutine 175 (finished) created at:
>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:93 +0x2e6
>>>>   testing.(*B).runN()
>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>   testing.(*B).run1.func1()
>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>> ==================
>>>> --- FAIL: BenchmarkPubSubWaitGroupMultiple
>>>>     benchmark.go:147: race detected during execution of benchmark
>>>> FAIL
>>>> exit status 1
>>>> FAIL    command-line-arguments    0.726s
>>>> $
>>>>
>>>> Peter
>>>>
>>>> On Sunday, July 16, 2017 at 10:20:21 AM UTC-4, Zohaib Sibte Hassan
>>>> wrote:
>>>>>
>>>>> I have been spending my day over implementing an efficient PubSub
>>>>> system. I had implemented one before using channels, and I wanted to
>>>>> benchmark that against sync.Cond. Here is the quick and dirty test that I
>>>>> put together
>>>>> https://gist.github.com/maxpert/f3c405c516ba2d4c8aa8b0695e0e054e. Now
>>>>> my confusion starts when I change GOMAXPROCS to test how it would perform
>>>>> on my age old Raspberry Pi. Here are results:
>>>>>
>>>>> mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=8 go
>>>>> test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out
>>>>> -benchmem
>>>>> BenchmarkPubSubPrimitiveChannelsMultiple-8     10000    165419 ns/op
>>>>>    92 B/op       2 allocs/op
>>>>> BenchmarkPubSubWaitGroupMultiple-8             10000    204685 ns/op
>>>>>    53 B/op       2 allocs/op
>>>>> PASS
>>>>> ok   sibte.so/rascore 3.749s
>>>>> mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=4 go
>>>>> test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out
>>>>> -benchmem
>>>>> BenchmarkPubSubPrimitiveChannelsMultiple-4     20000    101704 ns/op
>>>>>    60 B/op       2 allocs/op
>>>>> BenchmarkPubSubWaitGroupMultiple-4             10000    204039 ns/op
>>>>>    52 B/op       2 allocs/op
>>>>> PASS
>>>>> ok   sibte.so/rascore 5.087s
>>>>> mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=2 go
>>>>> test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out
>>>>> -benchmem
>>>>> BenchmarkPubSubPrimitiveChannelsMultiple-2     30000     51255 ns/op
>>>>>    54 B/op       2 allocs/op
>>>>> BenchmarkPubSubWaitGroupMultiple-2             20000     60871 ns/op
>>>>>    43 B/op       2 allocs/op
>>>>> PASS
>>>>> ok   sibte.so/rascore 4.022s
>>>>> mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=1 go
>>>>> test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out
>>>>> -benchmem
>>>>> BenchmarkPubSubPrimitiveChannelsMultiple   20000     79534 ns/op
>>>>>  61 B/op       2 allocs/op
>>>>> BenchmarkPubSubWaitGroupMultiple          100000     19066 ns/op
>>>>>  40 B/op       2 allocs/op
>>>>> PASS
>>>>> ok   sibte.so/rascore 4.502s
>>>>>
>>>>>  I tried multiple times and results are consistent. I am using Go 1.8,
>>>>> Linux x64, 8GB RAM. I have multiple questions:
>>>>>
>>>>>
>>>>>    - Why do channels perform worst than sync.Cond in single core
>>>>>    results? Context switching is same if anything it should perform worst.
>>>>>    - As I increase the max procs the sync.Cond results go down which
>>>>>    might be explainable, but what is up with channels? 20k to 30k to 20k 
>>>>> to
>>>>>     10k :( I have a i5 with 4 cores, so it should have peaked at 4 procs 
>>>>> (pst.
>>>>>    I tried 3 as well it's consistent).
>>>>>
>>>>>  I am still suspicious I am not making some kind of mistake in code.
>>>>> Any ideas?
>>>>>
>>>>> - Thanks
>>>>>
>>>> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
