Hey Jesper,

Thanks for such an insightful reply. The scenario I am trying to implement is basically chat: with huge chat rooms (thousands of people), all the listener goroutines run into exactly the situation you describe, which is what I am benchmarking here. To elaborate a little more, right now each user in a chat room is a goroutine waiting for a message on a channel (shameless plug: the project is available at http://raspchat.com/). This setup runs on a memory-constrained Raspberry Pi with 512MB of RAM and a few cores. What I noticed when channels were flooded by people from 4chan was excessive CPU and memory usage, which led me to believe that the channels are probably the bottleneck. So yes, I am running on a shared-memory system for now, and the Disruptor pattern sounds like an interesting approach I would love to try.
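For context, the delivery path today is essentially one goroutine and one buffered channel per connected user, with the room looping over every subscriber channel for each incoming message. A stripped-down sketch of that shape (identifiers here are illustrative, not the actual raspchat code):

    package main

    import "fmt"

    // Room fans a message out to one channel per subscriber. This is a
    // simplified illustration of the current design, not the real code.
    type Room struct {
        subscribers map[string]chan string
    }

    // Publish does one channel send per subscriber, so a flooded room
    // with thousands of users pays O(subscribers) per message.
    func (r *Room) Publish(msg string) {
        for _, ch := range r.subscribers {
            select {
            case ch <- msg: // deliver if the reader's buffer has room
            default: // never block the room on a slow reader
            }
        }
    }

    func main() {
        r := &Room{subscribers: map[string]chan string{
            "alice": make(chan string, 8),
            "bob":   make(chan string, 8),
        }}
        r.Publish("hello")
        fmt.Println(<-r.subscribers["alice"], <-r.subscribers["bob"])
    }

So every publish costs a channel send per subscriber, which matches the CPU spike I see when a big room gets flooded.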
Thanks for the idea, and I will follow up with my findings here.

On Monday, July 17, 2017 at 10:23:04 AM UTC-7, Jesper Louis Andersen wrote:
>
> Your benchmarks are not really *doing* something once they get data. This
> means that eventual conflicts on (internal or external) locks are going to
> be hit all the time. In turn, you are likely to measure the performance of
> your hardware in the event of excessive lock contention (this hypothesis
> can be verified: block profiling).
>
> Were you to process something before going back to the barrier or channel,
> you would be less likely to hit the problem. Though you should be aware that
> sometimes your cores will coordinate (because the work they do will tend to
> bring them back to the lock at the same point in time, a problem similar
> to TCP incast).
>
> As you go from a single core to multiple cores, an important event happens
> in the system. In a single-core system, you don't need the hardware to
> coordinate. This is much faster than the case where the hardware might need
> to tell other physical cores about what is going on, especially under
> contention. In short, there is a gap between the single-core system and a
> multicore system due to communication overhead. In high-performance
> computing a common mistake is to measure the *same* implementation in the
> serial and parallel case, even though the serial case can be made to avoid
> taking locks and so on. A fair benchmark would compare the fastest
> single-core implementation with the fastest multi-core ditto.
>
> The upshot of having access to multiple cores is that you can get them all
> to do work, and this can lead to a substantial speedup. But there has to be
> work to do. Your benchmark has almost no work to do, so most of the cores
> are just competing against each other.
>
> For a PubSub implementation you need to think about the effort involved in
> the implementation. Up to a point, I think a channel-based solution or a
> barrier-based solution is adequate. For systems where you need low latency,
> go for a pattern like the Disruptor pattern [0], which should be doable in
> Go. This pattern works with a ring buffer of events and two types of "clock
> hands". One hand is the point of insertion of a new event; the other
> hand(s) track where the readers currently are in their processing. By
> tracking the hands through atomics, you can get messaging overhead down by
> a lot. The trick is that each thread writes to its own hand and other
> threads can read an eventually consistent snapshot. As long as the buffer
> doesn't fill up completely, the eventually consistent view of an
> ever-increasing counter is enough. Getting an older value is not a big
> deal, as long as the value isn't undefined.
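To check my understanding of the two kinds of "clock hands" before trying this on real traffic, here is a rough, untested sketch of a single-writer, multi-reader ring buffer in Go. All names and details are mine, not from the paper, and a production version would park or back off instead of busy-spinning:

    package main

    import (
        "fmt"
        "runtime"
        "sync"
        "sync/atomic"
    )

    const size = 8 // power of two, so slot index is seq & (size - 1)

    // ring is a single-writer, multi-reader Disruptor-style buffer.
    // Every reader sees every published value, unlike a channel.
    type ring struct {
        buf     [size]int
        writer  int64   // next sequence the writer will publish
        readers []int64 // per-reader cursor: next sequence it will read
    }

    // slowest returns the smallest reader cursor; the writer must not
    // overwrite a slot the slowest reader has not consumed yet.
    func (r *ring) slowest() int64 {
        min := atomic.LoadInt64(&r.readers[0])
        for i := 1; i < len(r.readers); i++ {
            if c := atomic.LoadInt64(&r.readers[i]); c < min {
                min = c
            }
        }
        return min
    }

    // publish waits for a free slot, stores the value, then advances
    // the write cursor. Only the single writer goroutine may call it.
    func (r *ring) publish(v int) {
        seq := atomic.LoadInt64(&r.writer)
        for seq-r.slowest() >= size {
            runtime.Gosched() // buffer full: let readers catch up
        }
        r.buf[seq&(size-1)] = v
        atomic.StoreInt64(&r.writer, seq+1) // release slot to readers
    }

    // consume returns the next value for reader id, waiting until the
    // writer has published it. Each reader writes only its own cursor.
    func (r *ring) consume(id int) int {
        seq := atomic.LoadInt64(&r.readers[id])
        for seq >= atomic.LoadInt64(&r.writer) {
            runtime.Gosched() // nothing new published yet
        }
        v := r.buf[seq&(size-1)]
        atomic.StoreInt64(&r.readers[id], seq+1)
        return v
    }

    func main() {
        r := &ring{readers: make([]int64, 2)}
        var wg sync.WaitGroup
        for id := 0; id < 2; id++ {
            wg.Add(1)
            go func(id int) { // each reader receives all four values
                defer wg.Done()
                for n := 0; n < 4; n++ {
                    fmt.Println("reader", id, "got", r.consume(id))
                }
            }(id)
        }
        for n := 1; n <= 4; n++ {
            r.publish(n)
        }
        wg.Wait()
    }

If I read the paper right, the win is that each goroutine only ever writes its own cursor, so the hot path is a couple of atomic loads and one atomic store per event instead of one channel send per subscriber.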
> The problem with channels is that they are, inherently, one-shot. If a
> message is consumed, it is gone from the channel. In a PubSub solution, you
> really want multiple readers to process the same (immutable) data and flag
> it when it is all consumed. Message brokers and SMTP servers often optimize
> delivery of this kind by keeping the payload separate from the header
> (envelope): better not to pay for the contents more than once. But you
> still have to send a message to each subscriber, and that gets rather
> expensive. The Disruptor pattern solves this.
>
> On the other hand, the Disruptor only works if you happen to have shared
> memory. In large distributed systems, the pattern tends to break down, and
> other methods must be used, for instance a cluster of disruptors. But then
> keeping consistency in check becomes a challenge of the fun kind :)
>
> [0] http://lmax-exchange.github.io/disruptor/files/Disruptor-1.0.pdf -
> Thompson, Farley, Barker, Gee, Stewart, 2011
>
> On Mon, Jul 17, 2017 at 7:46 AM Zohaib Sibte Hassan <zohaib...@gmail.com> wrote:
>
>> Awesome :) I got similar results. So channels are slower for pubsub. I am
>> surprised so far, with the efficiency going down as the cores go up (heck,
>> reminds me of that Node.js single-core meme). Please tell me I am wrong
>> here; is there any other, more efficient approach?
>>
>> On Sunday, July 16, 2017 at 8:06:44 PM UTC-7, peterGo wrote:
>>>
>>> Your latest benchmarks are invalid.
>>>
>>> In your benchmarks, replace b.StartTimer() with b.ResetTimer().
>>>
>>> Simply run benchmarks. Don't run race detectors. Don't run profilers.
>>> For example:
>>>
>>> $ go version
>>> go version devel +504deee Sun Jul 16 03:57:11 2017 +0000 linux/amd64
>>> $ go test -run=! -bench=. -benchmem -cpu=1,2,4,8 pubsub_test.go
>>> goos: linux
>>> goarch: amd64
>>> BenchmarkPubSubPrimitiveChannelsMultiple      20000     86328 ns/op    61 B/op    2 allocs/op
>>> BenchmarkPubSubPrimitiveChannelsMultiple-2    30000     50844 ns/op    54 B/op    2 allocs/op
>>> BenchmarkPubSubPrimitiveChannelsMultiple-4    10000    112833 ns/op    83 B/op    2 allocs/op
>>> BenchmarkPubSubPrimitiveChannelsMultiple-8    10000    160011 ns/op    88 B/op    2 allocs/op
>>> BenchmarkPubSubWaitGroupMultiple             100000     21231 ns/op    40 B/op    2 allocs/op
>>> BenchmarkPubSubWaitGroupMultiple-2            10000    107165 ns/op    46 B/op    2 allocs/op
>>> BenchmarkPubSubWaitGroupMultiple-4            20000     73235 ns/op    43 B/op    2 allocs/op
>>> BenchmarkPubSubWaitGroupMultiple-8            20000     82917 ns/op    42 B/op    2 allocs/op
>>> PASS
>>> ok      command-line-arguments  15.481s
>>> $
>>>
>>> Peter
>>>
>>> On Sunday, July 16, 2017 at 9:51:38 PM UTC-4, Zohaib Sibte Hassan wrote:
>>>>
>>>> Thanks for pointing the issues out. I updated my code to get rid of the
>>>> race conditions (nothing critical; it was always a reader-writer race).
>>>> Anyhow, I updated my code at
>>>> https://gist.github.com/maxpert/f3c405c516ba2d4c8aa8b0695e0e054e.
>>>> It still doesn't explain the new results:
>>>>
>>>> $ go test -race -run=! -bench=. -benchmem -cpu=1,2,4,8 -cpuprofile=cpu.out -memprofile=mem.out pubsub_test.go
>>>> BenchmarkPubSubPrimitiveChannelsMultiple       50    21121694 ns/op    8515 B/op    39 allocs/op
>>>> BenchmarkPubSubPrimitiveChannelsMultiple-2    100    19302372 ns/op    4277 B/op    20 allocs/op
>>>> BenchmarkPubSubPrimitiveChannelsMultiple-4     50    22674769 ns/op    8182 B/op    35 allocs/op
>>>> BenchmarkPubSubPrimitiveChannelsMultiple-8     50    21201533 ns/op    8469 B/op    38 allocs/op
>>>> BenchmarkPubSubWaitGroupMultiple             3000      501804 ns/op      63 B/op     2 allocs/op
>>>> BenchmarkPubSubWaitGroupMultiple-2            200    15417944 ns/op     407 B/op     6 allocs/op
>>>> BenchmarkPubSubWaitGroupMultiple-4            300     5010273 ns/op     231 B/op     4 allocs/op
>>>> BenchmarkPubSubWaitGroupMultiple-8            200     5444634 ns/op     334 B/op     5 allocs/op
>>>> PASS
>>>> ok      command-line-arguments  21.775s
>>>>
>>>> So far my testing shows channels are slower for the pubsub scenario. I
>>>> tried looking into pprof dumps of memory and CPU and it's not making
>>>> sense to me. What am I missing here?
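(Side note for anyone skimming this thread later: the b.ResetTimer() change Peter suggests above looks roughly like the following. This is a minimal illustration with stand-in types, not the actual test from the gist.)

    package pubsub_test

    import "testing"

    // pubSub and newPubSub are stand-ins for whatever implementation the
    // benchmark measures; pretend the constructor does expensive setup.
    type pubSub struct{}

    func newPubSub() *pubSub         { return &pubSub{} }
    func (p *pubSub) Publish(string) {}

    func BenchmarkPubSub(b *testing.B) {
        ps := newPubSub() // expensive setup, before the timed loop
        b.ResetTimer()    // zero the clock so setup is not measured;
        // b.StartTimer() alone would not, since the timer is already
        // running when the benchmark function is entered
        for i := 0; i < b.N; i++ {
            ps.Publish("hello")
        }
    }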
>>>> On Sunday, July 16, 2017 at 10:27:04 AM UTC-7, peterGo wrote:
>>>>>
>>>>> When you have data races the results are undefined.
>>>>>
>>>>> $ go version
>>>>> go version devel +dd81c37 Sat Jul 15 05:43:45 2017 +0000 linux/amd64
>>>>> $ go test -race -run=! -bench=. -benchmem -cpu=1,2,4,8 pubsub_test.go
>>>>> ==================
>>>>> WARNING: DATA RACE
>>>>> Read at 0x00c4200140c0 by goroutine 18:
>>>>>   command-line-arguments.BenchmarkPubSubPrimitiveChannelsMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:59 +0x51d
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>>
>>>>> Previous write at 0x00c4200140c0 by goroutine 57:
>>>>>   [failed to restore the stack]
>>>>>
>>>>> Goroutine 18 (running) created at:
>>>>>   testing.(*B).run1()
>>>>>       /home/peter/go/src/testing/benchmark.go:207 +0x8c
>>>>>   testing.(*B).Run()
>>>>>       /home/peter/go/src/testing/benchmark.go:513 +0x482
>>>>>   testing.runBenchmarks.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:417 +0xa7
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.runBenchmarks()
>>>>>       /home/peter/go/src/testing/benchmark.go:423 +0x86d
>>>>>   testing.(*M).Run()
>>>>>       /home/peter/go/src/testing/testing.go:928 +0x51e
>>>>>   main.main()
>>>>>       command-line-arguments/_test/_testmain.go:46 +0x1d3
>>>>>
>>>>> Goroutine 57 (finished) created at:
>>>>>   command-line-arguments.BenchmarkPubSubPrimitiveChannelsMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:40 +0x290
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>> ==================
>>>>> --- FAIL: BenchmarkPubSubPrimitiveChannelsMultiple
>>>>>     benchmark.go:147: race detected during execution of benchmark
>>>>> ==================
>>>>> WARNING: DATA RACE
>>>>> Read at 0x00c42000c030 by goroutine 1079:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple.func1()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:76 +0x9e
>>>>>
>>>>> Previous write at 0x00c42000c030 by goroutine 7:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:101 +0x475
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>>
>>>>> Goroutine 1079 (running) created at:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:93 +0x2e6
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>>
>>>>> Goroutine 7 (running) created at:
>>>>>   testing.(*B).run1()
>>>>>       /home/peter/go/src/testing/benchmark.go:207 +0x8c
>>>>>   testing.(*B).Run()
>>>>>       /home/peter/go/src/testing/benchmark.go:513 +0x482
>>>>>   testing.runBenchmarks.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:417 +0xa7
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.runBenchmarks()
>>>>>       /home/peter/go/src/testing/benchmark.go:423 +0x86d
>>>>>   testing.(*M).Run()
>>>>>       /home/peter/go/src/testing/testing.go:928 +0x51e
>>>>>   main.main()
>>>>>       command-line-arguments/_test/_testmain.go:46 +0x1d3
>>>>> ==================
>>>>> ==================
>>>>> WARNING: DATA RACE
>>>>> Write at 0x00c42000c030 by goroutine 7:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:101 +0x475
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>>
>>>>> Previous read at 0x00c42000c030 by goroutine 1078:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple.func1()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:76 +0x9e
>>>>>
>>>>> Goroutine 7 (running) created at:
>>>>>   testing.(*B).run1()
>>>>>       /home/peter/go/src/testing/benchmark.go:207 +0x8c
>>>>>   testing.(*B).Run()
>>>>>       /home/peter/go/src/testing/benchmark.go:513 +0x482
>>>>>   testing.runBenchmarks.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:417 +0xa7
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.runBenchmarks()
>>>>>       /home/peter/go/src/testing/benchmark.go:423 +0x86d
>>>>>   testing.(*M).Run()
>>>>>       /home/peter/go/src/testing/testing.go:928 +0x51e
>>>>>   main.main()
>>>>>       command-line-arguments/_test/_testmain.go:46 +0x1d3
>>>>>
>>>>> Goroutine 1078 (running) created at:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:93 +0x2e6
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>> ==================
>>>>> ==================
>>>>> WARNING: DATA RACE
>>>>> Read at 0x00c4200140c8 by goroutine 7:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:109 +0x51d
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>>
>>>>> Previous write at 0x00c4200140c8 by goroutine 175:
>>>>>   sync/atomic.AddInt64()
>>>>>       /home/peter/go/src/runtime/race_amd64.s:276 +0xb
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple.func1()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:88 +0x19a
>>>>>
>>>>> Goroutine 7 (running) created at:
>>>>>   testing.(*B).run1()
>>>>>       /home/peter/go/src/testing/benchmark.go:207 +0x8c
>>>>>   testing.(*B).Run()
>>>>>       /home/peter/go/src/testing/benchmark.go:513 +0x482
>>>>>   testing.runBenchmarks.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:417 +0xa7
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.runBenchmarks()
>>>>>       /home/peter/go/src/testing/benchmark.go:423 +0x86d
>>>>>   testing.(*M).Run()
>>>>>       /home/peter/go/src/testing/testing.go:928 +0x51e
>>>>>   main.main()
>>>>>       command-line-arguments/_test/_testmain.go:46 +0x1d3
>>>>>
>>>>> Goroutine 175 (finished) created at:
>>>>>   command-line-arguments.BenchmarkPubSubWaitGroupMultiple()
>>>>>       /home/peter/gopath/src/nuts/pubsub_test.go:93 +0x2e6
>>>>>   testing.(*B).runN()
>>>>>       /home/peter/go/src/testing/benchmark.go:141 +0x12a
>>>>>   testing.(*B).run1.func1()
>>>>>       /home/peter/go/src/testing/benchmark.go:214 +0x6b
>>>>> ==================
>>>>> --- FAIL: BenchmarkPubSubWaitGroupMultiple
>>>>>     benchmark.go:147: race detected during execution of benchmark
>>>>> FAIL
>>>>> exit status 1
>>>>> FAIL    command-line-arguments  0.726s
>>>>> $
>>>>>
>>>>> Peter
>>>>>
>>>>> On Sunday, July 16, 2017 at 10:20:21 AM UTC-4, Zohaib Sibte Hassan wrote:
>>>>>>
>>>>>> I have been spending my day implementing an efficient PubSub system. I
>>>>>> had implemented one before using channels, and I wanted to benchmark
>>>>>> that against sync.Cond.
>>>>>> Here is the quick and dirty test that I put together:
>>>>>> https://gist.github.com/maxpert/f3c405c516ba2d4c8aa8b0695e0e054e.
>>>>>> Now my confusion starts when I change GOMAXPROCS to test how it would
>>>>>> perform on my age-old Raspberry Pi. Here are the results:
>>>>>>
>>>>>> mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=8 go test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out -benchmem
>>>>>> BenchmarkPubSubPrimitiveChannelsMultiple-8     10000    165419 ns/op    92 B/op    2 allocs/op
>>>>>> BenchmarkPubSubWaitGroupMultiple-8             10000    204685 ns/op    53 B/op    2 allocs/op
>>>>>> PASS
>>>>>> ok      sibte.so/rascore    3.749s
>>>>>> mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=4 go test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out -benchmem
>>>>>> BenchmarkPubSubPrimitiveChannelsMultiple-4     20000    101704 ns/op    60 B/op    2 allocs/op
>>>>>> BenchmarkPubSubWaitGroupMultiple-4             10000    204039 ns/op    52 B/op    2 allocs/op
>>>>>> PASS
>>>>>> ok      sibte.so/rascore    5.087s
>>>>>> mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=2 go test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out -benchmem
>>>>>> BenchmarkPubSubPrimitiveChannelsMultiple-2     30000     51255 ns/op    54 B/op    2 allocs/op
>>>>>> BenchmarkPubSubWaitGroupMultiple-2             20000     60871 ns/op    43 B/op    2 allocs/op
>>>>>> PASS
>>>>>> ok      sibte.so/rascore    4.022s
>>>>>> mxp@carbon:~/repos/raspchat/src/sibte.so/rascore$ GOMAXPROCS=1 go test -run none -bench Multiple -cpuprofile=cpu.out -memprofile=mem.out -benchmem
>>>>>> BenchmarkPubSubPrimitiveChannelsMultiple       20000     79534 ns/op    61 B/op    2 allocs/op
>>>>>> BenchmarkPubSubWaitGroupMultiple              100000     19066 ns/op    40 B/op    2 allocs/op
>>>>>> PASS
>>>>>> ok      sibte.so/rascore    4.502s
>>>>>>
>>>>>> I tried multiple times and the results are consistent. I am using Go
>>>>>> 1.8, Linux x64, 8GB RAM. I have multiple questions:
>>>>>>
>>>>>> - Why do channels perform worse than sync.Cond in the single-core
>>>>>> results? Context switching is the same; if anything, it should perform
>>>>>> worse.
>>>>>> - As I increase the max procs, the sync.Cond results going down might
>>>>>> be explainable, but what is up with channels? 20k to 30k to 20k to
>>>>>> 10k :( I have an i5 with 4 cores, so it should have peaked at 4 procs
>>>>>> (psst: I tried 3 as well; it's consistent).
>>>>>>
>>>>>> I still suspect I am making some kind of mistake in the code. Any
>>>>>> ideas?
>>>>>>
>>>>>> - Thanks