Thanks everyone for the leads and resources. After a few more days of hacking 
away and experimenting with changes, I was able to get a solution that runs 
in under 20s (see here 
<https://github.com/Goldabj/1brc-go/blob/main/cmd/brc/log_processor.go>). 
Turns out GC wasn't the problem; it was mostly issues of my own making. There 
are a few things that really helped me get there: 

1. *Parsing lines in place* -- As Ben mentioned, I was doing duplicate work 
for each line. I would run `bytes.IndexByte(buffer, '\n')` to find the end 
of the current line, then pass the line to a `lineToMeasurement(line)` 
method which would iterate through each character to parse out the city and 
the temperature sample. So I was reading each character twice. I knew this 
duplication existed, but I didn't think much about it, especially since 
`IndexByte` uses SIMD instructions (or so I believe), so I assumed the cost 
was negligible. However, when processing 1 billion lines, a few nanoseconds 
per line really adds up. Therefore I updated lineToMeasurement 
<https://github.com/Goldabj/1brc-go/blob/main/cmd/brc/log_processor.go#L84> 
to handle the parsing in a single pass. This cut a few seconds off the 
processing time. 
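
In case it's useful to anyone else, here's a rough sketch of the single-pass 
idea. The helper name and signature are mine, not the exact code in my repo: 
it walks each record once, slicing out the city and accumulating the 
temperature as an integer number of tenths.

```go
// parseLine is an illustrative helper, not the repo's lineToMeasurement.
// It scans the bytes of one "City;-12.3\n" record exactly once, returning
// the city slice, the temperature in tenths of a degree, and the rest of
// the buffer after the newline.
func parseLine(buf []byte) (city []byte, tempTenths int, rest []byte) {
	i := 0
	for buf[i] != ';' { // city runs up to the ';'
		i++
	}
	city = buf[:i]
	i++

	neg := false
	if buf[i] == '-' {
		neg = true
		i++
	}
	for ; buf[i] != '\n'; i++ { // digits and a single '.', e.g. "12.3"
		if buf[i] != '.' {
			tempTenths = tempTenths*10 + int(buf[i]-'0')
		}
	}
	if neg {
		tempTenths = -tempTenths
	}
	return city, tempTenths, buf[i+1:]
}
```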

2. *Fewer consumer goroutines* -- This was just a dumb oversight on my 
part. I was running `runtime.NumCPU()` consumer goroutines. However, there 
is also a producer goroutine and a reducer goroutine. Just changing the 
number of workers to NumCPU - 1 
<https://github.com/Goldabj/1brc-go/blob/main/cmd/brc/log_processor.go#L17> 
removed all the stack management and Go scheduler overhead I was seeing in 
the CPU profile. I think this saved ~6s. 
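
The change itself is tiny; a minimal, illustrative version of the worker 
setup (names are placeholders, not the ones in my repo):

```go
// Illustrative sketch: size the consumer pool to leave a CPU free for the
// producer and reducer goroutines.
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	numWorkers := runtime.NumCPU() - 1 // leave room for producer + reducer
	if numWorkers < 1 {
		numWorkers = 1
	}

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// a real consumer would drain the chunks channel here
			fmt.Println("consumer", id, "done")
		}(i)
	}
	wg.Wait()
}
```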

3. *Sequentially reading the file with `file.Read` instead of mmapping* -- 
Originally I mmapped the file since it made the code simpler to write at 
first, and also because I wanted to learn how to do that in Go. I knew 
sequential reading was likely to be slightly faster, since mmapping tends 
to only win when you're sharing data across processes or threads, or 
randomly accessing the file. So I copied the approach from the blog post 
<https://github.com/Goldabj/1brc-go/blob/main/cmd/brc/log_processor.go#L30-L52> 
of reading the file sequentially with `file.Read`, then truncating the 
chunk at the last '\n' with `bytes.LastIndex(buf, []byte("\n"))`. This cut 
the runtime from ~30s to ~20s. I didn't expect the difference to be so 
drastic. Earlier, I had written a small benchmark comparing mmapping 
<https://github.com/Goldabj/1brc-go/blob/main/cmd/expirements/mmap/mmapedFile.go> 
versus reading sequentially 
<https://github.com/Goldabj/1brc-go/blob/main/cmd/expirements/fileRead/fileReadExp.go#L20>. 
Mmapping took ~5s to read all the bytes, and the sequential approach took 
~25s. However, the sequential benchmark uses a `Scanner`, so I'm assuming 
most of the additional time is in the split function rather than the 
`file.Read` calls themselves. I'm still not sure why there is such a large 
difference between the approaches. Maybe mmapping performs poorly on my 
platform? Maybe `bytes.LastIndex` is somehow faster than `bytes.IndexByte`? 
Or maybe sequential reading has some underlying benefit (like a better page 
cache or L1/L2 cache hit rate)? 
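
For context, the chunked-read pattern looks roughly like this (chunk size, 
names, and the processChunk stub are mine, not the blog's or my repo's exact 
code): read a fixed-size chunk, cut it at the last '\n', and carry the 
partial trailing line over into the next read.

```go
// Illustrative sketch of the sequential chunked-read pattern described
// above. Chunk size, names, and processChunk are placeholders.
package main

import (
	"bytes"
	"io"
	"log"
	"os"
)

func processChunk(chunk []byte) {
	// parse whole lines here (e.g. with a single-pass parser as in item 1)
	_ = chunk
}

func main() {
	f, err := os.Open("measurements.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	buf := make([]byte, 16*1024*1024) // read in 16MB chunks
	var leftover []byte               // partial trailing line from the previous read

	for {
		n, err := f.Read(buf)
		if n > 0 {
			chunk := append(leftover, buf[:n]...)
			if last := bytes.LastIndexByte(chunk, '\n'); last >= 0 {
				processChunk(chunk[:last+1])                      // whole lines only
				leftover = append([]byte(nil), chunk[last+1:]...) // carry the tail
			} else {
				leftover = chunk
			}
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
	}
	if len(leftover) > 0 {
		processChunk(append(leftover, '\n')) // file may not end with a newline
	}
}
```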


There are some additional things I've noticed but don't have answers for: 

1. *Large channel buffers* -- Originally I was allocating the chunks 
channel and the results channel 
<https://github.com/Goldabj/1brc-go/blob/main/cmd/brc/log_processor.go#L20-L21> 
with a large buffer (128). I assumed a large buffer would avoid the risk of 
the buffer filling up and the channel blocking. However, after lowering the 
buffer to ~8-12 I saw a significant reduction in runtime, and especially in 
the variance between runs. With large buffers I would get ~35s runtimes 
(sometimes up to 45s), but with the buffer size lowered to 12 the runtimes 
were between 19s and 22s every time. This seems counterintuitive to me; I 
assumed a channel was just a queue/array with a mutex. I think I'll need to 
do some more reading on channels. 
*I will mention that I was running this while my laptop was low on battery 
and some things were running in the background, so this may have been a 
coincidence and channel size may make no difference at all. *
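
For reference, the buffer change itself is just the second argument to 
`make` (element types below are placeholders, not what's in my repo):

```go
// Illustrative only: the same two pipeline channels, with a small buffer
// instead of 128. Element types are placeholders.
package main

func main() {
	chunks := make(chan []byte, 12)          // producer -> consumers
	results := make(chan map[string]int, 12) // consumers -> reducer
	_, _ = chunks, results
}
```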

2. *Seg faults when not copying slices* -- I threw away this code, but in 
one attempt I was seeing seg faults when trying to read a slice from a 
struct, and I'm not sure why. I had updated my `Measurement` struct 
<https://github.com/Goldabj/1brc-go/blob/main/cmd/brc/measurement.go#L7> 
to store a `cityBytes []byte`. Then on this line 
<https://github.com/Goldabj/1brc-go/blob/main/cmd/brc/log_processor.go#L103> 
I would set `cityBytes: line[:idx]`. However, when I later tried to 
reference `measurement.CityBytes[0]`, I would get a seg fault. The only fix 
was to copy the slice with `city := make([]byte, len(line[:idx]))` and 
`copy(city, line[:idx])`, then set `cityBytes: city`. However, I don't 
understand why this is needed. I thought the underlying array should still 
exist as long as at least one slice referenced it. I couldn't come up with 
a simple example to reproduce it, so :shrug: . 
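
For anyone following along, here's the shape of the fix as a standalone 
sketch (names are illustrative, not the exact ones from my repo). The one 
thing I do know is that `line[:idx]` is only a view into whatever buffer 
the line was cut from, while the copy owns its own backing array.

```go
// Illustrative sketch of the copy-on-construct fix. Names are mine, not
// the repo's. Note that copy returns the number of bytes copied (an int),
// so the field has to be set to the new slice, not to copy's result.
type Measurement struct {
	CityBytes  []byte
	TempTenths int
}

func newMeasurement(line []byte, idx, tempTenths int) Measurement {
	city := make([]byte, idx)
	copy(city, line[:idx]) // own backing array, independent of the chunk buffer
	return Measurement{CityBytes: city, TempTenths: tempTenths}
}
```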


Thanks all, Brendan
On Tuesday, July 23, 2024 at 8:46:42 AM UTC-5 Marcos Issler wrote:

> Hi Brendan, take a look at this great Ardan Labs event where they explore 
> in depth how to analyze the GC. I think you'll find your answer there about 
> how the GC works, and how to weigh the number of goroutines against GC and 
> memory use for the best performance. 
>
> You need to sign in to Ardan Labs, but it's free. I didn't find the 
> video anywhere else. 
>
>
> https://courses.ardanlabs.com/courses/take/go-trace-tooling-concurrency/lessons/56254473-webinar-with-bill-kennedy-june-25-2024
>
> They have some interesting info about this on their blog, too: 
>
>
> https://www.ardanlabs.com/blog/2018/12/garbage-collection-in-go-part1-semantics.html
> https://www.ardanlabs.com/blog/2018/08/scheduling-in-go-part1.html
>
> Bests,
> Marcos Issler 
>
> Em segunda-feira, 22 de julho de 2024 às 21:22:25 UTC-3, ben...@gmail.com 
> escreveu:
>
>> Hi Brendan, this is a fun problem. I'm looking at 
>> https://github.com/Goldabj/1brc-go/blob/main/cmd/brc/log_processor.go, 
>> and I suspect the main thing is that you're converting []byte to string, 
>> which almost certainly allocates. You're also scanning through the bytes 
>> several times (first to find the '\n', then the ';', then the '.'). And 
>> you're calling ParseInt (twice), which does more work than you need to for 
>> this problem. If you want to go further, I'd suggest keeping things as 
>> []byte, and making the code search for the various delimiters *and* parse 
>> at the same time.
>>
>> If you want to "cheat" :-), you could look at my write-up, where I 
>> present 9 (incrementally faster) solutions to this problem: 
>> https://benhoyt.com/writings/go-1brc/
>>
>> -Ben
>>
>> On Tuesday, July 23, 2024 at 8:17:57 AM UTC+12 Ian Lance Taylor wrote:
>>
>>> On Mon, Jul 22, 2024 at 1:05 PM Brendan Goldacker <brend...@gmail.com> 
>>> wrote:
>>>
>>>> I've been using the 1 billion row challenge 
>>>> <https://github.com/gunnarmorling/1brc/discussions/67> as a way to 
>>>> learn Go over the past few days. With my limited Go experience, I seem 
>>>> to have hit a wall. 
>>>>
>>>> I have my submission here <https://github.com/Goldabj/1brc-go>. On my 
>>>> M1 MacBook Pro with 16GB of memory, it runs in ~39s. I can get it a few 
>>>> seconds faster if I play around with some parameters (numWorkers and 
>>>> chunkSizes). My goal is to get under 20s. However, I'm stuck.
>>>>
>>>> I came across another submission via this blog post 
>>>> <https://www.bytesizego.com/blog/one-billion-row-challenge-go>. I've 
>>>> copied that submission into my repo (otherSubmission.go) and it runs in 
>>>> ~19s. Looking over their solution, it appears very similar to mine in 
>>>> approach. The main difference is that they use os.Read to read the file 
>>>> sequentially, while my submission uses MMap, plus a few other minor 
>>>> differences. Overall, both approaches look to be about the same. 
>>>> However, theirs runs in about half the time. I can't find anything that 
>>>> stands out to explain such a large difference in runtime.
>>>>
>>>> I've been trying to use the profiles and execution traces to figure out 
>>>> why. Below are my flame graphs and execution traces:
>>>> [image: Screenshot 2024-07-22 at 1 19 40 PM] 
>>>> [image: Screenshot 2024-07-22 at 1 21 47 PM] 
>>>>
>>>> Below are the otherSubmission profile and execution traces:
>>>> [image: Screenshot 2024-07-22 at 1 23 39 PM] 
>>>> [image: Screenshot 2024-07-22 at 1 22 47 PM] 
>>>>
>>>> The first thing that stands out to me is the number of GC collections. 
>>>> My submission seems to be running GC much more frequently: 100,000 
>>>> incremental sweeps, 32,000 background sweeps, and around 1,000 
>>>> stop-the-world GCs. The heap only grows to 180MB at its peak, and 
>>>> typically stays around 72MB. In the otherSubmission, incremental GC runs 
>>>> only 6 times, and stop-the-world runs ~30 times. Their heap grows much 
>>>> larger (up to 3.5GB), and GC happens less frequently.
>>>>
>>>> Also, looking at the flame graphs, it looks like much more time is 
>>>> being spent on managing the stack (systemstack, runtime.morestack, 
>>>> and runtime.mcall). I think this may be due to extra scheduler overhead 
>>>> from the frequent GCs (waiting for goroutines to reach a stopping point, 
>>>> running GC, then re-scheduling threads to start again).
>>>>
>>>> I believe the time discrepancy is mostly due to GC and Go scheduler 
>>>> overhead, although I'm not sure why.
>>>>
>>>> Could anyone help a newbie figure out where I'm going wrong? Could you 
>>>> also share any tools or tricks that would help me figure this out?
>>>>
>>>
>>> I have not looked at your code.  But if the GC is running a lot, the 
>>> first step is to look at where you are allocating memory, using Go's memory 
>>> profiler.  It may help to read https://go.dev/blog/pprof.  See if you 
>>> can adjust your code to allocate less memory.
>>>
>>> Ian
>>>
>>
