I wrote a multi-threaded duplicate file checker using MD5; the complete source is here: https://github.com/hbfs/dupe_check/blob/master/dupe_check.go
I benchmarked two variants on the same machine, on the same set of files (a ~1.7 GB folder with ~600 files averaging ~3 MB each), multiple times, purging the disk cache between runs.

With this streaming code:

    hash := md5.New()
    if _, err := io.Copy(hash, file); err != nil {
        fmt.Println(err)
    }
    var md5sum [md5.Size]byte
    copy(md5sum[:], hash.Sum(nil)[:16])

    // 3.35s user 105.20s system 213% cpu 50.848 total; memory usage ~30 MB

With this read-the-whole-file code:

    data, err := ioutil.ReadFile(path)
    if err != nil {
        fmt.Println(err)
    }
    md5sum := md5.Sum(data)

    // 3.10s user 31.75s system 104% cpu 33.210 total; memory usage ~1.52 GB

The memory usage makes sense, but why does the streaming version take ~1.5x longer in wall-clock time, and burn ~3x the system CPU, compared to the read-everything-into-memory version? This trade-off doesn't make sense to me: in both cases the file is read from disk, which should be the limiting factor, and then the md5sum is computed. The streaming version does an extra copy from []byte to [16]byte, but that should be negligible.

The only theory I can come up with is context switching:

Streaming version: disk -> processor. The processor is waiting on a disk read, so the scheduler switches to reading another file, putting the thread to sleep.

Entire file: disk -> memory -> processor. The file is already in memory, so there is not as much context switching.

What do you think? Thanks!