1. Decompression:
 - Try https://github.com/klauspost/pgzip - it's a drop-in replacement for compress/gzip and the author claims it has twice the decompression speed due to better buffering (lower IO wait). See the sketch after this list.
 - Gzip decompression is single threaded - to use all cores, decode multiple files at the same time.
 - Your storage (EBS gp2) has a max throughput of 160MB/s per volume (see https://aws.amazon.com/ebs/details/ ). Assuming you get max throughput (which is not guaranteed), just reading 5GB and writing 20GB will take almost 3 minutes. When downloading you only have to write 5GB, which is why it's faster. To get better speeds use a RAM drive (hey, you have 122GB of RAM), use st1 EBS, or put multiple EBS volumes in RAID0.
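To make the first two points concrete, here's a minimal sketch (the file names are placeholders): pgzip.NewReader mirrors gzip.NewReader, so it slots into the snippet you quoted unchanged, and decoding several files in their own goroutines is what keeps more than one core busy.

    package main

    import (
        "io"
        "log"
        "os"
        "sync"

        "github.com/klauspost/pgzip"
    )

    // decompress writes the gunzipped contents of src to dst.
    func decompress(src, dst string) error {
        in, err := os.Open(src)
        if err != nil {
            return err
        }
        defer in.Close()

        zr, err := pgzip.NewReader(in) // drop-in for gzip.NewReader
        if err != nil {
            return err
        }
        defer zr.Close()

        out, err := os.Create(dst)
        if err != nil {
            return err
        }
        defer out.Close()

        _, err = io.Copy(out, zr)
        return err
    }

    func main() {
        // Hypothetical file names - decode the files concurrently, since a
        // single gzip stream can only be decompressed sequentially.
        files := map[string]string{
            "a.csv.gz": "a.csv",
            "b.csv.gz": "b.csv",
        }
        var wg sync.WaitGroup
        for src, dst := range files {
            wg.Add(1)
            go func(src, dst string) {
                defer wg.Done()
                if err := decompress(src, dst); err != nil {
                    log.Printf("%s: %v", src, err)
                }
            }(src, dst)
        }
        wg.Wait()
    }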
2. Upload:
 - If you're using the AWS SDK with chunked uploads then stop - it will first load the entire file into memory (so read + allocate...), then calculate a hash of the whole file (and not a "cheap" one), then sign each part of the file, eventually sending it over HTTPS (so + encryption...).
 - Use https://docs.aws.amazon.com/sdk-for-go/api/service/s3/s3manager/ with multi-part uploads instead. Experiment with the part size to find the sweet spot (most likely somewhere between the minimum 5MB and 64MB) and use a higher concurrency value - the default is 5 and that is often 10-50 times too low, YMMV. There's a sketch after the quoted message below.

On Friday, February 10, 2017 at 7:58:50 PM UTC-5, mukund....@gmail.com wrote:
>
> Hello,
>
> I have written a Go program which downloads a 5G compressed CSV from
> Amazon S3, decompresses it and uploads the decompressed CSV (20G) to
> Amazon S3.
>
> Amazon S3 provides a default concurrent uploader/downloader and I am
> using a multithreaded approach to download files in parallel, decompress
> and upload. The program seems to work fine, however I believe the program
> could be optimized further. And not all the cores are used though I have
> parallelized for the no. of CPUs available. The CPU usage is only around
> 30-40%. I see an IO wait around 30-40%.
>
> The download happens faster, the decompression takes 5-6 minutes and the
> upload happens in parallel but takes almost an hour to upload a set of 8
> files.
>
> For decompression, I use
>     reader, err := gzip.NewReader(gzipfile)
>     writer, err := os.Create(outputFile)
>     err = io.Copy(writer, reader)
>
> I use a 16 CPU, 122 GB RAM, 500 GB SSD instance.
>
> Are there any other methodologies where I can optimize the decompression
> part and the upload part?
>
> I am pretty new to Golang. Any guidance is very much appreciated.
>
> Regards
> Mukund
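Here's roughly what the s3manager route looks like - a sketch, not a drop-in: the region, bucket, key, file name and the 16MB/50 tuning values are made-up placeholders, so adjust them to whatever your own measurements show is fastest.

    package main

    import (
        "log"
        "os"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3/s3manager"
    )

    func main() {
        sess := session.Must(session.NewSession(&aws.Config{
            Region: aws.String("us-east-1"), // placeholder: use your bucket's region
        }))

        // Streaming from an *os.File means the uploader reads the 20GB CSV
        // part by part instead of buffering the whole thing in memory.
        f, err := os.Open("output.csv") // hypothetical decompressed file
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
            u.PartSize = 16 * 1024 * 1024 // try values between 5MB and 64MB
            u.Concurrency = 50            // default is 5, which is often far too low
        })

        out, err := uploader.Upload(&s3manager.UploadInput{
            Bucket: aws.String("my-bucket"), // placeholder
            Key:    aws.String("output.csv"),
            Body:   f,
        })
        if err != nil {
            log.Fatal(err)
        }
        log.Println("uploaded to", out.Location)
    }

Since the uploader runs its parts concurrently on its own, you can also run one uploader per file across your 8 files and let part size x concurrency saturate the network rather than the disk.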