On Thursday, 11 August 2016 02:24:49 UTC+3, Alex Flint wrote:
>
> There are around 2M strings, and their total size is ~6 GB, so an average
> of 3k each.
>
What kind of data? How large is the alphabet? What is the distribution of
letters? Examples would be good :)
>
> I actually looked briefly at Go's compress/flate to see whether something
> like what you're describing is possible without writing my own compressor,
> but I couldn't see any obvious way to get at the underlying compressor.
>
This looks like something that is already solved for genomics data. If you
are OK with decompressing a block of m strings to reach one of them, where
m << n, then the BGZF extension to gzip would work for you. In brief, BGZF
blocks the gzip stream into 64 kB chunks that can be indexed, so any chunk
can be decompressed without touching the rest of the stream. The spec for
BGZF is here [1] (section 4, from page 11 on), and there is a Go
implementation in the biogo project (github.com/biogo/hts/bgzf).
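
For anyone who wants to try the blocked approach without pulling in a BGZF
implementation, here is a minimal sketch of the same idea using only the
standard library: write each block as an independent gzip member, record
where each member starts, and decompress just the member you need. The
names compressBlocks/readBlock and the blockIndex type are made up for
illustration, and real BGZF also records block sizes in an extra gzip
header field, which this sketch skips.

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "io"
)

// blockIndex records the byte offset at which each independently
// compressed block starts in the concatenated output.
type blockIndex []int64

// compressBlocks writes each block as its own gzip member and returns
// the concatenated stream plus the offset index.
func compressBlocks(blocks [][]byte) ([]byte, blockIndex, error) {
    var buf bytes.Buffer
    var idx blockIndex
    for _, b := range blocks {
        idx = append(idx, int64(buf.Len()))
        zw := gzip.NewWriter(&buf)
        if _, err := zw.Write(b); err != nil {
            return nil, nil, err
        }
        if err := zw.Close(); err != nil {
            return nil, nil, err
        }
    }
    return buf.Bytes(), idx, nil
}

// readBlock decompresses only the i'th block: slice the stream at the
// recorded offset and read a single gzip member.
func readBlock(data []byte, idx blockIndex, i int) ([]byte, error) {
    zr, err := gzip.NewReader(bytes.NewReader(data[idx[i]:]))
    if err != nil {
        return nil, err
    }
    defer zr.Close()
    zr.Multistream(false) // stop at the end of this gzip member
    return io.ReadAll(zr)
}

func main() {
    data, idx, err := compressBlocks([][]byte{
        []byte("first record"),
        []byte("second record"),
        []byte("third record"),
    })
    if err != nil {
        panic(err)
    }
    b, err := readBlock(data, idx, 1)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%s\n", b) // second record
}

With ~3 kB strings you would group many strings per block, as BGZF does,
to amortize the ~20-byte header and footer that every gzip member carries.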
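
If per-string random access matters more than ratio, another standard
library route (and possibly what the earlier suggestion was getting at) is
flate's preset-dictionary API: compress every string independently against
one shared dictionary, so any single string can be decompressed on its own
while cross-string redundancy is still exploited. A rough sketch, with a
hard-coded stand-in dictionary; a real one would be built from substrings
that are common across the whole list:

package main

import (
    "bytes"
    "compress/flate"
    "fmt"
    "io"
)

// compressWithDict compresses one record against a fixed preset
// dictionary using flate.NewWriterDict.
func compressWithDict(dict, record []byte) ([]byte, error) {
    var buf bytes.Buffer
    zw, err := flate.NewWriterDict(&buf, flate.BestCompression, dict)
    if err != nil {
        return nil, err
    }
    if _, err := zw.Write(record); err != nil {
        return nil, err
    }
    if err := zw.Close(); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}

// decompressWithDict reverses compressWithDict; the same dictionary
// must be supplied on both sides.
func decompressWithDict(dict, compressed []byte) ([]byte, error) {
    zr := flate.NewReaderDict(bytes.NewReader(compressed), dict)
    defer zr.Close()
    return io.ReadAll(zr)
}

func main() {
    dict := []byte("the quick brown fox jumps over the lazy dog")
    c, err := compressWithDict(dict, []byte("the quick brown fox says hello"))
    if err != nil {
        panic(err)
    }
    out, err := decompressWithDict(dict, c)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%d bytes compressed: %s\n", len(c), out)
}

flate.NewWriterDict and flate.NewReaderDict are in the standard library
today; the part the stdlib doesn't give you is building a good shared
dictionary.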
On Wed, Aug 10, 2016 at 3:27 PM, Alex Flint wrote:
>
> I have a long list of short strings that I want to compress, but I want
> to be able to decompress an arbitrary string in the list at any time
> without decompressing the entire list.
>
> I know the list ahead of time, and it doesn't matter how much time or
> memory it takes to build the compressed representation up front.