On Thursday, 11 August 2016 02:24:49 UTC+3, Alex Flint wrote:
>
> There are around 2M strings, and their total size is ~6 GB, so an average
> of 3k each.
>
What kind of data? How large is the alphabet? What is the distribution of
letters? Examples would be good :)
>
> I actually looked briefly at Go's compress/flate to see whether something
> like what you're describing is possible without writing my own compressor,
> but I couldn't see any obvious way to get at the underlying compressor.
>
This looks like something that is already solved for genomics data. If you
are OK with decompressing a block of m strings to reach one of them, where
m << n, then the BGZF extension to gzip would work for you. In brief, BGZF
blocks the gzip stream into 64 kB chunks that can be indexed, so any chunk
can be decompressed without touching the rest of the stream. The spec for
BGZF is here [1] (section 4, from page 11 on), and there is a Go
implementation in the biogo project (github.com/biogo/hts/bgzf).
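
For anyone who wants to try the blocked approach without pulling in a BGZF
implementation, here is a minimal sketch of the same idea using only the
standard library: write each block as an independent gzip member, record
where each member starts, and decompress just the member you need. The
names compressBlocks/readBlock and the blockIndex type are made up for
illustration, and real BGZF also records block sizes in an extra gzip
header field, which this sketch skips.

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "io"
)

// blockIndex records the byte offset at which each independently
// compressed block starts in the concatenated output.
type blockIndex []int64

// compressBlocks writes each block as its own gzip member and returns
// the concatenated stream plus the offset index.
func compressBlocks(blocks [][]byte) ([]byte, blockIndex, error) {
    var buf bytes.Buffer
    var idx blockIndex
    for _, b := range blocks {
        idx = append(idx, int64(buf.Len()))
        zw := gzip.NewWriter(&buf)
        if _, err := zw.Write(b); err != nil {
            return nil, nil, err
        }
        if err := zw.Close(); err != nil {
            return nil, nil, err
        }
    }
    return buf.Bytes(), idx, nil
}

// readBlock decompresses only the i'th block: slice the stream at the
// recorded offset and read a single gzip member.
func readBlock(data []byte, idx blockIndex, i int) ([]byte, error) {
    zr, err := gzip.NewReader(bytes.NewReader(data[idx[i]:]))
    if err != nil {
        return nil, err
    }
    defer zr.Close()
    zr.Multistream(false) // stop at the end of this gzip member
    return io.ReadAll(zr)
}

func main() {
    data, idx, err := compressBlocks([][]byte{
        []byte("first record"),
        []byte("second record"),
        []byte("third record"),
    })
    if err != nil {
        panic(err)
    }
    b, err := readBlock(data, idx, 1)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%s\n", b) // second record
}

With ~3 kB strings you would group many strings per block, as BGZF does,
to amortize the ~20-byte header and footer that every gzip member carries.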
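
If per-string random access matters more than ratio, another standard
library route (and possibly what the earlier suggestion was getting at) is
flate's preset-dictionary API: compress every string independently against
one shared dictionary, so any single string can be decompressed on its own
while cross-string redundancy is still exploited. A rough sketch, with a
hard-coded stand-in dictionary; a real one would be built from substrings
that are common across the whole list:

package main

import (
    "bytes"
    "compress/flate"
    "fmt"
    "io"
)

// compressWithDict compresses one record against a fixed preset
// dictionary using flate.NewWriterDict.
func compressWithDict(dict, record []byte) ([]byte, error) {
    var buf bytes.Buffer
    zw, err := flate.NewWriterDict(&buf, flate.BestCompression, dict)
    if err != nil {
        return nil, err
    }
    if _, err := zw.Write(record); err != nil {
        return nil, err
    }
    if err := zw.Close(); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}

// decompressWithDict reverses compressWithDict; the same dictionary
// must be supplied on both sides.
func decompressWithDict(dict, compressed []byte) ([]byte, error) {
    zr := flate.NewReaderDict(bytes.NewReader(compressed), dict)
    defer zr.Close()
    return io.ReadAll(zr)
}

func main() {
    dict := []byte("the quick brown fox jumps over the lazy dog")
    c, err := compressWithDict(dict, []byte("the quick brown fox says hello"))
    if err != nil {
        panic(err)
    }
    out, err := decompressWithDict(dict, c)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%d bytes compressed: %s\n", len(c), out)
}

flate.NewWriterDict and flate.NewReaderDict are in the standard library
today; the part the stdlib doesn't give you is building a good shared
dictionary.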
On Wed, Aug 10, 2016 at 3:27 PM, Alex Flint wrote:
>
> I have a long list of short strings that I want to compress, but I want
> to be able to decompress an arbitrary string in the list at any time
> without decompressing the entire list.
>
> I know the list ahead of time, and it doesn't matter how much time or
> memory it takes to build the compressed representation up front.