The general idea is that for HTML content, you want content from the same domain to be adjacent on disk. This way duplicate HTML template runs get compressed REALLY well.
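The locality claim above can be sketched with a small experiment. This is not Cassandra code; it assumes a simplified model where data is compressed in fixed-size blocks (in the spirit of SSTable compression chunks), each "domain" reuses one HTML template, and the domain names, chunk size, and template generator are all made up for illustration:

```python
import random
import zlib

CHUNK = 4  # pages per compression block (stand-in for an SSTable chunk)

def make_template(domain: str) -> str:
    # Domain-specific boilerplate: ~2 KB of pseudo-random markup per domain,
    # deterministic per domain so every page on a domain shares the template.
    rnd = random.Random(domain)
    filler = "".join(rnd.choice("abcdefghij<>/= ") for _ in range(2048))
    return f"<html><!-- {domain} -->{filler}"

def make_page(domain: str, page_id: int) -> bytes:
    return (make_template(domain) + f"<p>page {page_id}</p></html>").encode()

def chunked_compressed_size(pages: list) -> int:
    # Compress each block independently, as block-oriented storage does.
    total = 0
    for i in range(0, len(pages), CHUNK):
        block = b"".join(pages[i:i + CHUNK])
        total += len(zlib.compress(block, 9))
    return total

domains = [f"site{d}.example" for d in range(8)]
pages = [make_page(d, p) for d in domains for p in range(16)]

grouped = chunked_compressed_size(pages)  # same-domain pages adjacent on disk
shuffled = pages[:]
random.Random(1).shuffle(shuffled)
mixed = chunked_compressed_size(shuffled)  # domains interleaved

print(f"grouped: {grouped} bytes, mixed: {mixed} bytes")
```

With same-domain pages adjacent, each block holds several copies of one template and the duplicates compress to almost nothing; interleaved, each block holds several distinct templates and the total compressed size is several times larger.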
I think in our situation we would see exceptional compression. If we get closer to this I'll just implement snappy+bmdiff...

On Thu, May 29, 2014 at 12:34 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Sat, May 17, 2014 at 10:25 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> "compression" ... sure, but bmdiff? Not that I can find. BMDiff is an
>> algorithm that in some situations could result in 100000x compression due
>> to the way it's able to find long common runs. This is a pathological
>> case, but if you were to copy the US constitution into itself 100000x,
>> bmdiff could ideally get a 100000x compression rate.
>>
>> Not all compression algorithms are identical.
>
> The compression classes are pluggable. Exploratory patches are always
> welcome! :D
>
> Not sure I understand why you consider Byte Ordered Partitioner relevant.
> Isn't what matters for compressibility generally the uniformity of data
> within rows in the SSTable, not the uniformity of their row keys?
>
> =Rob

--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
... or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
War is peace. Freedom is slavery. Ignorance is strength. Corporations are people.