The general idea is that for HTML content, you want content from the same
domain to be adjacent on disk. This way duplicate HTML template runs get
compressed REALLY well.
I think in our situation we would see exceptional compression.
If we get closer to this I'll just implement snappy+bmdiff...
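To make the adjacency point concrete, here is a toy sketch (synthetic pages, plain zlib standing in for the SSTable block compressor, not bmdiff): the same bytes in two orderings, where every page on a site repeats that site's template.

    import random
    import zlib

    rng = random.Random(42)

    def make_template():
        # Stand-in for one site's shared HTML boilerplate: ~12 KB of
        # site-specific text that every page on that site repeats.
        return "".join(rng.choice("abcdefghij <>/") for _ in range(12_000))

    domains = ["site%02d.com" % i for i in range(30)]
    tmpl = {d: make_template() for d in domains}
    pages = [(d, tmpl[d] + "<p>article %d on %s</p>" % (n, d))
             for d in domains for n in range(20)]

    def csize(rows):
        return len(zlib.compress("".join(body for _, body in rows).encode(), 9))

    shuffled = pages[:]
    rng.shuffle(shuffled)

    # Same bytes, two orderings. Sorted by domain, each site's template sits
    # within zlib's 32 KB match window, so the duplicate runs get found;
    # shuffled, same-domain pages land ~360 KB apart on average and the
    # window never sees the repeats.
    print("shuffled :", csize(shuffled))
    print("by domain:", csize(sorted(pages)))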
On Sat, May 17, 2014 at 10:25 PM, Kevin Burton wrote:
> "compression" … sure.. but bmdiff? Not that I can find. BMDiff is an
> algorithm that in some situations could result in 10x compression due
> to the way it's able to find long commons runs. This is a pathological
> case though. But i
"compression" … sure.. but bmdiff? Not that I can find. BMDiff is an
algorithm that in some situations could result in 10x compression due
to the way it's able to find long commons runs. This is a pathological
case though. But if you were to copy the US constitution into itself
… 10x… bm
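For what it's worth, the match window is the whole story in that pathological case. A toy sketch, again using zlib as the stand-in short-range compressor: duplicating a document only deduplicates when the repeat distance fits inside zlib's 32 KB window, which is exactly the long-range gap bmdiff/vcdiff cover.

    import random
    import zlib

    def ratio(doc):
        # Compress a document concatenated with itself. A long-range scheme
        # like bmdiff would make the second copy nearly free; zlib can only
        # do that if the repeat distance fits in its 32 KB match window.
        data = (doc + doc).encode()
        return len(data) / len(zlib.compress(data, 9))

    rng = random.Random(7)
    small = "".join(rng.choice("abcdef ") for _ in range(20_000))   # repeat at 20 KB: in window
    large = "".join(rng.choice("abcdef ") for _ in range(200_000))  # repeat at 200 KB: out of reach

    print("20 KB doc x 2 :", round(ratio(small), 1))
    print("200 KB doc x 2:", round(ratio(large), 1))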
Cassandra offers compression out of the box. Look into the options available
upon table creation.
The use of the ordered partitioner is an anti-pattern 999/1000 times. It
creates hot spots - wide rows can often accomplish the same result through
the use of clustering columns.
--
Colin
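For reference, both of Colin's suggestions in one sketch, via the DataStax Python driver. The contact point, keyspace, and table are made up for illustration, and the compression option names shown are the Cassandra 2.0-era ones (newer releases spell them 'class' / 'chunk_length_in_kb').

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("crawler")  # assumed contact point/keyspace

    # Partitioning by domain and clustering by url keeps a site's pages
    # contiguous inside one partition, no ordered partitioner needed; the
    # compression map enables the built-in SSTable compression.
    session.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            domain text,
            url    text,
            body   text,
            PRIMARY KEY (domain, url)
        ) WITH compression = {'sstable_compression': 'SnappyCompressor',
                              'chunk_length_kb': 64}
    """)

    # Pages for one domain come back (and sit on disk) in url order:
    rows = session.execute("SELECT url FROM pages WHERE domain = %s", ["a.com"])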
So I see that Cassandra doesn't support bmdiff/vcdiff.
Is this primarily because most people aren't using the ordered partitioner?
bmdiff gets good compression when similar content is stored next to each
other on disk. So lots of HTML content would compress well.
but if everything is being stored across random token ranges, similar pages
won't end up adjacent on disk.