On 7/26/06 10:14 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote:
> Mark Kirkwood <[EMAIL PROTECTED]> writes:
>> An obvious deduction is that the TPCH dataset is much more amenable to
>> run compression than my synthetic Zipfian data was. The interesting
>> question is how well "real" datasets are run compressible,
>
> Yeah --- the back-of-the-envelope calculations I was making presupposed
> uniform random distribution, and we know that's often not realistic for
> real datasets. A nonuniform distribution would probably mean that some
> of the bitmaps compress better-than-expected and others worse. I have
> no idea how to model that and guess what the overall result is ...
>
The paper "Optimizing Bitmap Indices With Efficient Compression" by Kesheng
Wu et al gave an approximate answer for this question. Assume that there are
c distinct values. Let the i-th value has a probability of p_i, the number
of rows r, and the word size w. then the total size of the compressed bitmap
index is about (N/(w-1))(c- \sum(1-p_i)^(2w-2) - \sum(p_i)^(2w-2)), where in
both \sum's, i is from 1 to c.
The constraint for this equation is \sum(p_i)=1. Therefore, when all p_i are
equal, or the attribute has randomly distributed values, the size of the
bitmap index is the largest.
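
For concreteness, here is a quick Python sketch that evaluates the
approximation above for a uniform and a Zipfian distribution. The helper
name est_words and the parameter values N = 10^7, c = 100, w = 32 are
arbitrary choices for illustration, not anything from the paper:

def est_words(probs, n_rows, w):
    # Wu et al. approximation:
    #   (N/(w-1)) * (c - \sum(1-p_i)^(2w-2) - \sum(p_i)^(2w-2))
    e = 2 * w - 2
    c = len(probs)
    return (n_rows / (w - 1)) * (
        c
        - sum((1 - p) ** e for p in probs)
        - sum(p ** e for p in probs))

N, c, w = 10**7, 100, 32            # rows, distinct values, word size

uniform = [1.0 / c] * c             # all p_i equal
raw = [1.0 / (i + 1) for i in range(c)]
zipf = [p / sum(raw) for p in raw]  # Zipfian, normalized so \sum(p_i) = 1

print("uniform:", est_words(uniform, N, w))
print("zipfian:", est_words(zipf, N, w))

With these made-up numbers the uniform distribution gives the larger
estimate, consistent with the statement above that equal p_i maximize the
compressed index size.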