But the two direct questions I have are:

1. What is the structure of the Bloom index? Can you please let me know
what the fields of a Bloom index are? Is it just the Item Pointer
and BloomSignatureWord?

I'm not sure about Postgres' actual implementation; I have only looked
at the implementation of the underlying hash functions.

When I describe my bloom index, it looks like the following.

postgres=# \d+ foo.idx_bloom_bar
                  Index "foo.idx_bloom_bar"
 Column  |  Type   | Key? | Definition | Storage | Stats target
---------+---------+------+------------+---------+--------------
 id      | integer | yes  | id         | plain   |
 dept    | integer | yes  | dept       | plain   |
 id2     | integer | yes  | id2        | plain   |
 id3     | integer | yes  | id3        | plain   |
 id4     | integer | yes  | id4        | plain   |
 id5     | integer | yes  | id5        | plain   |
 id6     | integer | yes  | id6        | plain   |
 zipcode | integer | yes  | zipcode    | plain   |
bloom, for table "foo.bar"

The bloom index associates a signature, i.e. a bitfield whose size is
given by the "length" parameter, with the tuple location. The bitfield
is computed by hashing the values of the columns listed in the index
definition, so that the many column values are, in effect, compressed
into one small signature.

Options: length=64, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4,
col8=4
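
For reference, such an index is declared along these lines (just a
sketch, assuming the "bloom" contrib extension is installed and the
column list matches the table):

  CREATE EXTENSION IF NOT EXISTS bloom;

  -- one 64-bit signature per row, 4 bits set per indexed column
  CREATE INDEX idx_bloom_bar ON foo.bar USING bloom
    (id, dept, id2, id3, id4, id5, id6, zipcode)
    WITH (length = 64, col1 = 4, col2 = 4, col3 = 4, col4 = 4,
          col5 = 4, col6 = 4, col7 = 4, col8 = 4);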

I doubt that these parameters are good. There is no point in including
a unique column in a bloom index. If you look at my blog, the number of
bits associated with each field should depend on the expected
selectivity, which in turn depends on the entropy of each field and on
the signature size. The column entropy can be computed with a query.
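
For instance, a rough sketch of such an entropy query for one column,
say "dept" (Shannon entropy in bits, assuming the current table
contents are representative):

  SELECT -sum(p * ln(p)) / ln(2.0) AS entropy_bits
  FROM (
    SELECT count(*)::float8 / sum(count(*)) OVER () AS p
    FROM foo.bar
    GROUP BY dept
  ) AS freq;

The result can then be fed into the kind of formula discussed in the
blog.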

2. If we are storing just one signature word per row, how is this
aggregated for all column values of that row into one signature, at a
high level?

The aggregation, if performed, is not very useful in practice because
it can only be effective on a few (first) bits, which are randomly
computed anyway, so the queried value is not likely to hit them.

Fundamentally, all bitfields are scanned to find which tuples are
possibly of interest, i.e. which are not excluded by the index. This
"full scan" of the bloom index is not a bad thing if the index is much
smaller than the table itself.
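
As a toy illustration of that exclusion test (a sketch only, using a
hypothetical table "sigs" that holds each row's 64-bit signature in a
bigint column "sig", with :qsig the signature computed from the
queried values):

  -- a row can only match if every bit of the query signature
  -- is also set in the row signature
  SELECT heap_tid
  FROM sigs
  WHERE (sig & :qsig) = :qsig;

Rows that pass this test still have to be rechecked against the heap,
which is where the false positives of the signature are weeded out.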

For example, if length = 64, does it mean that a bit array of 64 bits
is generated for each row?

Yes.

If col1=4, does it mean the value of col1 is passed to 4 hash functions
that generate 4 positions that can be set to 1 in the bit array?

Yes.
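
A sketch of that idea in SQL, perturbing the value with a counter and
hashing with hashint4 as a crude stand-in for 4 independent hash
functions (the actual hashing in the bloom code differs):

  -- 4 pseudo-random positions in 0..63 for the value 42,
  -- OR-ed into a single 64-bit signature word
  SELECT bit_or(1::bigint << (hashint4(42 + k) & 63)) AS signature
  FROM generate_series(0, 3) AS k;

Each indexed column contributes its set bits to the same per-row bit
array, as the two answers above imply.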

Try to apply the formula in the blog to see what you get for your
parameters; the result is likely to be < 4.

--
Fabien.

