Thanks DuyHai, I think the trouble of bloom filter on all row keys & column names is memory usage. However, if a CF has only hundreds of columns per row, the number of total columns will be much fewer, so the bloom filter is possible for this condition, right? Is there a good way to adjust bloom filter's property between row keys and row keys+column names automatically or by user's config?
Thanks, Philo Yang 2014-09-15 2:45 GMT+08:00 DuyHai Doan <doanduy...@gmail.com>: > Hello Philo > > Building bloom filter for column names (what you call column key) is > technically possible but very expensive in term of memory usage. > > The approximate formula to calculate space required by bloom filter can > be found on slide 27 here: > http://fr.slideshare.net/quipo/modern-algorithms-and-data-structures-1-bloom-filters-merkle-trees > > false positive chance = 0.6185 * m/n where m = number of bits for the > filter and n = number of distinct keys > > For example, if you want to index 1 million of rows, each having 100 000 > columns on average, it will end up indexing 100 billions of keys (row keys > & column names) with bloom filter. > > By applying the above formula, m ≈ 4.8 * 10^11 bits ≈ 60Gb to allocate in > RAM just for bloom filter on all row keys & column names ... > > Regards > > Duy Hai DOAN > > On Sun, Sep 14, 2014 at 11:22 AM, Philo Yang <ud1...@gmail.com> wrote: > >> Hi all, >> >> After reading some docs, I find that bloom filter is built on row keys, >> not on column key. Can anyone tell me what is considered for not building >> bloom filter on column key? Is it a good idea to offer a table property >> option between row key and primary key for what boolm filter is built on? >> >> Thanks, >> Philo Yang >> >> >