Hi Tyler,

I tried what you said and false positives look much more reasonable there.  
Thanks for looking into this.


----- Original Message -----
From: "Tyler Hobbs" <ty...@datastax.com>
To: user@cassandra.apache.org
Sent: Friday, December 19, 2014 1:25:29 PM
Subject: Re: High Bloom Filter FP Ratio

I took a look at the code where the bloom filter true/false positive
counters are updated and noticed that the true-positive count isn't being
updated on key cache hits:
https://issues.apache.org/jira/browse/CASSANDRA-8525.  That may explain
your ratios.
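
To make the effect concrete, here is a rough sketch. It assumes the
"Bloom filter false ratio" in cfstats is computed as false positives
divided by (false positives + true positives); that is an assumption
about the metric, not something confirmed in this thread. The function
names are made up for illustration and are not Cassandra APIs, and the
key cache hit count at the end is a hypothetical figure.

```python
# Sketch only: assumes cfstats reports
#   false ratio = false_positives / (false_positives + true_positives)
# The function names are illustrative, not part of Cassandra.

def false_ratio(false_positives: int, true_positives: int) -> float:
    """Ratio under the assumed formula; 0.0 if no positives were counted."""
    total = false_positives + true_positives
    return false_positives / total if total else 0.0


def implied_true_positives(false_positives: int, ratio: float) -> float:
    """Invert the assumed formula to estimate the counted true positives."""
    return false_positives * (1.0 - ratio) / ratio


# False-positive counts and ratios taken from the cfstats output quoted below.
for table, fp, ratio in [("contact", 11096, 0.99197),
                         ("logged_event", 147705, 0.49020)]:
    print(table, round(implied_true_positives(fp, ratio)))

# Hypothetical: if 1,000,000 reads served via the key cache had also been
# counted as true positives, the same 11096 false positives would yield a
# far smaller ratio.
print(false_ratio(11096, 90 + 1_000_000))
```

Under that assumption, a table whose reads are mostly served through the
key cache records very few true positives, so even a modest number of
false positives can push the reported ratio toward 1.0.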

Can you try querying for a few non-existent partition keys in cqlsh with
tracing enabled (just run "TRACING ON") and see if you really do get that
high of a false-positive ratio?

On Fri, Dec 19, 2014 at 9:59 AM, Mark Greene <green...@gmail.com> wrote:
> We're seeing similar behavior except our FP ratio is closer to 1.0 (100%).
> We're using Cassandra 2.1.2.
> Schema
> -----------------------------------------------------------------------
> CREATE TABLE contacts.contact (
>     id bigint,
>     property_id int,
>     created_at bigint,
>     updated_at bigint,
>     value blob,
>     PRIMARY KEY (id, property_id)
> *    AND bloom_filter_fp_chance = 0.001*
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = ''
>     AND compaction = {'min_threshold': '4', 'class':
> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy',
> 'max_threshold': '32'}
>     AND compression = {'sstable_compression':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99.0PERCENTILE';
> CF Stats Output:
> -------------------------------------------------------------------------
> Keyspace: contacts
>     Read Count: 2458375
>     Read Latency: 0.8528440766766665 ms.
>     Write Count: 10357
>     Write Latency: 0.1816912233272183 ms.
>     Pending Flushes: 0
>         Table: contact
>         SSTable count: 61
>         SSTables in each level: [1, 10, 50, 0, 0, 0, 0, 0, 0]
>         Space used (live): 9047112471
>         Space used (total): 9047112471
>         Space used by snapshots (total): 0
>         SSTable Compression Ratio: 0.34119240020241487
>         Memtable cell count: 24570
>         Memtable data size: 1299614
>         Memtable switch count: 2
>         Local read count: 2458290
>         Local read latency: 0.853 ms
>         Local write count: 10044
>         Local write latency: 0.186 ms
>         Pending flushes: 0
>         Bloom filter false positives: 11096
> *        Bloom filter false ratio: 0.99197*
>         Bloom filter space used: 3923784
>         Compacted partition minimum bytes: 373
>         Compacted partition maximum bytes: 152321
>         Compacted partition mean bytes: 9938
>         Average live cells per slice (last five minutes): 37.57851240677983
>         Maximum live cells per slice (last five minutes): 63.0
>         Average tombstones per slice (last five minutes): 0.0
>         Maximum tombstones per slice (last five minutes): 0.0
> --
> about.me <http://about.me/markgreene>
> On Wed, Dec 17, 2014 at 1:32 PM, Chris Hart <ch...@remilon.com> wrote:
>> Hi,
>> I have created the following table with bloom_filter_fp_chance=0.01:
>> CREATE TABLE logged_event (
>>   time_key bigint,
>>   partition_key_randomizer int,
>>   resource_uuid timeuuid,
>>   event_json text,
>>   event_type text,
>>   field_error_list map<text, text>,
>>   javascript_timestamp timestamp,
>>   javascript_uuid uuid,
>>   page_impression_guid uuid,
>>   page_request_guid uuid,
>>   server_received_timestamp timestamp,
>>   session_id bigint,
>>   PRIMARY KEY ((time_key, partition_key_randomizer), resource_uuid)
>> ) WITH
>>   bloom_filter_fp_chance=0.010000 AND
>>   caching='KEYS_ONLY' AND
>>   comment='' AND
>>   dclocal_read_repair_chance=0.000000 AND
>>   gc_grace_seconds=864000 AND
>>   index_interval=128 AND
>>   read_repair_chance=0.000000 AND
>>   replicate_on_write='true' AND
>>   populate_io_cache_on_flush='false' AND
>>   default_time_to_live=0 AND
>>   speculative_retry='99.0PERCENTILE' AND
>>   memtable_flush_period_in_ms=0 AND
>>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
>>   compression={'sstable_compression': 'LZ4Compressor'};
>> When I run cfstats, I see a much higher false positive ratio:
>>                 Table: logged_event
>>                 SSTable count: 15
>>                 Space used (live), bytes: 104128214227
>>                 Space used (total), bytes: 104129482871
>>                 SSTable Compression Ratio: 0.3295840184239226
>>                 Number of keys (estimate): 199293952
>>                 Memtable cell count: 56364
>>                 Memtable data size, bytes: 20903960
>>                 Memtable switch count: 148
>>                 Local read count: 1396402
>>                 Local read latency: 0.362 ms
>>                 Local write count: 2345306
>>                 Local write latency: 0.062 ms
>>                 Pending tasks: 0
>>                 Bloom filter false positives: 147705
>>                 Bloom filter false ratio: 0.49020
>>                 Bloom filter space used, bytes: 249129040
>>                 Compacted partition minimum bytes: 447
>>                 Compacted partition maximum bytes: 315852
>>                 Compacted partition mean bytes: 1636
>>                 Average live cells per slice (last five minutes): 0.0
>>                 Average tombstones per slice (last five minutes): 0.0
>> Any idea what could be causing this?  This is timeseries data.  Every
>> time we read from this table, we read a single row key with 1000
>> partition_key_randomizer values.  I'm running cassandra 2.0.11.  I tried
>> running an upgradesstables to rewrite them, which didn't change this
>> behavior at all.  I'm using size tiered compaction and I haven't done any
>> major compactions.
>> Thanks,
>> Chris

Tyler Hobbs
DataStax <http://datastax.com/>
