Re: [EXTERNAL] Writes and Reads with high latency

2018-12-28 Thread Marco Gasparini
- How many event_datetime records can you have per pkey?
During a working day I can have fewer than 10 event_datetime records per
pkey.
Every day I keep at most 3 of them, so each new event_datetime for a
pkey results in a delete and an insert into Cassandra.
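
For reference, a minimal sketch of what the table and the delete-then-insert
pattern described above might look like; the schema and the 'payload' column
are assumptions for illustration, not the actual definition:

-- hypothetical layout: pkey as partition key, event_datetime as clustering column
CREATE TABLE my_keyspace.my_table (
    pkey           text,
    event_datetime timestamp,
    payload        blob,          -- stand-in for the real data columns
    PRIMARY KEY (pkey, event_datetime)
);

-- on each new event: remove the oldest of the (max) 3 kept rows, then insert the new one
DELETE FROM my_keyspace.my_table WHERE pkey = ? AND event_datetime = ?;
INSERT INTO my_keyspace.my_table (pkey, event_datetime, payload) VALUES (?, ?, ?);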

- How many pkeys (roughly) do you have?
A few million, but the number is going to grow.


- In general, you only want to have at most 100 MB of data per partition
(pkey). If it is larger than that, I would expect some timeouts. I suspect
you either have very wide rows or lots of tombstones.

I ran some nodetool commands in order to give you more data:

CFSTATS output:

nodetool cfstats my_keyspace.my_table -H
Total number of tables: 52

Keyspace : my_keyspace
Read Count: 2441795
Read Latency: 400.53986035478 ms
Write Count: 5097368
Write Latency: 6.494159368913525 ms
Pending Flushes: 0
Table: my_table
SSTable count: 13
Space used (live): 185.45 GiB
Space used (total): 185.45 GiB
Space used by snapshots (total): 0 bytes
Off heap memory used (total): 80.66 MiB
SSTable Compression Ratio: 0.2973552755387901
Number of partitions (estimate): 762039
Memtable cell count: 915
Memtable data size: 43.75 MiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 598
Local read count: 2441795
Local read latency: 93.186 ms
Local write count: 5097368
Local write latency: 3.189 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 5719
Bloom filter false ratio: 0.0
Bloom filter space used: 1.65 MiB
Bloom filter off heap memory used: 1.65 MiB
Index summary off heap memory used: 1.17 MiB
Compression metadata off heap memory used: 77.83 MiB
Compacted partition minimum bytes: 104
Compacted partition maximum bytes: 20924300
Compacted partition mean bytes: 529420
Average live cells per slice (last five minutes): 2.0
Maximum live cells per slice (last five minutes): 3
Average tombstones per slice (last five minutes): 7.423841059602649
Maximum tombstones per slice (last five minutes): 50
Dropped Mutations: 0 bytes



CFHISTOGRAMS output:

nodetool cfhistograms my_keyspace my_table
my_keyspace/my_table histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size    Cell Count
                              (micros)          (micros)           (bytes)
50%            10.00            379.02           1955.67            379022             8
75%            12.00            654.95         186563.16            654949            17
95%            12.00          20924.30         268650.95           1629722            35
98%            12.00          20924.30         322381.14           2346799            42
99%            12.00          20924.30         386857.37           3379391            50
Min             0.00              6.87             88.15               104             0
Max            12.00          25109.16         464228.84          20924300           179

I enabled 'TRACING ON' in the cqlsh CLI and ran some queries in order
to find out whether tombstones are being scanned frequently,
but in my small sample of queries I got mostly similar answers, like the
following:

Preparing statement [Native-Transport-Requests-1]
Executing single-partition query on my_table [ReadStage-2]
Acquiring sstable references [ReadStage-2]
Bloom filter allows skipping sstable 2581 [ReadStage-2]
Bloom filter allows skipping sstable 2580 [ReadStage-2]
Bloom filter allows skipping sstable 2575 [ReadStage-2]
Partition index with 2 entries found for sstable 2570 [ReadStage-2]
Bloom filter allows skipping sstable 2548 [ReadStage-2]
Bloom filter allows skipping sstable 2463 [ReadStage-2]
Bloom filter allows skipping sstable 2416 [ReadStage-2]
Partition index with 3 entries found for sstable 2354 [ReadStage-2]
Bloom filter allows skipping sstable 1784 [ReadStage-2]
Partition index with 5 entries found for sstable 1296 [ReadStage-2]
Partition index with 3 entries found for sstable 1002 [ReadStage-2]
Partition index with 3 entries found for sstable 372 [ReadStage-2]
Skipped 0/12 non-slice-intersecting sstables, included 0 due to tombstones
[ReadStage-2]
Merged data from memtables and 5 sstables [ReadStage-2]
Read 3 live rows and 0 tombstone cells [ReadStage-2]
Request complete


- Since you mention lots of deletes, I am thinking it could be tombstones.
Are you getting any tombstone warnings or errors in your system.log?

For each pkey, a new event_datetime makes me delete one of the (max)
3 previously saved records in Cassandra.
If a pkey doesn't exist in Cassandra I store it with its
event_datetime without deleting anything.

In Cassandra's logs I don't have any tombstone warning or error.


- When you delete, are you deleting a full partition?

Query for deletes:
delete from my_keyspace.my_table where pkey = ? and event_datetime = ? IF EXISTS;
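
For clarity, the query above deletes a single clustering row, not the whole
partition. A hedged sketch of the difference, with the column roles assumed
from the queries in this thread:

-- partition-level delete: removes every row under the partition key
DELETE FROM my_keyspace.my_table WHERE pkey = ?;

-- row-level delete, as used above: removes only one clustering row
DELETE FROM my_keyspace.my_table WHERE pkey = ? AND event_datetime = ?;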


-  [..] And because only one node has the data, a single timeout means you
won’t get any data.

I will try to increase RF from 1 to 3.
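
If it helps, a minimal sketch of what that change might look like; the
replication strategy and the datacenter name 'dc1' are assumptions about the
cluster, and the new replicas have to be built with a repair afterwards:

ALTER KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

-- then stream data to the new replicas, e.g. on each node:
-- nodetool repair --full my_keyspace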


I hope to have answered all of your questions.

Re: Is there any chance the bootstrapping lost data?

2018-12-28 Thread Jeff Jirsa




> On Dec 28, 2018, at 2:17 AM, Jinhua Luo  wrote:
> 
> Hi All,
> 
> While the pending node is streaming token ranges from other nodes,
> all coordinators would send new writes to it so that it would not miss
> any new data, correct?
> 
> I have two (maybe silly) questions here:
> Given the CL is ONE,
> a) what if the coordinator hasn't met the pending node via gossip,
> and only sends the mutation to the main replica (the replica that would
> be replaced by the pending node)?

There’s a delay between joining in gossip and calculating the bootstrap
streaming plan to mitigate this.

There are also protections added in recent versions to avoid ack’ing a write
(or read or stream) for ranges the node doesn’t properly own, so
topology/gossip disagreements shouldn’t result in consistency violations.

> b) what if the coordinator fails to send the mutation to the pending node?

The coordinators increase the consistency level / blockFor by one for each pending
node (e.g. at CL ONE the coordinator waits for two acks instead of one while a node
is bootstrapping), so the pending node may not get the write if your RF is > 1, but
enough nodes will that you’ll still meet your consistency guarantee.

> 
> In both cases, when the pending node finishes streaming and eventually
> joins the ring, would the mutation mentioned above be lost?
> 
> Regards,
> Jinhua Luo
> 



Re: Sporadic high IO bandwidth and Linux OOM killer

2018-12-28 Thread Oleksandr Shulgin
On Fri, Dec 7, 2018 at 12:43 PM Oleksandr Shulgin <oleksandr.shul...@zalando.de> wrote:

>
> After a fresh JVM start the memory allocation looks roughly like this:
>
>              total       used       free     shared    buffers     cached
> Mem:           14G        14G       173M       1.1M        12M       3.2G
> -/+ buffers/cache:        11G       3.4G
> Swap:           0B         0B         0B
>
> Then, within a number of days, the allocated disk cache shrinks all the
> way down to unreasonable numbers like only 150M.  At the same time "free"
> stays at the original level and "used" grows all the way up to 14G.
> Shortly after that the node becomes unavailable because of the IO and
> ultimately after some time the JVM gets killed.
>
> Most importantly, the resident size of JVM process stays at around 11-12G
> all the time, like it was shortly after the start.  How can we find where
> the rest of the memory gets allocated?  Is it just some sort of malloc
> fragmentation?
>

For the ones following along at home, here's what we ended up with so far:

0. Switched to the next-biggest EC2 instance type, r4.xlarge, and the
symptoms are gone.  Our bill is dominated by the price of EBS storage, so this
is much less than a 2x increase in total.

1. We've noticed that increased memory usage correlates with the number of
SSTables on disk.  When the number of files on disk decreases, available
memory increases.  This leads us to think that the extra memory allocation is
indeed due to the use of mmap.  It is not clear how we could account for that.

2. Improved our monitoring to include number of files (via total - free
inodes).

Given the cluster's resource utilization, it still feels like r4.large
would be a good fit, if only we could figure out those few "missing" GB of
RAM. ;-)

Cheers!
--
Alex


Re: Sporadic high IO bandwidth and Linux OOM killer

2018-12-28 Thread Yuri de Wit
We




Re: Sporadic high IO bandwidth and Linux OOM killer

2018-12-28 Thread Jeff Jirsa
I’ve lost some context, but there are two direct memory allocations per sstable:
compression offsets and the bloom filter. Both of those get built during
sstable creation, and the bloom filter’s size is aggressively allocated, so
you’ll see a big chunk of memory disappear as compaction kicks off, based on the
estimated number of keys.

Are you sure that’s not what you’re seeing? If it is, dropping the bloom filter FP
ratio or increasing the compression chunk size may help (increasing the chunk size
probably saves you some disk as well: you’ll get better compression ratios, at the
cost of slightly slower reads).
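
For illustration only, roughly what those two knobs look like in CQL; the
keyspace/table name and the values below are placeholders, not recommendations:

-- a hedged sketch, assuming a table named my_keyspace.my_table
ALTER TABLE my_keyspace.my_table
  WITH bloom_filter_fp_chance = 0.1    -- a higher FP chance gives a smaller bloom filter
  AND compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 64};  -- larger chunks compress better, reads get slightly slower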

-- 
Jeff Jirsa

