The VMs’ memory (4 GB) seems pretty small for Cassandra. What heap size are you 
using? Which garbage collector? Are you seeing long GC times on the nodes? The 
basic rule of thumb is to give the Cassandra heap 50% of the RAM on the host. 2 
GB isn’t very much.
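
For reference, on 3.11 the heap is set in conf/jvm.options (or via conf/cassandra-env.sh). A minimal sketch, assuming the default install layout; the values are only illustrative:

    # conf/jvm.options -- illustrative values, size for your hosts
    -Xms2G
    -Xmx2G

    # or equivalently in conf/cassandra-env.sh:
    # MAX_HEAP_SIZE="2G"
    # HEAP_NEWSIZE="400M"

"nodetool gcstats" on each node (or the GCInspector lines in system.log) will show how long the pauses actually are.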

Also, I wouldn’t set the replication factor to 5 (the number of nodes). If RF 
is always equal to the number of nodes, you can’t really scale beyond the size 
of the disk on any one node (all data is on each node). A replication factor of 
3 would be more like a typical production set-up.
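
If you do want to drop RF, something like the following would do it (assuming the YCSB keyspace is named "ycsb" and uses SimpleStrategy -- adjust to your actual keyspace name and strategy):

    ALTER KEYSPACE ycsb
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

followed by "nodetool cleanup" on each node to get rid of the replicas the nodes no longer own.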


Sean Durity

From: Daniel Seybold <daniel.seyb...@uni-ulm.de>
Sent: Friday, November 09, 2018 5:49 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Availability issues for write/update/read workloads (up to 
100s downtime) in case of a Cassandra node failure

Hi Apache Cassandra experts,

we are running a set of availability evaluations under write/read/update 
workloads with Apache Cassandra and are seeing some unexpected results, i.e. 
0 ops/s over periods of up to 100 s.

In order to provide a clear picture, please find below the details of (1) the 
setup, (2) the evaluation workflow and (3) the results.

1. Setup:
Cassandra version: 3.11.2
Cluster size: 5 nodes
Replication Factor: 5
Each node runs in the same private OpenStack-based cloud, within the same 
availability zone, and uses the private network.
Each node runs Ubuntu 16.04 Server and has 2 cores, 4 GB RAM and a 50 GB 
disk.

Workload:
Yahoo Cloud Serving Benchmark 0.12
W1: 100% write
W2: 100% read
W3: 100% update
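
For reference, the workloads are defined as YCSB CoreWorkload property files 
roughly as follows (the record/operation counts below are placeholders, not the 
exact values used in our runs):

    # W1: 100% write (insert)
    workload=com.yahoo.ycsb.workloads.CoreWorkload
    recordcount=1000000
    operationcount=1000000
    insertproportion=1.0
    readproportion=0
    updateproportion=0
    scanproportion=0

    # W2: 100% read   -> readproportion=1.0, all other proportions 0
    # W3: 100% update -> updateproportion=1.0, all other proportions 0

    # executed against the cluster via, e.g.:
    # bin/ycsb run cassandra-cql -P workloads/w1 -p hosts="node1,node2,node3,node4,node5"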

2. Evaluation Workflow:

1. allocate 5 VMs & deploy DBMS cluster
2. start a YCSB workload (only one of W1-W3), which runs for up to 30 minutes
3. wait for 200s
4. select a random node in the cluster and delete its VM without stopping 
Cassandra first
5. analyze throughput time series over the evaluation
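
As a sanity check between steps 4 and 5, the surviving nodes can be asked 
whether they have noticed the failure:

    # run on any surviving node after the VM has been deleted in step 4
    nodetool status    # the deleted node should eventually be reported as "DN" (down)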

3. (Unexpected) Results

We expected to see a (slight) drop in throughput as soon as the VM was 
deleted.
But the throughput results show that there are periods of ~10 s to 150 s (not 
deterministic) where no operations are executed (all metrics are collected on 
the client side).
Yet, there are no timeout exceptions on the client side, and the logs on the 
cluster side also do not show anything that explains this behaviour.

I attached a series of plots which show the throughput and the downtimes over 
the evaluation runs.

Do you have any explanation for this behaviour, or recommendations on how to 
reduce the potential "downtime"?

Thanks in advance for any help and recommendations,

Cheers,
Daniel




--

M.Sc. Daniel Seybold

Universität Ulm
Institut Organisation und Management von Informationssystemen (OMI)
Albert-Einstein-Allee 43
89081 Ulm
Phone: +49 (0)731 50-28 799

