Hi,

We have been evaluating Managed Streaming for Kafka (MSK) on AWS for a use case that requires high-speed data ingestion on the order of millions of messages (each ~1 KB in size) per second. We ran into some issues when testing this case.
Context:

To start with, we set up a single topic with 3 partitions on a 3-node MSK cluster of m5.large brokers (2 cores, 8 GB RAM, 500 GB EBS each), with encryption enabled for inter-broker (intra-MSK) communication. Each broker is in a separate AZ (3 AZs, 3 brokers in total) and has 10 network threads and 16 IO threads.

With the topic at replication-factor = 2 and min.insync.replicas = 2, and the publishers using acks = all, sending 100+ million messages from 3 parallel publishers intermittently results in the following error:

`Delivery failed: Broker: Not enough in-sync replicas`

As per the documentation, this error is thrown when in-sync replicas lag behind for more than a configured duration (replica.lag.time.max.ms = 30 seconds by default). When we don't see this error, the throughput is around 90 K msgs/sec, i.e. 90 MB/sec. CPU usage is below 50% and disk usage is below 20%, so CPU/memory/disk do not appear to be the bottleneck.

If we change to replication-factor = 1 and min.insync.replicas = 1 and/or acks = 1, keeping everything else the same, there are no errors and throughput is ~380 K msgs/sec, i.e. 380 MB/sec, with CPU usage below 30%.

Question:

Without replication we were able to get 380 MB/sec written, so disk, CPU, and memory do not seem to be the limiting factor. What could cause the replicas to lag behind at only 90 MB/sec? Is the total thread count (10 network + 16 IO) too high for a 2-core machine? But the same thread settings work fine without replication.

In short, what could be the reason for (1) lower throughput when replication is turned on, and (2) replicas lagging behind when replication is turned on?

Thanks,
Arti
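P.S. For reference, a minimal sketch of the publisher side as described above (the failing acks = all case). This assumes a librdkafka-based Python client (confluent-kafka); the broker list, topic name, and linger.ms value are placeholders, not our exact test settings:

```python
from confluent_kafka import Producer

# Placeholder broker endpoints; the real cluster has 3 m5.large brokers, one per AZ.
producer = Producer({
    'bootstrap.servers': 'b-1.msk:9092,b-2.msk:9092,b-3.msk:9092',
    'acks': 'all',     # wait for all in-sync replicas (the case that fails intermittently)
    'linger.ms': 5,    # placeholder batching delay
})

def on_delivery(err, msg):
    # This callback is where "Delivery failed: Broker: Not enough in-sync replicas"
    # (NOT_ENOUGH_REPLICAS) shows up in our runs.
    if err is not None:
        print(f'Delivery failed: {err}')

payload = b'x' * 1024  # ~1 KB message, matching our test payload size

for _ in range(1_000_000):
    while True:
        try:
            producer.produce('ingest-topic', value=payload, callback=on_delivery)
            break
        except BufferError:
            producer.poll(1)  # local queue full: wait for in-flight messages to drain
    producer.poll(0)          # serve delivery callbacks

producer.flush()
```

(The topic itself was created separately with 3 partitions, replication-factor = 2, and min.insync.replicas = 2 for the failing case.)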