Hi,

We have been evaluating Managed Streaming for Kafka (MSK) on AWS for a use case that requires high-speed data ingestion on the order of millions of messages (each ~1 KB in size) per second. We ran into some issues when testing this case.
Context:

To start with, we set up a single topic with 3 partitions on a 3-node MSK cluster of m5.large brokers (2 cores, 8 GB RAM, 500 GB EBS each), with encryption enabled for inter-broker (intra-MSK) communication. Each broker is in a separate AZ (3 AZs, 3 brokers in total) and has 10 network threads and 16 IO threads.

With the topic at replication-factor = 2 and min.insync.replicas = 2, and the publishers using acks = all, sending 100+ million messages from 3 parallel publishers intermittently results in the following error:

`Delivery failed: Broker: Not enough in-sync replicas`

As per the documentation, this error is thrown when in-sync replicas lag behind for more than a configured duration (replica.lag.time.max.ms = 30 seconds by default). When we don't see this error, the throughput is around 90 K msgs/sec, i.e. 90 MB/sec. CPU usage is below 50% and disk usage is below 20%, so CPU/memory/disk do not appear to be the bottleneck.

If we change to replication-factor = 1 and min.insync.replicas = 1 and/or acks = 1, keeping everything else the same, there are no errors and throughput is ~380 K msgs/sec, i.e. 380 MB/sec, with CPU usage below 30%.

Question:

Without replication we were able to get 380 MB/sec written, so disk, CPU, and memory do not seem to be the limiting factor. What could cause the replicas to lag behind at only 90 MB/sec? Is the total thread count (10 network + 16 IO) too high for a 2-core machine? But the same thread settings work fine without replication.

In short, what could be the reason for (1) lower throughput when replication is turned on, and (2) replicas lagging behind when replication is turned on?

Thanks,
Arti
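P.S. For reference, a minimal sketch of the publisher side as described above (the failing acks = all case). This assumes a librdkafka-based Python client (confluent-kafka); the broker list, topic name, and linger.ms value are placeholders, not our exact test settings:

```python
from confluent_kafka import Producer

# Placeholder broker endpoints; the real cluster has 3 m5.large brokers, one per AZ.
producer = Producer({
    'bootstrap.servers': 'b-1.msk:9092,b-2.msk:9092,b-3.msk:9092',
    'acks': 'all',     # wait for all in-sync replicas (the case that fails intermittently)
    'linger.ms': 5,    # placeholder batching delay
})

def on_delivery(err, msg):
    # This callback is where "Delivery failed: Broker: Not enough in-sync replicas"
    # (NOT_ENOUGH_REPLICAS) shows up in our runs.
    if err is not None:
        print(f'Delivery failed: {err}')

payload = b'x' * 1024  # ~1 KB message, matching our test payload size

for _ in range(1_000_000):
    while True:
        try:
            producer.produce('ingest-topic', value=payload, callback=on_delivery)
            break
        except BufferError:
            producer.poll(1)  # local queue full: wait for in-flight messages to drain
    producer.poll(0)          # serve delivery callbacks

producer.flush()
```

(The topic itself was created separately with 3 partitions, replication-factor = 2, and min.insync.replicas = 2 for the failing case.)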