2020-06-17 15:07:12 UTC - Chris Herzog: @Chris Herzog has joined the channel
----
2020-06-17 16:17:58 UTC - Leonard Ge: @Leonard Ge has joined the channel
----
2020-06-17 17:59:36 UTC - Pedro Cardoso: @Sijie Guo Is the blog post out? I'm also very interested in hearing Pulsar's version.
----
2020-06-17 18:05:35 UTC - Sijie Guo: @Pedro Cardoso Not yet. We have the draft. It is under review by a bunch of people in the community. We will probably publish it in 1-2 weeks or so.
----
2020-06-17 18:07:38 UTC - Pedro Cardoso: Could you share it? Even as a draft it would still be useful: right now I'm benchmarking Kafka (considering Pulsar as a future improvement) and have some weird results I would like to confirm.
----
2020-06-17 18:07:50 UTC - Pedro Cardoso: If you have data to back up the claims, it would be even better.
----
2020-06-17 18:10:33 UTC - Sijie Guo: @Pedro Cardoso It is more about clarifying statements. Are you looking for benchmark results?
----
2020-06-17 18:10:57 UTC - Pedro Cardoso: As a start, yes. I'm writing a message in the main chat. Give me a couple of minutes.
----
2020-06-17 18:13:03 UTC - Sijie Guo: If you have performance issues, you can post the questions and we can help with them. For benchmark results, I think Splunk is going to publish their results soon. /cc @Karthik Ramasamy
----
2020-06-17 18:13:15 UTC - Pedro Cardoso: Thank you so much!
----
2020-06-17 18:18:13 UTC - Pedro Cardoso: Hello, does anyone have benchmark results comparing Kafka and Pulsar they could share?
Context: In my company we are looking for the messaging layer of what will be a mission-critical, millisecond-scale streaming engine, and we are running Kafka benchmarks to understand its performance for our use case (max latency <100 ms at 5k TPS with a 99.999 SLA; at 99.95 the SLA is <8 ms).
Our benchmark is very simple: a single producer and consumer. We generate a 2 KB byte-buffer payload once, send it through the producer, and consume it (no custom logic, just an ack).
The setup is a 3-node cluster in AWS, all nodes in different availability zones (we have tried Kubernetes, regular VMs, and even AWS's managed service). Replication factor of 3, 64 partitions, maximum ack level (processing and replica broker acks) with random partitioning.
We use Zing's pauseless GC to reduce GC-induced latencies and SSDs to make sure we are not I/O bound, but we still see behaviour we cannot account for. What we are seeing is a few latency spikes, always on the order of 200 ms, which are over our SLAs:
Producer: <https://drive.google.com/file/d/1duvmEKH56WwRisIh6v5p9bDDAWN_66VG/view?usp=sharing>
Consumer: <https://drive.google.com/file/d/1KHX5y5YU84PNrC31Uc4QbUeLso-CZRxc/view?usp=sharing>
In tabular form, the latencies per percentile look something like:

| latency (ms) | 0     | 25    | 50    | 75    | 90    | 95    | 99    | 99.9  | 99.99  | 99.999  | 100     |
|:------------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:------:|:-------:|:-------:|
| producer     | 1     | 1.83  | 2.274 | 3.054 | 3.872 | 4.633 | 6.588 | 9.72  | 45.353 | 191.029 | 214.186 |
| consumer     | 1.059 | 1.683 | 2.255 | 3.092 | 3.966 | 4.695 | 6.623 | 9.871 | 117.38 | 199.024 | 214.254 |

(Sorry... Slack does not support markdown tables....)
Has anyone seen this "spiky" type of behavior in Kafka, and, more pertinent to this chat, what about in Pulsar? It does not seem to be GC (we use Zing and confirm <1 ms GC events with a stable heap, 3 GB statically allocated via JVM options), CPU or memory (we are over-provisioning machines, m5ad.2xlarge to be exact), disk (300 GB SSDs + 20 GB EBS) or network (10 Gbit connection).
----
2020-06-17 19:06:19 UTC - Vijay Bhore: @Vijay Bhore has joined the channel
----
2020-06-17 19:18:37 UTC - David Kjerrumgaard: <https://medium.com/@manrai.tarun/apache-pulsar-outperforms-apache-kafka-by-2-5x-on-openmessaging-benchmark-4838c14a541f>
----
2020-06-17 19:19:06 UTC - David Kjerrumgaard: And you can run your own benchmark tests using the framework: <https://github.com/openmessaging/openmessaging-benchmark>
----
2020-06-17 19:24:51 UTC - Andrew: @Andrew has joined the channel
----
2020-06-17 20:02:14 UTC - Pedro Cardoso: Thank you, David! I may end up running the benchmarks adapted to my use case and comparing both technologies :slightly_smiling_face:
:crossed_fingers: : David Kjerrumgaard
----
2020-06-18 05:01:31 UTC - Amit Pal:
1. Can you verify the network is not the culprit here? Generally, TCP retransmissions / packet loss are the reasons for such latencies... I believe I/O wouldn't be saturated here.
2. Can you look at the Pulsar broker metrics? Those will tell you the real picture, and whether the network is an issue at all.
3. Use a mix of sync/async replication on the message producer (depending on how much durability you want).
----
2020-06-18 07:15:04 UTC - Patrik Kleindl: Just a side note: in this test Kafka was intentionally configured against best practices to mirror what Pulsar does. The performance of Pulsar is great nonetheless, but Kafka is made to look worse on purpose.
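As an aside for anyone reproducing Pedro's setup: below is a minimal, illustrative sketch of the producer side he describes (single producer, 2 KB payload generated once, replication factor 3, maximum ack level), assuming the standard `kafka-clients` Java API. The bootstrap servers, topic name, and iteration count are placeholders, not the actual benchmark code.

```java
import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LatencyProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder addresses for the 3-node, multi-AZ cluster described above.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                  "broker-1:9092,broker-2:9092,broker-3:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        // "Maximum ack level": wait for all in-sync replicas to acknowledge the write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Send records as soon as possible instead of batching them up.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "0");

        byte[] payload = new byte[2 * 1024];          // 2 KB payload, generated once
        ThreadLocalRandom.current().nextBytes(payload);

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100_000; i++) {
                long start = System.nanoTime();
                // No key: the partition is picked by the default partitioner
                // ("random partitioning" over the 64 partitions in Pedro's setup).
                producer.send(new ProducerRecord<>("latency-test", payload),
                              (metadata, exception) -> {
                                  if (exception != null) {
                                      exception.printStackTrace();
                                      return;
                                  }
                                  long micros = (System.nanoTime() - start) / 1_000;
                                  // Feed 'micros' into a histogram (e.g. HdrHistogram)
                                  // to extract percentiles like those in the table above.
                              });
            }
            producer.flush();
        }
    }
}
```

Note that `acks=all` only covers the producer side; the `min.insync.replicas=2` setting discussed further down the thread is a topic/broker-level config that determines how many of the 3 replicas must be in sync for such a write to succeed.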
----
2020-06-18 07:38:25 UTC - PLarboulette: @PLarboulette has joined the channel
----
2020-06-18 08:55:46 UTC - Pedro Cardoso: Indeed, it looks like that is the case; the blog post itself does not show a lot of data. I was looking for something more akin to <https://kafkaesque.io/performance-comparison-between-apache-pulsar-and-kafka-latency/>
----
2020-06-18 09:02:40 UTC - Pedro Cardoso:
1.) Network-wise, monitoring does not show saturation; we do not reach 10 Gbit.
2.) As of right now these are Kafka-only tests. My intent in posting the question here is to ask whether, in this chat's opinion, when comparing Pulsar vs Kafka you detected similar latency spikes in Kafka but not in Pulsar. Taking <https://kafkaesque.io/performance-comparison-between-apache-pulsar-and-kafka-latency/> as an example, I see similar behaviour for Kafka but not for Pulsar. However, there are still >200 ms maximum latencies. What causes those in Pulsar?
3.) For our use case we need durability; for Kafka we use `min.insync.replicas=2`. I believe Pulsar's default is equivalent?
----
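On the durability question above: Pulsar has no single `min.insync.replicas` knob; the closest analogue is the BookKeeper ensemble size / write quorum / ack quorum persistence policy, which in the standard broker configuration defaults to 2/2/2 (every entry written to 2 bookies, both must ack). Below is a minimal sketch of setting an explicit 3/3/2 policy with the Pulsar Java admin and client APIs, roughly comparable to Kafka's `replication.factor=3` with `min.insync.replicas=2`; the service URLs, tenant/namespace, and topic are hypothetical placeholders.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class PulsarDurabilityExample {
    public static void main(String[] args) throws Exception {
        // Placeholder admin URL for a hypothetical 3-node cluster.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://broker-1:8080")
                .build()) {
            // Roughly analogous to Kafka's replication.factor=3 + min.insync.replicas=2:
            // write each entry to 3 bookies (ensemble 3, write quorum 3) and require
            // 2 acknowledgements before the write is confirmed (ack quorum 2).
            admin.namespaces().setPersistence("my-tenant/my-namespace",
                    new PersistencePolicies(3, 3, 2, 0.0));
        }

        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://broker-1:6650")
                .build();
             Producer<byte[]> producer = client.newProducer()
                .topic("persistent://my-tenant/my-namespace/latency-test")
                .create()) {
            // send() blocks until the broker reports the ack quorum has been met,
            // so the measured latency includes the replicated, durable write.
            producer.send(new byte[2 * 1024]);
        }
    }
}
```

Because the ack quorum (2) is smaller than the write quorum (3), a single slow bookie does not stall the producer, which is often cited as one reason Pulsar's tail latencies can behave differently from Kafka's in this kind of test.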