2020-06-17 15:07:12 UTC - Chris Herzog: @Chris Herzog has joined the channel
----
2020-06-17 16:17:58 UTC - Leonard Ge: @Leonard Ge has joined the channel
----
2020-06-17 17:59:36 UTC - Pedro Cardoso: @Sijie Guo Is the blog post out? I'm also very interested in hearing Pulsar's version.
----
2020-06-17 18:05:35 UTC - Sijie Guo: @Pedro Cardoso Not yet. We have the draft. It is under review by a bunch of people in the community. We will probably publish it in 1-2 weeks or so.
----
2020-06-17 18:07:38 UTC - Pedro Cardoso: Could you share it? Even as a draft it would still be useful: right now I'm benchmarking Kafka (considering Pulsar as a future improvement) and have some weird results I would like to confirm.
----
2020-06-17 18:07:50 UTC - Pedro Cardoso: If you have data to back up the claims, it would be even better.
----
2020-06-17 18:10:33 UTC - Sijie Guo: @Pedro Cardoso It is more about clarifying statements. Are you looking for benchmark results?
----
2020-06-17 18:10:57 UTC - Pedro Cardoso: As a start, yes. I'm writing a message in the main chat. Give me a couple of minutes.
----
2020-06-17 18:13:03 UTC - Sijie Guo: If you have performance issues, you can post the questions and we can help with them. For benchmark results, I think Splunk is going to publish their results soon. /cc @Karthik Ramasamy
----
2020-06-17 18:13:15 UTC - Pedro Cardoso: Thank you so much!
----
2020-06-17 18:18:13 UTC - Pedro Cardoso: Hello, does anyone have benchmark results comparing Kafka and Pulsar they could share?
Context: In my company we are looking for the messaging layer of what will be a mission-critical, millisecond-scale streaming engine, and we are running Kafka benchmarks to understand its performance for our use case (max latency <100 ms at 5k TPS with a 99.999 SLA; at 99.95 the SLA is <8 ms).
Our benchmark is very simple: a single producer and consumer. We generate a 2 KB byte-buffer payload once, send it through the producer, and consume it (no custom logic, just an ack).
The setup is a 3-node cluster in AWS, all nodes in different availability zones (we have tried Kubernetes, regular VMs, and even AWS's managed service). Replication factor of 3, 64 partitions, maximum ack level (processing and replica broker acks) with random partitioning.
We use Zing's pauseless GC to reduce GC-induced latencies and SSDs to make sure we are not I/O bound, but we still see behaviour we cannot account for. What we are seeing is a few latency spikes, always on the order of 200 ms, which are over our SLAs:
Producer: <https://drive.google.com/file/d/1duvmEKH56WwRisIh6v5p9bDDAWN_66VG/view?usp=sharing>
Consumer: <https://drive.google.com/file/d/1KHX5y5YU84PNrC31Uc4QbUeLso-CZRxc/view?usp=sharing>
In tabular form, the latencies per percentile look something like:

| latency (ms) | 0     | 25    | 50    | 75    | 90    | 95    | 99    | 99.9  | 99.99  | 99.999  | 100     |
|:------------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:------:|:-------:|:-------:|
| producer     | 1     | 1.83  | 2.274 | 3.054 | 3.872 | 4.633 | 6.588 | 9.72  | 45.353 | 191.029 | 214.186 |
| consumer     | 1.059 | 1.683 | 2.255 | 3.092 | 3.966 | 4.695 | 6.623 | 9.871 | 117.38 | 199.024 | 214.254 |

(Sorry... Slack does not support markdown tables....)
Has anyone seen this "spiky" type of behavior in Kafka, and, more pertinent to this chat, what about in Pulsar? It does not seem to be GC (we use Zing and confirm <1 ms GC events with a stable heap, 3 GB statically allocated via JVM options), CPU or memory (we are over-provisioning machines, m5ad.2xlarge to be exact), disk (300 GB SSDs + 20 GB EBS) or network (10 Gbit connection).
----
2020-06-17 19:06:19 UTC - Vijay Bhore: @Vijay Bhore has joined the channel
----
2020-06-17 19:18:37 UTC - David Kjerrumgaard: <https://medium.com/@manrai.tarun/apache-pulsar-outperforms-apache-kafka-by-2-5x-on-openmessaging-benchmark-4838c14a541f>
----
2020-06-17 19:19:06 UTC - David Kjerrumgaard: And you can run your own benchmark tests using the framework: <https://github.com/openmessaging/openmessaging-benchmark>
----
2020-06-17 19:24:51 UTC - Andrew: @Andrew has joined the channel
----
2020-06-17 20:02:14 UTC - Pedro Cardoso: Thank you, David! I may end up running the benchmarks adapted to my use case and comparing both technologies :slightly_smiling_face:
:crossed_fingers: : David Kjerrumgaard
----
2020-06-18 05:01:31 UTC - Amit Pal:
1. Can you verify the network is not the culprit here? Generally, TCP retransmissions / packet loss are the reasons for such latencies... I believe I/O wouldn't be saturated here.
2. Can you look at the Pulsar broker metrics? Those will tell you the real picture, and whether the network is an issue at all.
3. Use a mix of sync/async replication on the message producer (depending on how much durability you want).
----
2020-06-18 07:15:04 UTC - Patrik Kleindl: Just a side note: in this test Kafka was intentionally configured against best practices to mirror what Pulsar does. The performance of Pulsar is great nonetheless, but Kafka is made to look worse on purpose.
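As an aside for anyone reproducing Pedro's setup: below is a minimal, illustrative sketch of the producer side he describes (single producer, 2 KB payload generated once, replication factor 3, maximum ack level), assuming the standard `kafka-clients` Java API. The bootstrap servers, topic name, and iteration count are placeholders, not the actual benchmark code.

```java
import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LatencyProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder addresses for the 3-node, multi-AZ cluster described above.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                  "broker-1:9092,broker-2:9092,broker-3:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        // "Maximum ack level": wait for all in-sync replicas to acknowledge the write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Send records as soon as possible instead of batching them up.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "0");

        byte[] payload = new byte[2 * 1024];          // 2 KB payload, generated once
        ThreadLocalRandom.current().nextBytes(payload);

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100_000; i++) {
                long start = System.nanoTime();
                // No key: the partition is picked by the default partitioner
                // ("random partitioning" over the 64 partitions in Pedro's setup).
                producer.send(new ProducerRecord<>("latency-test", payload),
                              (metadata, exception) -> {
                                  if (exception != null) {
                                      exception.printStackTrace();
                                      return;
                                  }
                                  long micros = (System.nanoTime() - start) / 1_000;
                                  // Feed 'micros' into a histogram (e.g. HdrHistogram)
                                  // to extract percentiles like those in the table above.
                              });
            }
            producer.flush();
        }
    }
}
```

Note that `acks=all` only covers the producer side; the `min.insync.replicas=2` setting discussed further down the thread is a topic/broker-level config that determines how many of the 3 replicas must be in sync for such a write to succeed.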
----
2020-06-18 07:38:25 UTC - PLarboulette: @PLarboulette has joined the channel
----
2020-06-18 08:55:46 UTC - Pedro Cardoso: Indeed, it looks like that is the case; the blog post itself does not show a lot of data. I was looking for something more akin to <https://kafkaesque.io/performance-comparison-between-apache-pulsar-and-kafka-latency/>
----
2020-06-18 09:02:40 UTC - Pedro Cardoso:
1.) Network-wise, monitoring does not show saturation; we do not reach 10 Gbit.
2.) As of right now these are Kafka-only tests. My intent in posting the question here is to ask whether, in this chat's opinion, when comparing Pulsar vs Kafka you detected similar latency spikes in Kafka but not in Pulsar. Taking <https://kafkaesque.io/performance-comparison-between-apache-pulsar-and-kafka-latency/> as an example, I see similar behaviour for Kafka but not for Pulsar. However, there are still >200 ms maximum latencies. What causes those in Pulsar?
3.) For our use case we need durability; for Kafka we use `min.insync.replicas=2`. I believe Pulsar's default is equivalent?
----
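On the durability question above: Pulsar has no single `min.insync.replicas` knob; the closest analogue is the BookKeeper ensemble size / write quorum / ack quorum persistence policy, which in the standard broker configuration defaults to 2/2/2 (every entry written to 2 bookies, both must ack). Below is a minimal sketch of setting an explicit 3/3/2 policy with the Pulsar Java admin and client APIs, roughly comparable to Kafka's `replication.factor=3` with `min.insync.replicas=2`; the service URLs, tenant/namespace, and topic are hypothetical placeholders.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.common.policies.data.PersistencePolicies;

public class PulsarDurabilityExample {
    public static void main(String[] args) throws Exception {
        // Placeholder admin URL for a hypothetical 3-node cluster.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://broker-1:8080")
                .build()) {
            // Roughly analogous to Kafka's replication.factor=3 + min.insync.replicas=2:
            // write each entry to 3 bookies (ensemble 3, write quorum 3) and require
            // 2 acknowledgements before the write is confirmed (ack quorum 2).
            admin.namespaces().setPersistence("my-tenant/my-namespace",
                    new PersistencePolicies(3, 3, 2, 0.0));
        }

        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://broker-1:6650")
                .build();
             Producer<byte[]> producer = client.newProducer()
                .topic("persistent://my-tenant/my-namespace/latency-test")
                .create()) {
            // send() blocks until the broker reports the ack quorum has been met,
            // so the measured latency includes the replicated, durable write.
            producer.send(new byte[2 * 1024]);
        }
    }
}
```

Because the ack quorum (2) is smaller than the write quorum (3), a single slow bookie does not stall the producer, which is often cited as one reason Pulsar's tail latencies can behave differently from Kafka's in this kind of test.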