Hi,

After setting taskmanager.memory.managed.fraction to 0, I see that the JVM heap memory increased from 512 mb to 1 GB.
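For reference, the change amounts to something like the following in flink-conf.yaml (a minimal sketch; the 1568m process size is the value from the config shared further down this thread, and 0.4 is the default fraction in Flink 1.10):

    taskmanager.memory.process.size: 1568m
    # the default taskmanager.memory.managed.fraction of 0.4 reserved roughly 512 mb as managed memory
    taskmanager.memory.managed.fraction: 0.0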
Earlier I was getting a maximum of 4-6k records per second throughput on the Kafka source for an ingestion rate of 12k+/second. Now I see that improved to 11k per task (parallelism of 1) and 16.5k+ per second when run with a parallelism of 2 (8.25k per task). The maximum memory used during the run was 500 mb of heap space.

From this exercise, I understand that increasing JVM memory would directly increase throughput. Am I correct?

Our goal is to test for 100k ingestion per second and then try to calculate the cost for 1 million per second (hopefully it is a linear relation). I also saw the CPU utilisation peak at 50% during the same run.

1) Let me know what you think, as I will continue to test.
2) Is there a benchmark for the number of records handled per Kafka connector task for a particular JVM heap size?

Thanks,
Prasanna

On Fri 17 Jul, 2020, 06:18 Xintong Song, <tonysong...@gmail.com> wrote:

> *I had set Checkpoint to use the Job manager backend.*
>
> The jobmanager backend also runs in JVM heap space and does not use managed memory. Setting the managed memory fraction to 0 will give you a larger JVM heap, and thus less GC pressure.
>
> Thank you~
>
> Xintong Song
>
> On Thu, Jul 16, 2020 at 10:38 PM Prasanna kumar <prasannakumarram...@gmail.com> wrote:
>
>> Xintong Song,
>>
>>    - Which version of Flink is used? *1.10*
>>    - Which deployment mode is used? *Standalone*
>>    - Which cluster mode is used? *Job*
>>    - Do you mean you have a 4core16gb node for each task manager, and each task manager has 4 slots? *Yeah. There are 3 taskmanagers in total in the cluster. 2 TMs are t2.medium machines (2 core, 4 gb each), with 1 slot per core. 1 TM is a t2.large (4 core, 16 gb) with 4 slots. There were other jobs running on the t2.medium TMs; the t2.large machine is where the performance testing job was running.*
>>    - Sounds like you are running a streaming job without using any state. Have you tuned the managed memory fraction (`taskmanager.memory.managed.fraction`) to zero as suggested in the document[1]? *No, I have not set taskmanager.memory.managed.fraction to 0. I had set Checkpoint to use the Job manager backend.*
>>    - *The maximum CPU spike I spotted was 40%.*
>>
>> *In between, I did a latest test only on the t2.medium machine with 2 slots per core: 1 million records at a 10k/s ingestion rate, parallelism 1.*
>> *I added rebalance to the input stream, e.g.:*
>> inputStream.rebalance().map()
>> *I was able to get latency in the range of 130 ms - 2 sec.*
>>
>> Let me also know if there are more things to consider here.
>>
>> Thanks,
>> Prasanna.
>>
>> On Thu, Jul 16, 2020 at 4:04 PM Xintong Song <tonysong...@gmail.com> wrote:
>>
>>> Hi Prasanna,
>>>
>>> Trying to understand how Flink is deployed.
>>>
>>>    - Which version of Flink is used?
>>>    - Which deployment mode is used? (Standalone/Kubernetes/Yarn/Mesos)
>>>    - Which cluster mode is used? (Job/Session)
>>>    - Do you mean you have a 4core16gb node for each task manager, and each task manager has 4 slots?
>>>    - Sounds like you are running a streaming job without using any state. Have you tuned the managed memory fraction (`taskmanager.memory.managed.fraction`) to zero as suggested in the document[1]?
>>>
>>>> When running a stateless job or using a heap state backend (MemoryStateBackend or FsStateBackend), set managed memory to zero.
>>>
>>> I can see a few potential problems.
>>>
>>>    - Managed memory is probably not configured. That means a significant fraction of memory is unused.
>>>    - It sounds like CPU processing time is not the bottleneck. Thus increasing the parallelism will not give you better performance, but will on the other hand increase the overhead load on the task manager.
>>>
>>> I have also pulled in Becket Qin, who is the expert on Kafka connectors, since you have observed a lack of performance in reading from Kafka compared to Storm.
>>>
>>> Thank you~
>>>
>>> Xintong Song
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memory/mem_tuning.html#heap-state-backend
>>>
>>> On Thu, Jul 16, 2020 at 10:35 AM Prasanna kumar <prasannakumarram...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Sending this to you all separately as you answered one of my earlier queries.
>>>>
>>>> Thanks,
>>>> Prasanna.
>>>>
>>>> ---------- Forwarded message ---------
>>>> From: Prasanna kumar <prasannakumarram...@gmail.com>
>>>> Date: Wed 15 Jul, 2020, 23:27
>>>> Subject: Performance test Flink vs Storm
>>>> To: <d...@flink.apache.org>, user <user@flink.apache.org>
>>>>
>>>> Hi,
>>>>
>>>> We are testing Flink and Storm for our streaming pipelines on various features.
>>>>
>>>> In terms of latency, I see that Flink comes up short against Storm even if more CPU is given to it. I will explain in detail.
>>>>
>>>> *Machine*: t2.large (4 core, 16 gb), used for the Flink task manager and the Storm supervisor node.
>>>> *Kafka partitions*: 4
>>>> *Messages tested*: 1 million
>>>> *Load*: 50k/sec
>>>>
>>>> *Scenario*:
>>>> Read from Kafka -> Transform (map to a different JSON format) -> Write to a Kafka topic.
>>>>
>>>> *Test 1*
>>>> Storm: parallelism is set as 1. There are four processes: 1 spout (read from Kafka) and 3 bolts (transformation and sink).
>>>> Flink: operator-level parallelism not set. Task parallelism is set as 1. Task slot is 1 per core.
>>>>
>>>> Storm was 130 milliseconds faster for the 1st record.
>>>> Storm was 20 seconds faster for the 1 millionth record.
>>>>
>>>> *Test 2*
>>>> Storm: parallelism is set as 1. There are four processes: 1 spout (read from Kafka) and 3 bolts (transformation and sink).
>>>> Flink: operator-level parallelism not set. Task parallelism is set as 4. Task slot is 1 per core, so all cores are used.
>>>>
>>>> Storm was 180 milliseconds faster for the 1st record.
>>>> Storm was 25 seconds faster for the 1 millionth record.
>>>>
>>>> *Observations here*
>>>> 1) Increasing parallelism did not increase performance in Flink; rather, it became 50 ms to 5 s slower.
>>>> 2) Flink is slower in reading from Kafka compared to Storm. That is where the bulk of the latency is: for the millionth record it is 19-24 seconds slower.
>>>> 3) Once a message is read, Flink takes less time than Storm to transform and write to Kafka.
>>>>
>>>> *Other Flink config*
>>>> jobmanager.heap.size: 1024m
>>>> taskmanager.memory.process.size: 1568m
>>>>
>>>> *How do we improve the latency?*
>>>> *Why does latency become worse when parallelism is increased and matched to the partitions?*
>>>>
>>>> Thanks,
>>>> Prasanna.
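A minimal sketch of the pipeline described in the forwarded test (read from Kafka -> map to a different JSON format -> write to Kafka), assuming the Flink 1.10 DataStream API and the universal Kafka connector. The topic names, broker address, consumer group, and the JSON mapping itself are placeholders rather than values taken from this thread:

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

    public class KafkaTransformPerfTest {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Test 1 used a job parallelism of 1; Test 2 used 4 to match the Kafka partitions.
            env.setParallelism(1);

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");  // placeholder broker
            props.setProperty("group.id", "flink-vs-storm-perf-test"); // placeholder group id

            env.addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props))
               // the later latency test mentioned above added .rebalance() before the map
               .map(KafkaTransformPerfTest::toTargetJson)
               .addSink(new FlinkKafkaProducer<>("output-topic", new SimpleStringSchema(), props));

            env.execute("Kafka transform performance test");
        }

        // Placeholder for the "map to a different JSON format" transformation.
        private static String toTargetJson(String sourceJson) {
            return sourceJson;
        }
    }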