I've faced this issue before, and I'd hazard a guess that, as in my case, it
is related to the IOPS settings.

AWS's IO performance is highly variable: you get good performance in bursts,
at low volumes, and generally only at the start of long runs. Which makes
sense, unless you pay to reserve (provision) IOPS.

durable-queue does a fair amount of disk round-tripping for each 'task' (how
much depends on how often it fsyncs, too). Combine that with your
application's general logging (debug mode? info mode?) and it all adds up,
overwhelming the entire IO system.
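
If fsync frequency turns out to be the culprit, durable-queue lets you trade
durability for throughput when you create the queues. A minimal sketch (I'm
going from memory on the option names, so please double-check them against the
durable-queue README):

    (require '[durable-queue :as dq])

    ;; :fsync-put? / :fsync-take? control whether each put/take is fsync'd
    ;; to disk (I believe both default to true). Turning them off cuts the
    ;; per-task round trips, at the cost of durability if the process dies
    ;; mid-write.
    (def q (dq/queues "/tmp/queues"
                      {:fsync-put?  false
                       :fsync-take? false}))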

Our volumes were large (both in data and in number of instances). My
workaround was to create a RAM-based filesystem and keep the durable-queue on
it. That was fine for our use-case because, if the machine were killed, we
didn't care about "durability" across instance reboots as such. With that
change, our durable-queue performance issues vanished! As an aside, reducing
the log levels also improved app performance noticeably on AWS, for the
default-config instances.
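
Concretely, the workaround looked something like the sketch below (the mount
point and size are illustrative, not the values we actually used):

    ;; On the instance (shell), create a RAM-backed filesystem:
    ;;   sudo mkdir -p /mnt/ramdisk
    ;;   sudo mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk

    ;; Then point durable-queue at it instead of an EBS-backed path like /tmp:
    (require '[durable-queue :as dq])
    (def q (dq/queues "/mnt/ramdisk" {}))

Anything on the tmpfs disappears on reboot, which is exactly the durability
trade-off described above.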

Memory consumption, though an indicator of problems, isn't necessarily the
direct cause. Backpressure from the IO backlog on the entire JVM runtime is
another possible explanation. At least in my case, the slowdown went away once
the durable-queue was moved to the RAM-based filesystem.

Good luck!

-- 
jaju

On Fri, Oct 13, 2017 at 10:05 AM, <lawrence.krub...@gmail.com> wrote:

> Following Daniel Compton's suggestion, I turned on logging for GC. I don't
> see it happening more often, but the slowdown does seem related to the
> moment when the app hits the maximum memory allowed. It had been running
> with 4G, so I increased that to 7G; it now goes longer before it hits 98%
> memory usage, but it does hit it eventually, and then everything slows to a
> crawl. Not sure how much memory I would have to use to avoid using up
> almost all of the memory. I suppose I'll figure that out via trial and
> error. Until I can figure that out, nearly all other performance tricks
> seem a bit beside the point.
>
>
>
> On Thursday, October 12, 2017 at 9:01:23 PM UTC-4, Nathan Fisher wrote:
>>
>> Hi!
>>
>> Can you change one of the variables? Specifically can you replicate this
>> on your local machine? If it happens locally then I would focus on
>> something in the JVM eco-system.
>>
>> If you can't replicate it locally then it's possibly AWS specific. It
>> sounds like you're using a t2.large or m4.xlarge. If it's the former, you
>> may well be contending with your network bandwidth. EC2's host drive (EBS)
>> is a networked drive, so bandwidth is split between your standard network
>> traffic and the drive volume. If that's the issue then you might
>> need to look at provisioned IOPs. A quick(ish) way to test that hypothesis
>> is to provision a host with high networking performance and provisioned
>> IOPs.
>>
>> Cheers,
>> Nathan
>>
>> On Fri, 13 Oct 2017 at 00:05 <lawrence...@gmail.com> wrote:
>>
>>> Daniel Compton, good suggestion. I've increased the memory to see if I
>>> can postpone the GCs, and I'll log that more carefully.
>>>
>>>
>>> On Wednesday, October 11, 2017 at 8:35:44 PM UTC-4, Daniel Compton wrote:
>>>
>>>> Without more information it's hard to tell, but this looks like it
>>>> could be a garbage collection issue. Can you run your test again and add
>>>> some logging/monitoring to show each garbage collection? If my hunch is
>>>> right, you'll see garbage collections getting more and more frequent until
>>>> they take up nearly all the CPU time, preventing much forward progress
>>>> writing to the queue.
>>>>
>>>> If it's AWS based throttling, then CloudWatch monitoring
>>>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-volume-status.html#using_cloudwatch_ebs might show you
>>>> some hints. You could also test with an NVMe drive attached, just to see if
>>>> disk bandwidth is the issue.
>>>>
>>>> On Thu, Oct 12, 2017 at 11:58 AM Justin Smith <noise...@gmail.com>
>>>> wrote:
>>>>
>>>>> a small thing here, if memory usage is important you should be building
>>>>> and running an uberjar instead of using lein on the server (this also has
>>>>> other benefits), and if you are doing that your project.clj jvm-opts are
>>>>> not used, you have to configure your java command line in aws instead
>>>>>
>>>>> On Wed, Oct 11, 2017 at 3:52 PM <lawrence...@gmail.com> wrote:
>>>>>
>>>>>> I can't figure out if this is a Clojure question or an AWS question.
>>>>>> And if it is a Clojure question, I can't figure out if it is more of a
>>>>>> general JVM question, or if it is specific to some library such as
>>>>>> durable-queue. I can redirect my question elsewhere, if people think this
>>>>>> is an AWS question.
>>>>>>
>>>>>> In my project.clj, I try to give my app a lot of memory:
>>>>>>
>>>>>>   :jvm-opts ["-Xms7g" "-Xmx7g" "-XX:-UseCompressedOops"])
>>>>>>
>>>>>> And the app starts off pulling data from MySQL and writing it to
>>>>>> Durable-Queue at a rapid rate. ( https://github.com/Factual/durable-queue )
>>>>>>
>>>>>> I have some logging set up to report every 30 seconds.
>>>>>>
>>>>>> :enqueued 370137,
>>>>>>
>>>>>> 30 seconds later:
>>>>>>
>>>>>> :enqueued 608967,
>>>>>>
>>>>>> 30 seconds later:
>>>>>>
>>>>>> :enqueued 828950,
>>>>>>
>>>>>> It's a dramatic slowdown. The app is initially writing to the queue at
>>>>>> faster than 10,000 documents a second, but it slows steadily, and after
>>>>>> 10 minutes it writes less than 1,000 documents per second. Since I have
>>>>>> to write a few million documents, 10,000 a second is the slowest speed I
>>>>>> can live with.
>>>>>>
>>>>>> The queues are in the /tmp folder of an AWS instance that has plenty
>>>>>> of disk space, 4 CPUs, and 16 gigs of RAM.
>>>>>>
>>>>>> Why does the app slow down so much? I had 4 thoughts:
>>>>>>
>>>>>> 1.) the app struggles as it hits a memory limit
>>>>>>
>>>>>> 2.) memory bandwidth is the problem
>>>>>>
>>>>>> 3.) AWS is enforcing some weird IOPS limit
>>>>>>
>>>>>> 4.) durable-queue is misbehaving
>>>>>>
>>>>>> As to possibility #1, I notice the app starts like this:
>>>>>>
>>>>>> Memory in use (percentage/used/max-heap): (\"66%\" \"2373M\"
>>>>>> \"3568M\")
>>>>>>
>>>>>> but 60 seconds later I see:
>>>>>>
>>>>>> Memory in use (percentage/used/max-heap): (\"94%\" \"3613M\"
>>>>>> \"3819M\")
>>>>>>
>>>>>> So I've run out of allowed memory. But why is that? I thought I gave
>>>>>> this app 7 gigs:
>>>>>>
>>>>>>   :jvm-opts ["-Xms7g" "-Xmx7g" "-XX:-UseCompressedOops"])
>>>>>>
>>>>>> As to possibility #2, I found this old post on the Clojure mailing list:
>>>>>>
>>>>>> Andy Fingerhut wrote, "one thing I've found in the past on a 2-core
>>>>>> machine that was achieving much less than 2x speedup was memory bandwidth
>>>>>> being the limiting factor."
>>>>>>
>>>>>> https://groups.google.com/forum/#!searchin/clojure/xmx$20xms$20maximum%7Csort:relevance/clojure/48W2eff3caU/HS6u547gtrAJ
>>>>>>
>>>>>> But I am not sure how to test this.
>>>>>>
>>>>>> As to possibility #3, I'm not sure how AWS enforces its IOPS limits.
>>>>>> If people think this is the most likely possibility, then I will repost
>>>>>> my question in an AWS forum.
>>>>>>
>>>>>> As to possibility #4, durable-queue is well-tested and used in a lot
>>>>>> of projects, and Zach Tellman is smart and makes few mistakes, so I'm
>>>>>> doubtful that it is to blame, but I do notice that it starts off with 4
>>>>>> active slabs and then after 120 seconds, it is only using 1 slab. Is that
>>>>>> expected? If people think this is a possible problem, then I'll ask
>>>>>> somewhere specific to durable-queue.
>>>>>>
>>>>>> Overall, my log information looks like this:
>>>>>>
>>>>>>     ("\nStats about from-mysql-to-tables-queue: " {"message"
>>>>>> {:num-slabs 3, :num-active-slabs 2, :enqueued 370137, :retried 0,
>>>>>> :completed 369934, :in-progress 10}})
>>>>>>
>>>>>>     ("\nResource usage: " "Memory in use (percentage/used/max-heap):
>>>>>> (\"66%\" \"2373M\" \"3568M\")\n\nCPU usage (how-many-cpu's/load-average):
>>>>>>  [4 5.05]\n\nFree memory in jvm: [1171310752]")
>>>>>>
>>>>>> 30 seconds later
>>>>>>
>>>>>>     ("\nStats about from-mysql-to-tables-queue: " {"message"
>>>>>> {:num-slabs 4, :num-active-slabs 4, :enqueued 608967, :retried 0,
>>>>>> :completed 608511, :in-progress 10}})
>>>>>>
>>>>>>     ("\nResource usage: " "Memory in use (percentage/used/max-heap):
>>>>>> (\"76%\" \"2752M\" \"3611M\")\n\nCPU usage (how-many-cpu's/load-average):
>>>>>>  [4 5.87]\n\nFree memory in jvm: [901122456]")
>>>>>>
>>>>>> 30 seconds later
>>>>>>
>>>>>>     ("\nStats about from-mysql-to-tables-queue: " {"message"
>>>>>> {:num-slabs 4, :num-active-slabs 3, :enqueued 828950, :retried 0,
>>>>>> :completed 828470, :in-progress 10}})
>>>>>>
>>>>>>     ("\nResource usage: " "Memory in use (percentage/used/max-heap):
>>>>>> (\"94%\" \"3613M\" \"3819M\")\n\nCPU usage (how-many-cpu's/load-average):
>>>>>>  [4 6.5]\n\nFree memory in jvm: [216459664]")
>>>>>>
>>>>>> 30 seconds later
>>>>>>
>>>>>>     ("\nStats about from-mysql-to-tables-queue: " {"message"
>>>>>> {:num-slabs 1, :num-active-slabs 1, :enqueued 1051974, :retried 0,
>>>>>> :completed 1051974, :in-progress 0}})
>>>>>>
>>>>>>
>> --
>> - sent from my mobile
>>

