The biggest problem with pread was over-reading (fetching 64KB where 4KB would suffice), which was significantly improved in 2.2, IIRC. I don't think the penalty is very significant anymore, and if you are experiencing time-to-safepoint issues it's very likely a worthwhile switch to flip.
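
If you want to see what flipping that switch amounts to, here is a minimal sketch of a 'standard' (pread-style) read from Java - the file path and offset are made up:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // A positional read (pread(2) under the hood) fetches only the bytes
    // asked for, rather than faulting pages in through a mapping.
    public class PreadSketch {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(
                    Paths.get("/tmp/example-Data.db"), StandardOpenOption.READ)) {
                ByteBuffer buf = ByteBuffer.allocateDirect(4096); // just the 4KB we need
                int n = ch.read(buf, 8L * 4096);                  // read at an offset
                System.out.println("read " + n + " bytes");
            }
        }
    }

The property that matters for this thread: the slow part happens inside a syscall, and HotSpot treats threads in syscalls as already safepoint-safe.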
On Sunday, 9 October 2016, Graham Sanderson <gra...@vast.com> wrote:

> I was using the term “touch” loosely to hopefully mean pre-fetch, though I
> suspect you can still do a sensible prefetch instruction in native code (I
> think Intel has been de-emphasizing it). Even if not, you are still better
> off blocking in JNI code - I haven’t looked at the link to see if the
> correct barriers are enforced by the sun.misc.Unsafe method.
>
> I do suspect that you’ll see up to about 5-10% syscall overhead if you hit
> pread.
>
> > On Oct 8, 2016, at 11:02 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
> >
> > Hi,
> >
> > This is starting to get into dev list territory.
> >
> > Interesting idea to touch every 4K page you are going to read.
> >
> > You could use this to minimize the cost:
> > http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
> >
> > Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
> > without prefetching though.
> >
> > There is a system call to page the memory in, which might be better for
> > larger reads. Still no guarantee things stay cached though.
> >
> > Ariel
> >
> > On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
> >> I haven’t studied the read path that carefully, but there might be a
> >> spot at the C* level rather than the JVM level where you could
> >> effectively do a JNI touch of the mmap region you’re going to need
> >> next.
> >>
> >>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <gra...@vast.com> wrote:
> >>>
> >>> We don’t use Azul’s Zing, but it does have the nice feature that all
> >>> threads don’t have to reach safepoints at the same time. That said,
> >>> we make heavy use of Cassandra (with off-heap memtables - not
> >>> directly related, but it allows us a lot more GC headroom) and SOLR,
> >>> where we switched to mmap because it FAR outperformed the pread
> >>> variants - in no case have we noticed long time to safepoint (then
> >>> again, our IO is lightning fast).
> >>>
> >>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
> >>>>
> >>>> Linux automatically uses free memory as cache. It's not swap.
> >>>>
> >>>> http://www.tldp.org/LDP/lki/lki-4.html
> >>>>
> >>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin <vla...@winguzone.com> wrote:
> >>>>> Sorry, I don't catch something. What page (memory) cache can exist
> >>>>> if there is no swap file? Where are those pages written/read?
> >>>>>
> >>>>> Best regards, Vladimir Yudovin,
> >>>>> Winguzone [https://winguzone.com/?from=list] - Hosted Cloud
> >>>>> Cassandra on Azure and SoftLayer. Launch your cluster in minutes.
> >>>>>
> >>>>> ---- On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg <ar...@weisberg.ws> wrote ----
> >>>>>> Hi,
> >>>>>>
> >>>>>> Nope, I mean page cache. Linux doesn't call the cache it maintains
> >>>>>> using free memory a file cache. It uses free (and some of the time
> >>>>>> not so free!) memory to buffer writes and to cache recently
> >>>>>> written/read data.
> >>>>>>
> >>>>>> http://www.tldp.org/LDP/lki/lki-4.html
> >>>>>>
> >>>>>> When Linux decides it needs free memory it can either evict stuff
> >>>>>> from the page cache, flush dirty pages and then evict, or swap
> >>>>>> anonymous memory out. When you disable swap you only disable the
> >>>>>> last behavior.
> >>>>>>
> >>>>>> Maybe we are talking at cross purposes? What I meant is that
> >>>>>> increasing the heap size to reduce GC frequency is a legitimate
> >>>>>> thing to do, and it does have an impact on the performance of the
> >>>>>> page cache even if you have swap disabled.
> >>>>>>
> >>>>>> Ariel
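
On the touch-every-4K-page idea: in pure Java it would look something like this (path hypothetical; assumes the file fits in a single mapping). In OpenJDK, MappedByteBuffer.load() is the closest thing to the "system call to page the memory in" - it issues madvise(MADV_WILLNEED) and then touches each page:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class TouchPages {
        private static final int PAGE = 4096;

        // Touch every 4K page up front so the page faults happen here,
        // not at some arbitrary point mid-read.
        static long touch(MappedByteBuffer region) {
            long sink = 0;
            for (int pos = 0; pos < region.limit(); pos += PAGE) {
                sink += region.get(pos);
            }
            return sink; // returned so the loop isn't optimized away
        }

        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(
                    Paths.get("/tmp/example-Data.db"), StandardOpenOption.READ)) {
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                map.load();                     // OS-assisted: madvise + touch
                System.out.println(touch(map)); // or do it page by page ourselves
            }
        }
    }

The catch, as Graham says, is that these touches still execute as JITed Java code, so a fault taken in this loop can still hold up time to safepoint; it moves the faults earlier, but only a touch done on the native side of a JNI call takes them off the critical path.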
> >>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
> >>>>>>>> Page cache is data pending flush to disk and data cached from
> >>>>>>>> disk.
> >>>>>>>
> >>>>>>> Do you mean file cache?
> >>>>>>>
> >>>>>>> Best regards, Vladimir Yudovin,
> >>>>>>> Winguzone [https://winguzone.com/?from=list] - Hosted Cloud
> >>>>>>> Cassandra on Azure and SoftLayer. Launch your cluster in minutes.
> >>>>>>>
> >>>>>>> ---- On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg <ar...@weisberg.ws> wrote ----
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Page cache is in use even if you disable swap. Swap is anonymous
> >>>>>>>> memory, and whatever else the Linux kernel supports paging out.
> >>>>>>>> Page cache is data pending flush to disk and data cached from
> >>>>>>>> disk.
> >>>>>>>>
> >>>>>>>> Given how bad the GC pauses are in C*, I don't think it's the
> >>>>>>>> high pole in the tent, at least until key things are off heap
> >>>>>>>> and C* can run with CMS and get 10 millisecond GCs all day long.
> >>>>>>>>
> >>>>>>>> You can go through tuning and hardware selection to try to get
> >>>>>>>> more consistent IO pauses and remove outliers, as you mention,
> >>>>>>>> and as a user I think this is your best bet. Generally it's
> >>>>>>>> either bad device or filesystem behavior if you get page faults
> >>>>>>>> taking more than 200 milliseconds, O(G1 GC collection).
> >>>>>>>>
> >>>>>>>> I think a JVM change to allow safepoints around memory-mapped
> >>>>>>>> file access is really unlikely, although I agree it would be
> >>>>>>>> great. I think the best hack around it is to code up your
> >>>>>>>> memory-mapped file access as JNI methods and find some way to
> >>>>>>>> get that to work. Right now, if you want to create a safepoint,
> >>>>>>>> a JNI method is the way to do it. The problem is that JNI
> >>>>>>>> methods and POJOs don't get along well.
> >>>>>>>>
> >>>>>>>> If you think about it, the reason non-memory-mapped IO works
> >>>>>>>> well is that it's all JNI methods, so they don't impact time to
> >>>>>>>> safepoint. I think there is a tradeoff between tolerance for
> >>>>>>>> outliers and performance.
> >>>>>>>>
> >>>>>>>> I don't know the state of the non-memory-mapped path and how
> >>>>>>>> reliable that is. If it were reliable and I couldn't tolerate
> >>>>>>>> the outliers, I would use that. I have to ask though, why are
> >>>>>>>> you not able to tolerate the outliers? If you are reading and
> >>>>>>>> writing at quorum, how is this impacting you?
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Ariel
> >>>>>>>>
> >>>>>>>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
> >>>>>>>>> Hi Josh,
> >>>>>>>>>
> >>>>>>>>>> Running with increased heap size would reduce GC frequency,
> >>>>>>>>>> at the cost of page cache.
> >>>>>>>>>
> >>>>>>>>> Actually it's recommended to run C* with swap disabled, so if
> >>>>>>>>> there is not enough memory the JVM fails instead of blocking.
> >>>>>>>>>
> >>>>>>>>> Best regards, Vladimir Yudovin,
> >>>>>>>>> Winguzone [https://winguzone.com/?from=list] - Hosted Cloud
> >>>>>>>>> Cassandra on Azure and SoftLayer. Launch your cluster in
> >>>>>>>>> minutes.
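
To make the JNI hack Ariel describes concrete, the Java side would be little more than this stub - readAt and libnativereader are hypothetical names, and the C implementation is omitted:

    // Route the mmap access through a native method so the VM counts the
    // thread as safepoint-safe while it blocks in native code.
    public class NativeReader {
        static {
            System.loadLibrary("nativereader"); // hypothetical JNI library
        }

        // HotSpot does not wait for threads executing native methods when
        // bringing the VM to a safepoint; the safepoint check happens on
        // the way back into Java.
        public static native int readAt(long mappedAddress, byte[] dst, int len);
    }

The awkwardness with POJOs is visible even here: everything the native side needs has to cross the boundary as primitives or arrays.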
> >>>>>>>>> ---- On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder <j...@code406.com> wrote ----
> >>>>>>>>>> Hello cassandra-users,
> >>>>>>>>>>
> >>>>>>>>>> I'm investigating an issue with JVMs taking a while to reach a
> >>>>>>>>>> safepoint. I'd like the list's input on confirming my
> >>>>>>>>>> hypothesis and finding mitigations.
> >>>>>>>>>>
> >>>>>>>>>> My hypothesis is that slow block devices are causing
> >>>>>>>>>> Cassandra's JVM to pause completely while attempting to reach
> >>>>>>>>>> a safepoint.
> >>>>>>>>>>
> >>>>>>>>>> Background:
> >>>>>>>>>>
> >>>>>>>>>> Hotspot occasionally performs maintenance tasks that
> >>>>>>>>>> necessitate stopping all of its threads. Threads running JITed
> >>>>>>>>>> code occasionally read from a given safepoint page. If Hotspot
> >>>>>>>>>> has initiated a safepoint, reading from that page essentially
> >>>>>>>>>> catapults the thread into purgatory until the safepoint
> >>>>>>>>>> completes (the mechanism behind this is pretty cool). Threads
> >>>>>>>>>> performing syscalls or executing native code do this check
> >>>>>>>>>> upon their return into the JVM.
> >>>>>>>>>>
> >>>>>>>>>> In this way, during the safepoint Hotspot can be sure that all
> >>>>>>>>>> of its threads are either patiently waiting for safepoint
> >>>>>>>>>> completion or in a system call.
> >>>>>>>>>>
> >>>>>>>>>> Cassandra makes heavy use of mmapped reads in normal
> >>>>>>>>>> operation. When doing mmapped reads, the JVM executes
> >>>>>>>>>> userspace code to effect a read from a file. On the fast path
> >>>>>>>>>> (when the page needed is already mapped into the process),
> >>>>>>>>>> this instruction is very fast. When the page is not cached,
> >>>>>>>>>> the CPU triggers a page fault and asks the OS to go fetch the
> >>>>>>>>>> page. The JVM doesn't even realize that anything interesting
> >>>>>>>>>> is happening: to it, the thread is just executing a mov
> >>>>>>>>>> instruction that happens to take a while.
> >>>>>>>>>>
> >>>>>>>>>> The OS, meanwhile, puts the thread in question in the D state
> >>>>>>>>>> (assuming Linux, here) and goes off to find the desired page.
> >>>>>>>>>> This may take microseconds, this may take milliseconds, or it
> >>>>>>>>>> may take seconds (or longer). When I/O occurs while the JVM is
> >>>>>>>>>> trying to enter a safepoint, every thread has to wait for the
> >>>>>>>>>> laggard I/O to complete.
> >>>>>>>>>>
> >>>>>>>>>> If you log safepoints with the right options [1], you can see
> >>>>>>>>>> these occurrences in the JVM output:
> >>>>>>>>>>
> >>>>>>>>>>> # SafepointSynchronize::begin: Timeout detected:
> >>>>>>>>>>> # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
> >>>>>>>>>>> # SafepointSynchronize::begin: Threads which did not reach the safepoint:
> >>>>>>>>>>> # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0 tid=0x00007f8785bb1f30 nid=0x4e14 runnable [0x0000000000000000]
> >>>>>>>>>>>    java.lang.Thread.State: RUNNABLE
> >>>>>>>>>>> # SafepointSynchronize::begin: (End of list)
> >>>>>>>>>>>          vmop                  [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
> >>>>>>>>>>> 58099.941: G1IncCollectionPause  [ 447  1  1 ]  [ 3304  0  3305  1  190 ]  1
> >>>>>>>>>>
> >>>>>>>>>> If that safepoint happens to be a garbage collection (which
> >>>>>>>>>> this one was), you can also see it in GC logs:
> >>>>>>>>>>
> >>>>>>>>>>> 2016-10-07T13:19:50.029+0000: 58103.440: Total time for which application threads were stopped: 3.4971808 seconds, Stopping threads took: 3.3050644 seconds
> >>>>>>>>>>
> >>>>>>>>>> In this way, JVM safepoints become a powerful weapon for
> >>>>>>>>>> transmuting a single thread's slow I/O into the entire JVM's
> >>>>>>>>>> lockup.
> >>>>>>>>>>
> >>>>>>>>>> Does all of the above sound correct?
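
That all matches my understanding. A toy harness makes it easy to watch (hypothetical, and assumes the file is at least a few pages long): map a file much larger than RAM, read random pages in a loop, and run with the safepoint logging options from [1]; cold reads then surface as long "Stopping threads took" lines.

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.Random;

    public class SafepointStall {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(
                    Paths.get(args[0]), StandardOpenOption.READ)) {
                long size = Math.min(ch.size(), Integer.MAX_VALUE);
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
                Random rnd = new Random();
                long sink = 0;
                for (;;) {
                    int page = rnd.nextInt(map.limit() / 4096);
                    // To the JVM this is just a load that takes a while; to the
                    // kernel it is a thread in the D state doing I/O.
                    sink += map.get(page * 4096);
                    if (sink == Long.MIN_VALUE) return; // keep 'sink' live
                }
            }
        }
    }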
> >>>>>>>>>>
> >>>>>>>>>> Mitigations:
> >>>>>>>>>>
> >>>>>>>>>> 1) don't tolerate block devices that are slow
> >>>>>>>>>>
> >>>>>>>>>> This is easy in theory, and only somewhat difficult in
> >>>>>>>>>> practice. Tools like perf and iosnoop [2] can do pretty good
> >>>>>>>>>> jobs of letting you know when a block device is slow.
> >>>>>>>>>>
> >>>>>>>>>> It is sad, though, because this makes running Cassandra on
> >>>>>>>>>> mixed hardware (e.g. fast SSD and slow disks in a JBOD) quite
> >>>>>>>>>> unappetizing.
> >>>>>>>>>>
> >>>>>>>>>> 2) have fewer safepoints
> >>>>>>>>>>
> >>>>>>>>>> Two of the biggest sources of safepoints are garbage
> >>>>>>>>>> collection and revocation of biased locks. Evidence points
> >>>>>>>>>> toward biased locking being unhelpful for Cassandra's
> >>>>>>>>>> purposes, so turning it off (-XX:-UseBiasedLocking) is a quick
> >>>>>>>>>> way to eliminate one source of safepoints.
> >>>>>>>>>>
> >>>>>>>>>> Garbage collection, on the other hand, is unavoidable. Running
> >>>>>>>>>> with increased heap size would reduce GC frequency, at the
> >>>>>>>>>> cost of page cache. But sacrificing page cache would increase
> >>>>>>>>>> page fault frequency, which is another thing we're trying to
> >>>>>>>>>> avoid! I don't view this as a serious option.
> >>>>>>>>>>
> >>>>>>>>>> 3) use a different IO strategy
> >>>>>>>>>>
> >>>>>>>>>> Looking at the Cassandra source code, there appears to be an
> >>>>>>>>>> un(der)documented configuration parameter called
> >>>>>>>>>> disk_access_mode. It appears that changing this to 'standard'
> >>>>>>>>>> would switch to using pread() and pwrite() for I/O, instead of
> >>>>>>>>>> mmap. I imagine there would be a throughput penalty here for
> >>>>>>>>>> the case when pages are in the disk cache.
> >>>>>>>>>>
> >>>>>>>>>> Is this a serious option? It seems far too underdocumented to
> >>>>>>>>>> be thought of as a contender.
> >>>>>>>>>>
> >>>>>>>>>> 4) modify the JVM
> >>>>>>>>>>
> >>>>>>>>>> This is a longer-term option. For the purposes of safepoints,
> >>>>>>>>>> perhaps the JVM could treat reads from an mmapped file in the
> >>>>>>>>>> same way it treats threads that are running JNI code. That is,
> >>>>>>>>>> the safepoint would proceed even though the reading thread has
> >>>>>>>>>> not "joined in". Upon finishing its mmapped read, the reading
> >>>>>>>>>> thread would test the safepoint page (check whether a
> >>>>>>>>>> safepoint is in progress, in other words).
> >>>>>>>>>>
> >>>>>>>>>> Conclusion:
> >>>>>>>>>>
> >>>>>>>>>> I don't imagine there's an easy solution here. I plan to go
> >>>>>>>>>> ahead with mitigation #1: "don't tolerate block devices that
> >>>>>>>>>> are slow", but I'd appreciate any approach that doesn't
> >>>>>>>>>> require my hardware to be flawless all the time.
> >>>>>>>>>>
> >>>>>>>>>> Josh
> >>>>>>>>>>
> >>>>>>>>>> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100
> >>>>>>>>>>     -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
> >>>>>>>>>> [2] https://github.com/brendangregg/perf-tools/blob/master/iosnoop
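
For anyone tempted by mitigation #3: the knob goes in cassandra.yaml, where it is absent by default. As far as I know the accepted values are auto, mmap, mmap_index_only, and standard - treat the line below as an experiment to try, not a recommendation:

    # cassandra.yaml - not present in the shipped file; the default is
    # 'auto' (mmap everything on 64-bit JVMs). 'standard' switches data
    # file reads to pread()-style I/O.
    disk_access_mode: standard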