I haven’t studied the read path that carefully, but there might be a spot at
the C* level rather than the JVM level where you could effectively do a JNI
touch of the mmap region you’re going to need next.

> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <gra...@vast.com> wrote:
> 
> We don’t use Azul’s Zing, but it does have the nice feature that all threads
> don’t have to reach safepoints at the same time. That said, we make heavy use
> of Cassandra (with off-heap memtables - not directly related, but it allows
> us a lot more GC headroom) and SOLR, where we switched to mmap because it FAR
> outperformed the pread variants - in no case have we noticed long times to
> safepoint (then again, our IO is lightning fast).
> 
>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>> 
>> Linux automatically uses free memory as cache.  It's not swap.
>> 
>> http://www.tldp.org/LDP/lki/lki-4.html
>> 
>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin <vla...@winguzone.com> wrote:
>> Sorry, I'm not catching something. What page (memory) cache can exist if
>> there is no swap file? Where are those pages written/read?
>> 
>> 
>> Best regards, Vladimir Yudovin, 
>> Winguzone <https://winguzone.com/?from=list> - Hosted Cloud Cassandra on 
>> Azure and SoftLayer.
>> Launch your cluster in minutes.
>> 
>> 
>> 
>> ---- On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg <ar...@weisberg.ws> wrote ----
>> Hi,
>> 
>> Nope, I mean page cache. Linux doesn't call the cache it maintains using
>> free memory a file cache. It uses free (and some of the time not so free!)
>> memory to buffer writes and to cache recently written/read data.
>> 
>> http://www.tldp.org/LDP/lki/lki-4.html
>> 
>> When Linux decides it needs free memory, it can either evict stuff from the
>> page cache, flush dirty pages and then evict, or swap anonymous memory out.
>> When you disable swap, you only disable the last behavior.
>> 
>> Maybe we are talking at cross purposes? What I meant is that increasing the
>> heap size to reduce GC frequency is a legitimate thing to do, and it does
>> have an impact on the performance of the page cache even if you have swap
>> disabled.
>> 
>> Ariel
>> 
>> 
>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>> >Page cache is data pending flush to disk and data cached from disk.
>> 
>> Do you mean file cache?
>> 
>> 
>> Best regards, Vladimir Yudovin, 
>> Winguzone <https://winguzone.com/?from=list> - Hosted Cloud Cassandra on 
>> Azure and SoftLayer.
>> Launch your cluster in minutes.
>> 
>> 
>> ---- On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg <ar...@weisberg.ws> wrote ----
>> Hi,
>> 
>> Page cache is in use even if you disable swap. Swap is anonymous memory, and 
>> whatever else the Linux kernel supports paging out. Page cache is data 
>> pending flush to disk and data cached from disk.
>> 
>> Given how bad the GC pauses are in C*, I don't think this is the high pole
>> in the tent, at least not until key things are off heap and C* can run with
>> CMS and get 10 millisecond GCs all day long.
>> 
>> You can go through tuning and hardware selection to try to get more
>> consistent IO pauses and remove outliers, as you mention, and as a user I
>> think this is your best bet. Generally it's either bad device or filesystem
>> behavior if you get page faults taking more than 200 milliseconds, i.e. on
>> the order of a G1 GC collection.
>> 
>> I think a JVM change to allow safe points around memory mapped file access
>> is really unlikely, although I agree it would be great. I think the best
>> hack around it is to code up your memory mapped file access into JNI
>> methods and find some way to get that to work. Right now, if you want to
>> create a safe point, a JNI method is the way to do it. The problem is that
>> JNI methods and POJOs don't get along well.
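>> 
>> To make that concrete, the hack would look something like this on the Java
>> side (hypothetical names; the native body would do the mmapped read in C):
>> 
>>     public final class NativeMmapReader {
>>         static { System.loadLibrary("nativemmapreader"); } // hypothetical
>> 
>>         // Copies 'len' bytes from a raw mapped address into 'dst'. Because
>>         // this is a JNI method, a page fault inside it leaves the thread in
>>         // native state, and safepoints can proceed without waiting for it.
>>         public static native void read(long addr, byte[] dst, int len);
>>     }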
>> 
>> If you think about it, the reason non-memory-mapped IO works well is that
>> it's all JNI methods, so they don't impact time to safe point. I think
>> there is a tradeoff between tolerance for outliers and performance.
>> 
>> I don't know the state of the non-memory mapped path and how reliable that 
>> is. If it were reliable and I couldn't tolerate the outliers I would use 
>> that. I have to ask though, why are you not able to tolerate the outliers? 
>> If you are reading and writing at quorum how is this impacting you?
>> 
>> Regards,
>> Ariel
>> 
>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>> Hi Josh,
>> 
>> >Running with increased heap size would reduce GC frequency, at the cost
>> >of page cache.
>> 
>> Actually it's recommended to run C* without virtual memory (i.e. swap)
>> enabled, so if there is not enough memory the JVM fails instead of blocking.
>> 
>> Best regards, Vladimir Yudovin, 
>> Winguzone <https://winguzone.com/?from=list> - Hosted Cloud Cassandra on 
>> Azure and SoftLayer.
>> Launch your cluster in minutes.
>> 
>> 
>> ---- On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder <j...@code406.com> wrote ----
>> Hello cassandra-users, 
>> 
>> I'm investigating an issue with JVMs taking a while to reach a safepoint.
>> I'd like the list's input on confirming my hypothesis and finding
>> mitigations.
>> 
>> My hypothesis is that slow block devices are causing Cassandra's JVM to
>> pause completely while attempting to reach a safepoint.
>> 
>> Background: 
>> 
>> Hotspot occasionally performs maintenance tasks that necessitate stopping
>> all of its threads. Threads running JITed code occasionally read from a
>> given safepoint page. If Hotspot has initiated a safepoint, reading from
>> that page essentially catapults the thread into purgatory until the
>> safepoint completes (the mechanism behind this is pretty cool). Threads
>> performing syscalls or executing native code do this check upon their
>> return into the JVM.
>> 
>> In this way, during the safepoint Hotspot can be sure that all of its
>> threads are either patiently waiting for safepoint completion or in a
>> system call.
>> 
>> Cassandra makes heavy use of mmapped reads in normal operation. When doing
>> mmapped reads, the JVM executes userspace code to effect a read from a
>> file. On the fast path (when the page needed is already mapped into the
>> process), this instruction is very fast. When the page is not cached, the
>> CPU triggers a page fault and asks the OS to go fetch the page. The JVM
>> doesn't even realize that anything interesting is happening: to it, the
>> thread is just executing a mov instruction that happens to take a while.
>> 
>> The OS, meanwhile, puts the thread in question in the D state (assuming
>> Linux, here) and goes off to find the desired page. This may take
>> microseconds, this may take milliseconds, or it may take seconds (or
>> longer). When I/O occurs while the JVM is trying to enter a safepoint,
>> every thread has to wait for the laggard I/O to complete.
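>> 
>> Concretely, the read path described above looks something like this minimal
>> sketch (the file name is just a placeholder):
>> 
>>     import java.io.IOException;
>>     import java.nio.MappedByteBuffer;
>>     import java.nio.channels.FileChannel;
>>     import java.nio.file.Paths;
>>     import java.nio.file.StandardOpenOption;
>> 
>>     public class MmapRead {
>>         public static void main(String[] args) throws IOException {
>>             try (FileChannel ch = FileChannel.open(Paths.get("Data.db"),
>>                                                    StandardOpenOption.READ)) {
>>                 // Map (up to) the first megabyte of the file.
>>                 MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY,
>>                                               0, Math.min(ch.size(), 1 << 20));
>>                 // If the first page is resident, this get() is a plain memory
>>                 // load. If it is not, the same instruction takes a page fault
>>                 // and the thread sits in the D state while the JVM still
>>                 // counts it as running Java code.
>>                 System.out.println(buf.get(0));
>>             }
>>         }
>>     }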
>> 
>> If you log safepoints with the right options [1], you can see these
>> occurrences in the JVM output:
>> 
>> > # SafepointSynchronize::begin: Timeout detected:
>> > # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
>> > # SafepointSynchronize::begin: Threads which did not reach the safepoint:
>> > # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0 tid=0x00007f8785bb1f30 nid=0x4e14 runnable [0x0000000000000000]
>> > java.lang.Thread.State: RUNNABLE
>> >
>> > # SafepointSynchronize::begin: (End of list)
>> > vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
>> > 58099.941: G1IncCollectionPause [ 447 1 1 ] [ 3304 0 3305 1 190 ] 1
>> 
>> If that safepoint happens to be a garbage collection (which this one was),
>> you can also see it in GC logs:
>> 
>> > 2016-10-07T13:19:50.029+0000: 58103.440: Total time for which application
>> > threads were stopped: 3.4971808 seconds, Stopping threads took: 3.3050644
>> > seconds
>> 
>> In this way, JVM safepoints become a powerful weapon for transmuting a
>> single thread's slow I/O into the entire JVM's lockup.
>> 
>> Does all of the above sound correct? 
>> 
>> Mitigations: 
>> 
>> 1) don't tolerate block devices that are slow 
>> 
>> This is easy in theory, and only somewhat difficult in practice. Tools like
>> perf and iosnoop [2] can do a pretty good job of letting you know when a
>> block device is slow.
>> 
>> It is sad, though, because this makes running Cassandra on mixed hardware
>> (e.g. fast SSD and slow disks in a JBOD) quite unappetizing.
>> 
>> 2) have fewer safepoints 
>> 
>> Two of the biggest sources of safepoints are garbage collection and
>> revocation of biased locks. Evidence points toward biased locking being
>> unhelpful for Cassandra's purposes, so turning it off
>> (-XX:-UseBiasedLocking) is a quick way to eliminate one source of
>> safepoints.
>> 
>> Garbage collection, on the other hand, is unavoidable. Running with
>> increased heap size would reduce GC frequency, at the cost of page cache.
>> But sacrificing page cache would increase page fault frequency, which is
>> another thing we're trying to avoid! I don't view this as a serious option.
>> 
>> 3) use a different IO strategy 
>> 
>> Looking at the Cassandra source code, there appears to be an
>> un(der)documented configuration parameter called disk_access_mode. It
>> appears that changing this to 'standard' would switch to using pread() and
>> pwrite() for I/O, instead of mmap. I imagine there would be a throughput
>> penalty here for the case when pages are in the disk cache.
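>> 
>> For comparison, the 'standard' path amounts to something like the following
>> sketch (mine, not Cassandra's actual code): positional reads through
>> FileChannel.read(), which bottom out in a native pread(), so a slow device
>> stalls the thread inside a JNI call that a safepoint does not have to wait
>> for.
>> 
>>     import java.io.IOException;
>>     import java.nio.ByteBuffer;
>>     import java.nio.channels.FileChannel;
>> 
>>     public class PreadSketch {
>>         static ByteBuffer readAt(FileChannel ch, long offset, int size)
>>                 throws IOException {
>>             ByteBuffer dst = ByteBuffer.allocate(size);
>>             while (dst.hasRemaining()) {
>>                 // pread(2) under the hood; a stall here happens in native code
>>                 int n = ch.read(dst, offset + dst.position());
>>                 if (n < 0) break; // EOF before the buffer was filled
>>             }
>>             dst.flip();
>>             return dst;
>>         }
>>     }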
>> 
>> Is this a serious option? It seems far too underdocumented to be thought
>> of as a contender.
>> 
>> 4) modify the JVM 
>> 
>> This is a longer-term option. For the purposes of safepoints, perhaps the
>> JVM could treat reads from an mmapped file in the same way it treats
>> threads that are running JNI code. That is, the safepoint would proceed
>> even though the reading thread has not "joined in". Upon finishing its
>> mmapped read, the reading thread would test the safepoint page (check
>> whether a safepoint is in progress, in other words).
>> 
>> Conclusion: 
>> 
>> I don't imagine there's an easy solution here. I plan to go ahead with
>> mitigation #1: "don't tolerate block devices that are slow", but I'd
>> appreciate any approach that doesn't require my hardware to be flawless
>> all the time.
>> 
>> Josh 
>> 
>> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100 
>> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 
>> [2] https://github.com/brendangregg/perf-tools/blob/master/iosnoop
>> 
>> 
>> 
> 
