Hi,

This is starting to get into dev list territory.

Interesting idea to touch every 4K page you are going to read. You could use this trick to minimize the cost of the touches: http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652

Maybe faster than doing buffered IO, although without prefetching it's a lot of cache and TLB misses. There is also a system call to page the memory in (madvise with MADV_WILLNEED, say), which might be better for larger reads. Still no guarantee things stay cached, though.
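To make the touching concrete, here is a rough, untested sketch of the plain-Java flavor (class and method names are mine, nothing from C*). The caveat is that the faults still happen in JITed code, so this prefetches but does not by itself fix time-to-safepoint; for that the same loop has to run in native code, per the link above.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class PageToucher {
        private static final int PAGE_SIZE = 4096;

        // Touch one byte per 4K page so the page faults happen up front,
        // in one predictable place, instead of at arbitrary points mid-read.
        static void touch(MappedByteBuffer buf, int offset, int length) {
            int end = Math.min(offset + length, buf.limit());
            for (int pos = offset; pos < end; pos += PAGE_SIZE) {
                buf.get(pos); // absolute get; result deliberately ignored
            }
        }

        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
                 FileChannel ch = raf.getChannel()) {
                int size = (int) Math.min(ch.size(), Integer.MAX_VALUE);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
                touch(buf, 0, size);
            }
        }
    }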
Ariel

On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
> I haven't studied the read path that carefully, but there might be a spot at the C* level, rather than the JVM level, where you could effectively do a JNI touch of the mmap region you're going to need next.
>
>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <gra...@vast.com> wrote:
>>
>> We don't use Azul's Zing, but it does have the nice feature that all threads don't have to reach safepoints at the same time. That said, we make heavy use of Cassandra (with off-heap memtables - not directly related, but it allows us a lot more GC headroom) and Solr, where we switched to mmap because it FAR outperformed the pread variants - in no case have we noticed a long time to safepoint (then again, our IO is lightning fast).
>>
>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>
>>> Linux automatically uses free memory as cache. It's not swap.
>>>
>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>
>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin <vla...@winguzone.com> wrote:
>>>>
>>>> Sorry, I'm not catching something. What page (memory) cache can exist if there is no swap file? Where are those pages written/read?
>>>>
>>>> Best regards, Vladimir Yudovin,
>>>> Winguzone (https://winguzone.com/?from=list) - Hosted Cloud Cassandra on Azure and SoftLayer. Launch your cluster in minutes.
>>>>
>>>> ---- On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg <ar...@weisberg.ws> wrote ----
>>>>> Hi,
>>>>>
>>>>> Nope, I mean page cache. Linux doesn't call the cache it maintains using free memory a file cache. It uses free (and some of the time not-so-free!) memory to buffer writes and to cache recently written/read data.
>>>>>
>>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>>>
>>>>> When Linux decides it needs free memory it can evict stuff from the page cache, flush dirty pages and then evict, or swap anonymous memory out. When you disable swap you only disable the last behavior.
>>>>>
>>>>> Maybe we are talking at cross purposes? What I meant is that increasing the heap size to reduce GC frequency is a legitimate thing to do, and it does have an impact on the performance of the page cache even if you have swap disabled.
>>>>>
>>>>> Ariel
>>>>>
>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>>>>>> > Page cache is data pending flush to disk and data cached from disk.
>>>>>>
>>>>>> Do you mean file cache?
>>>>>>
>>>>>> Best regards, Vladimir Yudovin,
>>>>>> Winguzone (https://winguzone.com/?from=list) - Hosted Cloud Cassandra on Azure and SoftLayer. Launch your cluster in minutes.
>>>>>>
>>>>>> ---- On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg <ar...@weisberg.ws> wrote ----
>>>>>>> Hi,
>>>>>>>
>>>>>>> Page cache is in use even if you disable swap. Swap is anonymous memory, and whatever else the Linux kernel supports paging out. Page cache is data pending flush to disk and data cached from disk.
>>>>>>>
>>>>>>> Given how bad the GC pauses are in C*, I don't think this is the high pole in the tent - not until key things are off heap and C* can run with CMS and get 10 millisecond GCs all day long.
>>>>>>>
>>>>>>> You can go through tuning and hardware selection to try to get more consistent IO pauses and remove outliers, as you mention, and as a user I think this is your best bet. Generally it's either bad device or filesystem behavior if you get page faults taking more than 200 milliseconds (on the order of a G1 collection).
>>>>>>>
>>>>>>> I think a JVM change to allow safepoints around memory-mapped file access is really unlikely, although I agree it would be great. I think the best hack around it is to code up your memory-mapped file access as JNI methods and find some way to get that to work. Right now, if you want to create a safepoint, a JNI method is the way to do it. The problem is that JNI methods and POJOs don't get along well.
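>>>>>>> Something of this shape, say (an untested sketch, every name invented):
>>>>>>>
>>>>>>>     public final class NativeMmapReader {
>>>>>>>         static { System.loadLibrary("mmapreader"); } // hypothetical .so
>>>>>>>
>>>>>>>         // The C side is essentially a memcpy from the mapped address.
>>>>>>>         // If it page faults, the thread is sitting in native code,
>>>>>>>         // which the JVM treats like a thread in a syscall, so it does
>>>>>>>         // not hold up time-to-safepoint.
>>>>>>>         public static native int read(long mmapAddress, long offset, byte[] dst, int len);
>>>>>>>     }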
>>>>>>>
>>>>>>> If you think about it, the reason non-memory-mapped IO works well is that it's all JNI methods, so they don't impact time to safepoint. I think there is a tradeoff between tolerance for outliers and performance.
>>>>>>>
>>>>>>> I don't know the state of the non-memory-mapped path and how reliable it is. If it were reliable and I couldn't tolerate the outliers, I would use it. I have to ask, though: why are you not able to tolerate the outliers? If you are reading and writing at quorum, how is this impacting you?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Ariel
>>>>>>>
>>>>>>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>>>>>>>> Hi Josh,
>>>>>>>>
>>>>>>>> > Running with increased heap size would reduce GC frequency, at the cost of page cache.
>>>>>>>>
>>>>>>>> Actually it's recommended to run C* with swap disabled, so if there is not enough memory the JVM fails instead of blocking.
>>>>>>>>
>>>>>>>> Best regards, Vladimir Yudovin,
>>>>>>>> Winguzone (https://winguzone.com/?from=list) - Hosted Cloud Cassandra on Azure and SoftLayer. Launch your cluster in minutes.
>>>>>>>>
>>>>>>>> ---- On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder <j...@code406.com> wrote ----
>>>>>>>>> Hello cassandra-users,
>>>>>>>>>
>>>>>>>>> I'm investigating an issue with JVMs taking a while to reach a safepoint. I'd like the list's input on confirming my hypothesis and finding mitigations.
>>>>>>>>>
>>>>>>>>> My hypothesis is that slow block devices are causing Cassandra's JVM to pause completely while attempting to reach a safepoint.
>>>>>>>>>
>>>>>>>>> Background:
>>>>>>>>>
>>>>>>>>> Hotspot occasionally performs maintenance tasks that necessitate stopping all of its threads. Threads running JITed code occasionally read from a given safepoint page. If Hotspot has initiated a safepoint, reading from that page essentially catapults the thread into purgatory until the safepoint completes (the mechanism behind this is pretty cool). Threads performing syscalls or executing native code do this check upon their return into the JVM.
>>>>>>>>>
>>>>>>>>> In this way, during the safepoint Hotspot can be sure that all of its threads are either patiently waiting for safepoint completion or in a system call.
>>>>>>>>>
>>>>>>>>> Cassandra makes heavy use of mmapped reads in normal operation. When doing mmapped reads, the JVM executes userspace code to effect a read from a file. On the fast path (when the page needed is already mapped into the process), this instruction is very fast. When the page is not cached, the CPU triggers a page fault and asks the OS to go fetch the page. The JVM doesn't even realize that anything interesting is happening: to it, the thread is just executing a mov instruction that happens to take a while.
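>>>>>>>>> You can watch this happen with a toy like the following (an untested sketch of mine, not Cassandra code). Run it once cold, e.g. after echo 1 > /proc/sys/vm/drop_caches, and once warm:
>>>>>>>>>
>>>>>>>>>     import java.io.RandomAccessFile;
>>>>>>>>>     import java.nio.MappedByteBuffer;
>>>>>>>>>     import java.nio.channels.FileChannel;
>>>>>>>>>
>>>>>>>>>     public class FaultTimer {
>>>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>>>             try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
>>>>>>>>>                  FileChannel ch = raf.getChannel()) {
>>>>>>>>>                 int size = (int) Math.min(ch.size(), Integer.MAX_VALUE);
>>>>>>>>>                 MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
>>>>>>>>>                 // Each get() is an ordinary memory load. A resident page
>>>>>>>>>                 // comes back in nanoseconds; a non-resident page blocks
>>>>>>>>>                 // the thread in a fault the JVM knows nothing about.
>>>>>>>>>                 for (int pos = 0; pos < size; pos += 4096) {
>>>>>>>>>                     long t0 = System.nanoTime();
>>>>>>>>>                     buf.get(pos);
>>>>>>>>>                     long micros = (System.nanoTime() - t0) / 1000;
>>>>>>>>>                     if (micros >= 1000) {
>>>>>>>>>                         System.out.println("page at " + pos + " took " + micros + " us");
>>>>>>>>>                     }
>>>>>>>>>                 }
>>>>>>>>>             }
>>>>>>>>>         }
>>>>>>>>>     }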
>>>>>>>>>
>>>>>>>>> The OS, meanwhile, puts the thread in question in the D state (assuming Linux here) and goes off to find the desired page. This may take microseconds, milliseconds, or seconds (or longer). When I/O occurs while the JVM is trying to enter a safepoint, every thread has to wait for the laggard I/O to complete.
>>>>>>>>>
>>>>>>>>> If you log safepoints with the right options [1], you can see these occurrences in the JVM output:
>>>>>>>>>
>>>>>>>>> > # SafepointSynchronize::begin: Timeout detected:
>>>>>>>>> > # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
>>>>>>>>> > # SafepointSynchronize::begin: Threads which did not reach the safepoint:
>>>>>>>>> > # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0 tid=0x00007f8785bb1f30 nid=0x4e14 runnable [0x0000000000000000]
>>>>>>>>> >    java.lang.Thread.State: RUNNABLE
>>>>>>>>> >
>>>>>>>>> > # SafepointSynchronize::begin: (End of list)
>>>>>>>>> >          vmop  [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
>>>>>>>>> > 58099.941: G1IncCollectionPause  [ 447 1 1 ]  [ 3304 0 3305 1 190 ]  1
>>>>>>>>>
>>>>>>>>> If that safepoint happens to be a garbage collection (which this one was), you can also see it in the GC logs:
>>>>>>>>>
>>>>>>>>> > 2016-10-07T13:19:50.029+0000: 58103.440: Total time for which application threads were stopped: 3.4971808 seconds, Stopping threads took: 3.3050644 seconds
>>>>>>>>>
>>>>>>>>> In this way, JVM safepoints become a powerful weapon for transmuting a single thread's slow I/O into the entire JVM's lockup.
>>>>>>>>>
>>>>>>>>> Does all of the above sound correct?
>>>>>>>>>
>>>>>>>>> Mitigations:
>>>>>>>>>
>>>>>>>>> 1) Don't tolerate block devices that are slow.
>>>>>>>>>
>>>>>>>>> This is easy in theory, and only somewhat difficult in practice. Tools like perf and iosnoop [2] do a pretty good job of letting you know when a block device is slow.
>>>>>>>>>
>>>>>>>>> It is sad, though, because this makes running Cassandra on mixed hardware (e.g. fast SSDs and slow disks in a JBOD) quite unappetizing.
>>>>>>>>>
>>>>>>>>> 2) Have fewer safepoints.
>>>>>>>>>
>>>>>>>>> Two of the biggest sources of safepoints are garbage collection and revocation of biased locks. Evidence points toward biased locking being unhelpful for Cassandra's purposes, so turning it off (-XX:-UseBiasedLocking) is a quick way to eliminate one source of safepoints.
>>>>>>>>>
>>>>>>>>> Garbage collection, on the other hand, is unavoidable. Running with an increased heap size would reduce GC frequency, at the cost of page cache. But sacrificing page cache would increase page fault frequency, which is another thing we're trying to avoid! I don't view this as a serious option.
>>>>>>>>>
>>>>>>>>> 3) Use a different IO strategy.
>>>>>>>>>
>>>>>>>>> Looking at the Cassandra source code, there appears to be an un(der)documented configuration parameter called disk_access_mode. It appears that changing this to 'standard' would switch to using pread() and pwrite() for I/O instead of mmap. I imagine there would be a throughput penalty here for the case when pages are in the disk cache.
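>>>>>>>>> For reference, the setting is absent from the stock cassandra.yaml, so you would have to add it by hand; judging by the source, the accepted values are auto, mmap, mmap_index_only, and standard:
>>>>>>>>>
>>>>>>>>>     # cassandra.yaml (not in the shipped file; added explicitly)
>>>>>>>>>     disk_access_mode: standard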
>>>>>>>>>
>>>>>>>>> Is this a serious option? It seems far too underdocumented to be thought of as a contender.
>>>>>>>>>
>>>>>>>>> 4) Modify the JVM.
>>>>>>>>>
>>>>>>>>> This is a longer-term option. For the purposes of safepoints, perhaps the JVM could treat reads from an mmapped file the same way it treats threads that are running JNI code. That is, the safepoint would proceed even though the reading thread has not "joined in". Upon finishing its mmapped read, the reading thread would test the safepoint page (check whether a safepoint is in progress, in other words).
>>>>>>>>>
>>>>>>>>> Conclusion:
>>>>>>>>>
>>>>>>>>> I don't imagine there's an easy solution here. I plan to go ahead with mitigation #1, "don't tolerate block devices that are slow", but I'd appreciate any approach that doesn't require my hardware to be flawless all the time.
>>>>>>>>>
>>>>>>>>> Josh
>>>>>>>>>
>>>>>>>>> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100 -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
>>>>>>>>> [2] https://github.com/brendangregg/perf-tools/blob/master/iosnoop