Hi,

This is starting to get into dev list territory.

Interesting idea to touch every 4K page you are going to read.

You could use this to minimize the cost.
http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652

Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
without prefetching, though.

There is a system call to page the memory in, which might be better for
larger reads. Still no guarantee things stay cached, though.
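
In pure Java the closest thing I know of is MappedByteBuffer.load(),
which (in OpenJDK, as far as I remember) does madvise(MADV_WILLNEED)
for the mapping and then touches a byte per page. A rough sketch, not
C* code, with the file path as a hypothetical argument:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class Prefault {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
                 FileChannel ch = raf.getChannel()) {
                // FileChannel.map() regions are capped at 2GB each.
                long len = Math.min(ch.size(), Integer.MAX_VALUE);
                MappedByteBuffer region =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, len);
                // Ask the kernel to page the mapping in, then touch one
                // byte per page so it's resident before the hot read
                // path needs it.
                region.load();
            }
        }
    }

The catch is that load()'s touch loop runs as Java code, so a fault
inside it can still stall time to safepoint; hence the appeal of doing
the touch behind JNI or the Unsafe trick above.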

Ariel


On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
> I haven’t studied the read path that carefully, but there might be a spot at 
> the C* level rather than JVM level where you could effectively do a JNI touch 
> of the mmap region you’re going to need next.
> 
>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <gra...@vast.com> wrote:
>> 
>> We don’t use Azul’s Zing, but it does have the nice feature that all threads 
>> don’t have to reach safepoints at the same time. That said, we make heavy use 
>> of Cassandra (with off-heap memtables - not directly related, but it allows us 
>> a lot more GC headroom) and SOLR, where we switched to mmap because it FAR 
>> outperformed pread variants - in no cases have we noticed long time to safe 
>> point (then again, our IO is lightning fast).
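>> 
>> For reference, the off-heap memtable knob looks like this in 
>> cassandra.yaml (from memory, so double-check the value names against 
>> your version): 
>> 
>>     memtable_allocation_type: offheap_objects 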
>> 
>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>> 
>>> Linux automatically uses free memory as cache.  It's not swap.
>>> 
>>> http://www.tldp.org/LDP/lki/lki-4.html
>>> 
>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin <vla...@winguzone.com> 
>>> wrote:
>>>> Sorry, I don't catch something. What page (memory) cache can exist if 
>>>> there is no swap file? Where are those pages written/read?
>>>> 
>>>> 
>>>> Best regards, Vladimir Yudovin, 
>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud Cassandra on 
>>>> Azure and SoftLayer.
>>>> Launch your cluster in minutes.*
>>>> 
>>>> ---- On Sat, 08 Oct 2016 14:09:50 -0400 *Ariel 
>>>> Weisberg<ar...@weisberg.ws>* wrote ---- 
>>>>> Hi,
>>>>> 
>>>>> Nope, I mean page cache. Linux doesn't call the cache it maintains using 
>>>>> free memory a "file cache". It uses free (and some of the time not so 
>>>>> free!) memory to buffer writes and to cache recently written/read data.
>>>>> 
>>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>>> 
>>>>> When Linux decides it needs free memory it can either evict stuff from 
>>>>> the page cache, flush dirty pages and then evict, or swap anonymous 
>>>>> memory out. When you disable swap you only disable the last behavior.
>>>>> 
>>>>> Maybe we are talking at cross purposes? What I meant is that increasing 
>>>>> the heap size to reduce GC frequency is a legitimate thing to do, and it 
>>>>> does have an impact on the performance of the page cache even if you have 
>>>>> swap disabled.
>>>>> 
>>>>> Ariel
>>>>> 
>>>>> 
>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>>>>>> >Page cache is data pending flush to disk and data cached from disk.
>>>>>> 
>>>>>> Do you mean file cache?
>>>>>> 
>>>>>> 
>>>>>> Best regards, Vladimir Yudovin, 
>>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud Cassandra on 
>>>>>> Azure and SoftLayer.
>>>>>> Launch your cluster in minutes.*
>>>>>> 
>>>>>> 
>>>>>> ---- On Sat, 08 Oct 2016 13:40:19 -0400 *Ariel Weisberg 
>>>>>> <ar...@weisberg.ws>* wrote ---- 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Page cache is in use even if you disable swap. Swap is for anonymous 
>>>>>>> memory, and whatever else the Linux kernel supports paging out. Page 
>>>>>>> cache is data pending flush to disk and data cached from disk.
>>>>>>> 
>>>>>>> Given how bad the GC pauses are in C*, I don't think this is the high 
>>>>>>> pole in the tent, at least not until key things are off heap and C* can 
>>>>>>> run with CMS and get 10 millisecond GCs all day long.
>>>>>>> 
>>>>>>> You can go through tuning and hardware selection to try to get more 
>>>>>>> consistent IO pauses and remove outliers, as you mention, and as a user I 
>>>>>>> think this is your best bet. Generally it's either bad device or 
>>>>>>> filesystem behavior if you get page faults taking more than 200 
>>>>>>> milliseconds, i.e. on the order of a G1 collection.
>>>>>>> 
>>>>>>> I think a JVM change to allow safepoints around memory-mapped file 
>>>>>>> access is really unlikely, although I agree it would be great. I think 
>>>>>>> the best hack around it is to code up your memory-mapped file access 
>>>>>>> as JNI methods and find some way to get that to work. Right now, if 
>>>>>>> you want to create a safepoint, a JNI method is the way to do it. The 
>>>>>>> problem is that JNI methods and POJOs don't get along well.
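>>>>>>> 
>>>>>>> A sketch of the shape that hack might take (the name and signature 
>>>>>>> are hypothetical, and the native implementation is omitted): 
>>>>>>> 
>>>>>>>     // Java side. While a thread is executing a native method the 
>>>>>>>     // JVM counts it as safe, so a page fault inside the native 
>>>>>>>     // read stalls only this thread, not the whole safepoint. 
>>>>>>>     private static native int readAt(long mmapAddress, long offset, 
>>>>>>>                                      byte[] into, int length); 
>>>>>>> 
>>>>>>> The ugly part is exactly the POJO problem: getting the bytes from 
>>>>>>> the native side into Java objects means copies and JNI marshalling 
>>>>>>> on every read. 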
>>>>>>> 
>>>>>>> If you think about it, the reason non-memory-mapped IO works well is 
>>>>>>> that it's all JNI methods, so they don't impact time to safepoint. I 
>>>>>>> think there is a tradeoff between tolerance for outliers and 
>>>>>>> performance.
>>>>>>> 
>>>>>>> I don't know the state of the non-memory mapped path and how reliable 
>>>>>>> that is. If it were reliable and I couldn't tolerate the outliers I 
>>>>>>> would use that. I have to ask though, why are you not able to tolerate 
>>>>>>> the outliers? If you are reading and writing at quorum how is this 
>>>>>>> impacting you?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Ariel
>>>>>>> 
>>>>>>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>>>>>>>> Hi Josh,
>>>>>>>> 
>>>>>>>> >Running with increased heap size would reduce GC frequency, at the 
>>>>>>>> >cost of page cache.
>>>>>>>> 
>>>>>>>> Actually it's recommended to run C* with swap disabled, so if there 
>>>>>>>> is not enough memory the JVM fails instead of blocking.
>>>>>>>> 
>>>>>>>> Best regards, Vladimir Yudovin, 
>>>>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud Cassandra 
>>>>>>>> on Azure and SoftLayer.
>>>>>>>> Launch your cluster in minutes.*
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ---- On Fri, 07 Oct 2016 21:06:24 -0400 *Josh 
>>>>>>>> Snyder<j...@code406.com>* wrote ---- 
>>>>>>>>> Hello cassandra-users, 
>>>>>>>>> 
>>>>>>>>> I'm investigating an issue with JVMs taking a while to reach a 
>>>>>>>>> safepoint.  I'd 
>>>>>>>>> like the list's input on confirming my hypothesis and finding 
>>>>>>>>> mitigations. 
>>>>>>>>> 
>>>>>>>>> My hypothesis is that slow block devices are causing Cassandra's JVM 
>>>>>>>>> to pause 
>>>>>>>>> completely while attempting to reach a safepoint. 
>>>>>>>>> 
>>>>>>>>> Background: 
>>>>>>>>> 
>>>>>>>>> Hotspot occasionally performs maintenance tasks that necessitate 
>>>>>>>>> stopping all 
>>>>>>>>> of its threads. Threads running JITed code occasionally read from a 
>>>>>>>>> given 
>>>>>>>>> safepoint page. If Hotspot has initiated a safepoint, reading from 
>>>>>>>>> that page 
>>>>>>>>> essentially catapults the thread into purgatory until the safepoint 
>>>>>>>>> completes 
>>>>>>>>> (the mechanism behind this is pretty cool). Threads performing 
>>>>>>>>> syscalls or 
>>>>>>>>> executing native code do this check upon their return into the JVM. 
>>>>>>>>> 
>>>>>>>>> In this way, during the safepoint Hotspot can be sure that all of its 
>>>>>>>>> threads 
>>>>>>>>> are either patiently waiting for safepoint completion or in a system 
>>>>>>>>> call. 
>>>>>>>>> 
>>>>>>>>> Cassandra makes heavy use of mmapped reads in normal operation. When 
>>>>>>>>> doing 
>>>>>>>>> mmapped reads, the JVM executes userspace code to effect a read from 
>>>>>>>>> a file. On 
>>>>>>>>> the fast path (when the page needed is already mapped into the 
>>>>>>>>> process), this 
>>>>>>>>> instruction is very fast. When the page is not cached, the CPU 
>>>>>>>>> triggers a page 
>>>>>>>>> fault and asks the OS to go fetch the page. The JVM doesn't even 
>>>>>>>>> realize that 
>>>>>>>>> anything interesting is happening: to it, the thread is just 
>>>>>>>>> executing a mov 
>>>>>>>>> instruction that happens to take a while. 
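>>>>>>>>> 
>>>>>>>>> To make that concrete, a minimal illustration (not C* code): 
>>>>>>>>> 
>>>>>>>>>     import java.io.RandomAccessFile; 
>>>>>>>>>     import java.nio.MappedByteBuffer; 
>>>>>>>>>     import java.nio.channels.FileChannel; 
>>>>>>>>> 
>>>>>>>>>     public class MmapRead { 
>>>>>>>>>         public static void main(String[] args) throws Exception { 
>>>>>>>>>             try (RandomAccessFile raf = 
>>>>>>>>>                      new RandomAccessFile(args[0], "r"); 
>>>>>>>>>                  FileChannel ch = raf.getChannel()) { 
>>>>>>>>>                 MappedByteBuffer buf = ch.map( 
>>>>>>>>>                     FileChannel.MapMode.READ_ONLY, 0, 
>>>>>>>>>                     Math.min(ch.size(), 4096)); 
>>>>>>>>>                 // JITed, get() is a plain memory load. If the page 
>>>>>>>>>                 // is resident it takes nanoseconds; if not, the 
>>>>>>>>>                 // CPU faults and this thread sits in D state while 
>>>>>>>>>                 // the kernel does the I/O. 
>>>>>>>>>                 byte first = buf.get(0); 
>>>>>>>>>             } 
>>>>>>>>>         } 
>>>>>>>>>     } 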
>>>>>>>>> 
>>>>>>>>> The OS, meanwhile, puts the thread in question in the D state 
>>>>>>>>> (assuming Linux, 
>>>>>>>>> here) and goes off to find the desired page. This may take 
>>>>>>>>> microseconds, this 
>>>>>>>>> may take milliseconds, or it may take seconds (or longer). When I/O 
>>>>>>>>> occurs 
>>>>>>>>> while the JVM is trying to enter a safepoint, every thread has to 
>>>>>>>>> wait for the 
>>>>>>>>> laggard I/O to complete. 
>>>>>>>>> 
>>>>>>>>> If you log safepoints with the right options [1], you can see these 
>>>>>>>>> occurrences 
>>>>>>>>> in the JVM output: 
>>>>>>>>> 
>>>>>>>>> > # SafepointSynchronize::begin: Timeout detected: 
>>>>>>>>> > # SafepointSynchronize::begin: Timed out while spinning to reach a 
>>>>>>>>> > safepoint. 
>>>>>>>>> > # SafepointSynchronize::begin: Threads which did not reach the 
>>>>>>>>> > safepoint: 
>>>>>>>>> > # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0 
>>>>>>>>> > tid=0x00007f8785bb1f30 nid=0x4e14 runnable [0x0000000000000000] 
>>>>>>>>> >    java.lang.Thread.State: RUNNABLE 
>>>>>>>>> > 
>>>>>>>>> > # SafepointSynchronize::begin: (End of list) 
>>>>>>>>> >          vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop]    page_trap_count 
>>>>>>>>> > 58099.941: G1IncCollectionPause             [     447          1              1    ]      [  3304     0  3305     1   190    ]  1 
>>>>>>>>> 
>>>>>>>>> If that safepoint happens to be a garbage collection (which this one 
>>>>>>>>> was), you 
>>>>>>>>> can also see it in GC logs: 
>>>>>>>>> 
>>>>>>>>> > 2016-10-07T13:19:50.029+0000: 58103.440: Total time for which 
>>>>>>>>> > application threads were stopped: 3.4971808 seconds, Stopping 
>>>>>>>>> > threads took: 3.3050644 seconds 
>>>>>>>>> 
>>>>>>>>> In this way, JVM safepoints become a powerful weapon for transmuting 
>>>>>>>>> a single 
>>>>>>>>> thread's slow I/O into the entire JVM's lockup. 
>>>>>>>>> 
>>>>>>>>> Does all of the above sound correct? 
>>>>>>>>> 
>>>>>>>>> Mitigations: 
>>>>>>>>> 
>>>>>>>>> 1) don't tolerate block devices that are slow 
>>>>>>>>> 
>>>>>>>>> This is easy in theory, and only somewhat difficult in practice. 
>>>>>>>>> Tools like 
>>>>>>>>> perf and iosnoop [2] can do pretty good jobs of letting you know when 
>>>>>>>>> a block 
>>>>>>>>> device is slow. 
>>>>>>>>> 
>>>>>>>>> It is sad, though, because this makes running Cassandra on mixed 
>>>>>>>>> hardware (e.g. 
>>>>>>>>> fast SSD and slow disks in a JBOD) quite unappetizing. 
>>>>>>>>> 
>>>>>>>>> 2) have fewer safepoints 
>>>>>>>>> 
>>>>>>>>> Two of the biggest sources of safepoints are garbage collection and 
>>>>>>>>> revocation 
>>>>>>>>> of biased locks. Evidence points toward biased locking being 
>>>>>>>>> unhelpful for 
>>>>>>>>> Cassandra's purposes, so turning it off (-XX:-UseBiasedLocking) is a 
>>>>>>>>> quick way 
>>>>>>>>> to eliminate one source of safepoints. 
>>>>>>>>> 
>>>>>>>>> Garbage collection, on the other hand, is unavoidable. Running with 
>>>>>>>>> increased 
>>>>>>>>> heap size would reduce GC frequency, at the cost of page cache. But 
>>>>>>>>> sacrificing 
>>>>>>>>> page cache would increase page fault frequency, which is another 
>>>>>>>>> thing we're 
>>>>>>>>> trying to avoid! I don't view this as a serious option. 
>>>>>>>>> 
>>>>>>>>> 3) use a different IO strategy 
>>>>>>>>> 
>>>>>>>>> Looking at the Cassandra source code, there appears to be an 
>>>>>>>>> un(der)documented 
>>>>>>>>> configuration parameter called disk_access_mode. It appears that 
>>>>>>>>> changing this 
>>>>>>>>> to 'standard' would switch to using pread() and pwrite() for I/O, 
>>>>>>>>> instead of 
>>>>>>>>> mmap. I imagine there would be a throughput penalty here for the case 
>>>>>>>>> when 
>>>>>>>>> pages are in the disk cache. 
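>>>>>>>>> 
>>>>>>>>> If the parameter behaves the way the source suggests, the change 
>>>>>>>>> itself is one line (hedging, since it's undocumented): 
>>>>>>>>> 
>>>>>>>>>     # cassandra.yaml 
>>>>>>>>>     disk_access_mode: standard 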
>>>>>>>>> 
>>>>>>>>> Is this a serious option? It seems far too underdocumented to be 
>>>>>>>>> thought of as 
>>>>>>>>> a contender. 
>>>>>>>>> 
>>>>>>>>> 4) modify the JVM 
>>>>>>>>> 
>>>>>>>>> This is a longer term option. For the purposes of safepoints, perhaps 
>>>>>>>>> the JVM 
>>>>>>>>> could treat reads from an mmapped file in the same way it treats 
>>>>>>>>> threads that 
>>>>>>>>> are running JNI code. That is, the safepoint will proceed even though 
>>>>>>>>> the 
>>>>>>>>> reading thread has not "joined in". Upon finishing its mmapped read, 
>>>>>>>>> the 
>>>>>>>>> reading thread would test the safepoint page (check whether a 
>>>>>>>>> safepoint is in 
>>>>>>>>> progress, in other words). 
>>>>>>>>> 
>>>>>>>>> Conclusion: 
>>>>>>>>> 
>>>>>>>>> I don't imagine there's an easy solution here. I plan to go ahead 
>>>>>>>>> with 
>>>>>>>>> mitigation #1: "don't tolerate block devices that are slow", but I'd 
>>>>>>>>> appreciate 
>>>>>>>>> any approach that doesn't require my hardware to be flawless all the 
>>>>>>>>> time. 
>>>>>>>>> 
>>>>>>>>> Josh 
>>>>>>>>> 
>>>>>>>>> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100 
>>>>>>>>> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 
>>>>>>>>> [2] https://github.com/brendangregg/perf-tools/blob/master/iosnoop 
>>>>>>> 
>>>>> 