The JIT was precisely what I suspected was causing this. One step that may be 
necessary is to sync the disk image before taking your checkpoint.
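
One way to make the sync explicit is to put it in the script the guest runs 
right before checkpointing. A minimal sketch, assuming the guest has the m5 
utility installed at /sbin/m5 (the path and the function name are illustrative 
and do not come from this thread):

```python
def checkpoint_script(m5_path="/sbin/m5"):
    """Return a guest shell script that flushes the disk, then checkpoints.

    m5_path is an assumed location for the m5 utility inside the guest.
    """
    lines = [
        "#!/bin/sh",
        "sync",                   # flush pending writes to the block device
        f"{m5_path} checkpoint",  # ask gem5 to take the checkpoint
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(checkpoint_script(), end="")
```

The ordering is the only point here: the sync has to complete before the 
checkpoint request is issued.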

gem5’s debug flags should help you identify something like a hang, for example 
an ExecAll trace. A SyscallAll trace would most likely help you understand 
better what the JIT is doing.
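
As a rough sketch of how such a trace could be requested, the helper below 
builds a gem5 command line with tracing enabled. It assumes the standard gem5 
options --debug-flags, --debug-file and --debug-start; the helper name, the 
trace file name and any start tick you pass are placeholders:

```python
def with_debug_trace(gem5_binary, script_args, flags, trace_file, start_tick=None):
    """Build a gem5 argv with debug tracing enabled.

    gem5's own options must come before the config script, so they are
    inserted right after the binary; script_args (the config script plus
    its options) is appended unchanged.
    """
    argv = [
        gem5_binary,
        f"--debug-flags={','.join(flags)}",
        f"--debug-file={trace_file}",
    ]
    if start_tick is not None:
        # Start tracing only at this tick to keep the trace small.
        argv.append(f"--debug-start={start_tick}")
    return argv + list(script_args)
```

For example, with_debug_trace("build/X86/gem5.opt", ["configs/example/fs.py", 
"--cpu-type=DerivO3CPU"], ["ExecAll"], "exec.out") returns a list that can be 
passed to subprocess.run. An ExecAll trace grows very quickly, so limiting it 
with a start tick close to the fault is usually worthwhile.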

From: gem5-users <[email protected]> On Behalf Of Da Zhang
Sent: Thursday, July 19, 2018 11:15 AM
To: gem5 users mailing list <[email protected]>
Subject: Re: [gem5-users] dacapo (java) benchmark suite encounters "SIGSEGV" 
and "null exception" during timing mode (fs mode) after restarting from a 
checkpoint

Thanks for the suggestions.
I have been trying a couple of solutions (I only test a small subset of the 
DaCapo benchmark suite that segfaults with O3CPU):

1. using TimingSimpleCPU: no segfaults
2. disable the COW layer and write to the disk image when taking the 
checkpoint: there are still segfaults
3. take checkpoints with the JIT compiler disabled (20x slowdown): no segfaults
4. take checkpoints during atomic mode (without warming up the JIT): no 
segfaults
5. take checkpoints with Java compressed oops disabled: there are still 
segfaults
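
For readers who want to reproduce experiments 3 and 5, the JVM variants can be 
sketched as follows. The thread does not say which flags were actually used; 
this assumes the standard HotSpot options -Xint (interpreter only, no JIT) and 
-XX:-UseCompressedOops. The jar name and the benchmark name are placeholders:

```python
# JVM flag sets for each experiment; these flags are an assumption,
# not the exact options used in the thread.
JVM_VARIANTS = {
    "default": [],
    "no_jit": ["-Xint"],  # run in the interpreter only
    "no_compressed_oops": ["-XX:-UseCompressedOops"],
}


def java_command(variant, jar="dacapo.jar", benchmark="xalan"):
    """Return the guest command line for one JVM variant.

    jar and benchmark are hypothetical defaults for illustration.
    """
    return ["java", *JVM_VARIANTS[variant], "-jar", jar, benchmark]
```

Keeping the variants in one table makes it easy to take one checkpoint per 
configuration and compare which of them segfault after restore.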

One thing that I can't tell is if the benchmark hangs since there is no 
printing during the execution. Is there a statistic I can use to tell if the 
benchmark hangs?

So far, all my experiments use 1 CPU (even though some benchmarks are 
multithreaded). I attempted to take some checkpoints with more CPUs using the 
KVM CPU, but unfortunately I got some "rcu_sched self-detected stall on CPU" 
issues. Any idea?

On Mon, Jul 16, 2018 at 5:47 PM Gutierrez, Anthony <[email protected]> wrote:
Da,

Do you encounter the segfault only when restoring from a checkpoint? That is, 
if you do not use checkpoints can any DaCapo benchmark successfully complete 
under one of the simple CPU models (and not just KVM CPU)?

If so, you may want to get a syscall trace (e.g., using strace) to see what 
sorts of files the JVM is trying to read etc. It’s possible that the VM 
generates some files that it will read back later. If you use checkpoints, due 
to the disk image COW layer, I do not believe any disk updates are 
checkpointed, thus these files will not persist, which could lead to some weird 
segfault issues. Not sure if this is happening in your case, but it may be 
worth investigating.
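
A minimal sketch of how such a trace could be mined for files the JVM creates 
or writes, assuming the usual strace text format for open and openat calls 
(the function name is illustrative):

```python
import re

# Matches e.g.  open("/tmp/x", O_WRONLY|O_CREAT, 0644) = 3
#          or   openat(AT_FDCWD, "/tmp/x", O_RDONLY|O_CLOEXEC) = 3
OPEN_RE = re.compile(
    r'\b(?:open|openat)\((?:[^",]+,\s*)?"(?P<path>[^"]*)",\s*(?P<flags>[A-Z_|]+)'
)

WRITE_FLAGS = {"O_WRONLY", "O_RDWR", "O_CREAT"}


def written_files(strace_lines):
    """Return paths opened for writing or creation, in first-seen order."""
    paths = []
    for line in strace_lines:
        match = OPEN_RE.search(line)
        if match is None:
            continue
        flags = set(match.group("flags").split("|"))
        if flags & WRITE_FLAGS and match.group("path") not in paths:
            paths.append(match.group("path"))
    return paths
```

Running this over a trace captured with something like strace -f -o on the 
java process gives a list of candidate files to check for on the disk image 
after restoring from the checkpoint.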

I created some of the original Android disk images, and the original DaCapo 
image, and at that time I would typically run the benchmarks thru the FS mode 
and Atomic CPU once, with the COW layer disabled, in order to generate the 
needed files on the disk image and have them persist. This was entirely for 
performance, however, to prevent the VMs from regenerating the same files for 
each run, but I can envision it causing issues during runtime as well. In 
particular, it seems your code is faulting while doing some XML 
serializing/deserializing; perhaps the XML file it is looking for is gone?

Beyond that, assuming it is a real bug in gem5, I would recommend an ExecAll 
trace to figure out why the instruction at that PC is faulting.

-Tony

From: gem5-users [mailto:[email protected]] On Behalf Of Da Zhang
Sent: Monday, July 16, 2018 1:50 PM
To: gem5 users mailing list <[email protected]>
Subject: Re: [gem5-users] dacapo (java) benchmark suite encounters "SIGSEGV" 
and "null exception" during timing mode (fs mode) after restarting from a 
checkpoint

Hey Jason,

There are a bunch of "warn: instruction 'prefetch_nta' unimplemented" messages 
in atomic mode, during which the Java benchmarks don't crash. However, there 
are no such warnings during timing mode. Does that imply that unimplemented 
instructions are not the cause of the problem? Any clues or suggestions for 
debugging these problems?

best,
Da Zhang



On Mon, Jul 16, 2018 at 1:32 PM Jason Lowe-Power <[email protected]> wrote:
Hello,

Are you seeing any warnings like "warn: Instruction XXX not implemented"?

There are many X86 SIMD instructions that are currently unimplemented. I would 
bet that your application is using some of those instructions and getting 0's 
as the output instead of the correct value.
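
To check which instructions are affected, the simulator output can be scanned 
for these warnings. A minimal sketch, assuming the message format quoted in 
this thread ("warn: instruction 'prefetch_nta' unimplemented"); the function 
name is illustrative:

```python
import re
from collections import Counter

# Format taken from the warning quoted in this thread; other gem5
# versions may word the message differently.
WARN_RE = re.compile(r"warn: instruction '([^']+)' unimplemented", re.IGNORECASE)


def unimplemented_instructions(log_lines):
    """Count how often each unimplemented instruction is reported."""
    counts = Counter()
    for line in log_lines:
        match = WARN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of gem5's captured stderr gives a per-instruction count, 
which helps separate harmless hints such as prefetches from instructions that 
produce data.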

The "right" way to solve this problem is to implement these instructions (and 
we would really appreciate it if you contribute your fixes back on 
https://gem5-review.googlesource.com). The other option is to recompile your 
applications without SIMD extensions (e.g., -march=athlon64 or whatever is the 
original x86-64 name in GCC). However, this likely requires compiling all of 
the Java runtime in your case.

Cheers,
Jason

On Mon, Jul 16, 2018 at 10:11 AM Da Zhang <[email protected]> wrote:
To clarify, the "SIGSEGV and null exceptions" happen in the benchmark suite, 
not in gem5. gem5 itself runs without errors. But in the 
system.pc.com_1.device files, I observe that most of the benchmarks crash due 
to SIGSEGV or null exceptions.
Example:
"
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f81d17742b7, pid=1474, tid=0x00007f81cf46d700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_171-b11) (build 1.8.0_171-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.171-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 1815 C2 org.apache.xml.serializer.ToHTMLStream.endElement(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)V (389 bytes) @ 0x00007f81d17742b7 [0x00007f81d1774280+0x37]
#
#
"

On Mon, Jul 16, 2018 at 11:39 AM Da Zhang <[email protected]> wrote:
Hey guys,

I am testing a Java benchmark suite, DaCapo, on gem5 in FS mode. 
Unfortunately, I encounter a lot of SIGSEGVs and null exceptions during timing 
mode after restoring from checkpoints.
I am using Linux kernel v4.8.13 and ubuntu-server-16.04.1 with Oracle JDK 
v8.0_171-b11. To rule out the influence of my modifications to gem5 src/ and 
configs/, I re-downloaded gem5 and checked out commit 
"ee2ffdc0fdb489767768e5273a4ccd7b51735c7c", which is the gem5 version I am 
working on. The checkpoint was taken using the KVM CPU with 1 CPU and 16GB of 
memory. For the simulation, I use build/X86/gem5.opt (in order to enable 
assertions) in FS mode (configs/example/fs.py). Other options include 
"--cpu-type=DerivO3CPU -n 1 --mem-size=16GB --caches --l2cache 
--l2_size=${L2SIZE}" (I tried L2SIZE from 256KB to 8MB). With 100ms of warmup 
and 1ps of real simulation time, no errors appear. But with a longer real 
simulation time, the benchmark suite crashes with a segfault.
I am able to run the DaCapo benchmark suite in FS mode with the KVM CPU 
without any segfaults or exceptions. I have also tested some simple Java 
benchmarks; neither segfaults nor exceptions appear.
Does anyone have suggestions for, or experience with, these issues?
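
For reference, the configuration described above can be sketched as a small 
command builder that sweeps the L2 size. The fixed options are the ones quoted 
in this message; the intermediate L2 sizes, the "kB" size spelling and the 
function names are assumptions, and the checkpoint-restore and warmup options 
are left out because the message does not list them:

```python
def fs_command(l2_size, gem5="build/X86/gem5.opt", script="configs/example/fs.py"):
    """Return the gem5 argv for one run with the options quoted above."""
    return [
        gem5, script,
        "--cpu-type=DerivO3CPU", "-n", "1", "--mem-size=16GB",
        "--caches", "--l2cache", f"--l2_size={l2_size}",
    ]


def l2_sweep(sizes_kb=(256, 512, 1024, 2048, 4096, 8192)):
    """Return one command per L2 size, from 256kB up to 8MB (8192kB).

    The power-of-two steps in between are an assumption.
    """
    return [fs_command(f"{kb}kB") for kb in sizes_kb]
```

Each returned list can be extended with the restore options for a given 
checkpoint and passed to subprocess.run, so that every L2 size is tested 
against the same checkpoint.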

best,
Da Zhang
_______________________________________________
gem5-users mailing list
[email protected]<mailto:[email protected]>
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users