On Thu, Jul 30, 2015 at 5:50 PM, Aurelien Jarno <aurel...@aurel32.net> wrote:
> On 2015-07-30 10:55, Aurelien Jarno wrote:
>> On 2015-07-30 10:16, Dennis Luehring wrote:
>> > Am 30.07.2015 um 09:52 schrieb Aurelien Jarno:
>> > >On 2015-07-30 05:52, Dennis Luehring wrote:
>> > >> Am 29.07.2015 um 17:01 schrieb Aurelien Jarno:
>> > >> >The point is that emulation has a cost, and it's quite difficult
>> > >> >to lower it and thus improve the emulation speed.
>> > >>
>> > >> so it's just not strange for you to see 1/100...200 of the native
>> > >> x64 speed under qemu/SPARC64
>> > >> I hoped that someone would jump up and shout "it's impossible - it
>> > >> needs to be a bug" ...sadly not
>> > >
>> > >Overall the ratio is more around 10, but in some specific cases where
>> > >the TB cache is inefficient and TBs can't be linked, or with an
>> > >inefficient MMU, a ratio of 100 is possible.
>> >
>> > sysbench (0.4.12) --num-threads=1 --test=cpu --cpu-max-prime=2000 run
>> >
>> >   Host x64:     1.3580s
>> >   Qemu SPARC64: 184.2532s
>> >
>> > sysbench shows a ratio of nearly 200
>>
>> Note that when you say SPARC64 here, it's actually only the kernel; you
>> are using a 32-bit userland. And that makes a difference. Here are my
>> tests here:
>>
>>   host (x86-64)                   0.8976s
>>   sparc32 guest (sparc64 kernel) 99.6116s
>>   sparc64 guest (sparc64 kernel)  4.4908s
>>
>> So it looks like the 32-bit code is not QEMU friendly. I haven't looked
>> at it yet, but I guess it might be due to dynamic jumps, so that TBs
>> can't be chained.
>
> This is the corresponding C code from sysbench, which is run 10000
> times:
>
> | int cpu_execute_request(sb_request_t *r, int thread_id)
> | {
> |   unsigned long long c;
> |   unsigned long long l, t;
> |   unsigned long long n = 0;
> |   log_msg_t      msg;
> |   log_msg_oper_t op_msg;
> |
> |   (void)r; /* unused */
> |
> |   /* Prepare log message */
> |   msg.type = LOG_MSG_TYPE_OPER;
> |   msg.data = &op_msg;
> |
> |   /* So far we're using very simple test prime number tests in 64bit */
> |   LOG_EVENT_START(msg, thread_id);
> |
> |   for (c = 3; c < max_prime; c++)
> |   {
> |     t = sqrt(c);
> |     for (l = 2; l <= t; l++)
> |       if (c % l == 0)
> |         break;
> |     if (l > t)
> |       n++;
> |   }
> |
> |   LOG_EVENT_STOP(msg, thread_id);
> |
> |   return 0;
> | }
>
> This is a very simple test, which is probably not a good representation
> of CPU performance, even more so when emulated by QEMU. In addition to
> that, given that it mostly uses 64-bit integers, it's kind of expected
> that the 32-bit version is slower.
>
> Anyway, I have extracted this code into a C file (see attached file)
> that can more easily be compiled to 32 or 64 bit using -m32 or -m64. I
> observe the same behavior as with sysbench, even with qemu-user (which
> is not surprising, as the above code doesn't really put pressure on the
> MMU).
>
> Running it I get the following times:
>
>   x86-64 host        0.877s
>   sparc guest -m32   1m39s
>   sparc guest -m64   3.5s
>   opensparc T1 -m32  1m59s
>   opensparc T1 -m64  1m12s
>
> So overall QEMU is faster than not-so-old real hardware. That said,
> looking at it quickly, it seems that some of the FP instructions are
> actually trapped and emulated by the kernel on the opensparc T1.
>
> Now coming back to the QEMU problem: the issue is that the 64-bit code
> uses the udivx instruction to compute the modulo, while the 32-bit code
> calls the __umoddi3 GCC helper.
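A quick way to see that difference, using a reduction of my own rather
than Aurelien's attached file, is a bare 64-bit modulo:

  /* mod64.c - how gcc computes a 64-bit modulo on sparc.
     (hypothetical test case, reduced from the sysbench loop above) */
  unsigned long long mod64(unsigned long long c, unsigned long long l)
  {
      return c % l;
  }

Compiling it with -O2 -S should show an inline udivx-based sequence
with -m64, but a call to the libgcc __umoddi3 helper with -m32.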
Actually this looks like a bug/missing feature in gcc. Why doesn't it
use the udivx instruction in "SPARC32PLUS, V8+ Required" code?

> It uses a lot of integer functions based on CPU flags, so most of the
> time is spent computing them in helper_compute_psr. I wonder if this
> can be optimized.

I guess most RISC CPUs would have a similar problem. Unlike on x86, the
compilers there usually emit flag-setting instructions only when the
flags are actually going to be used. If an instruction modifies the
flags, the flags will be used for sure, so it probably makes little
sense to postpone the flag computation? (See the sketch after my
signature for what I mean by postponing.)

Artyom

--
Regards,
Artyom Tarasenko

SPARC and PPC PReP under qemu blog: http://tyom.blogspot.com/search/label/qemu
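PS: by "postponing the flag computation" I mean a lazy scheme along
these lines (a toy sketch of the general technique, not QEMU's actual
CC_OP machinery):

  /* A flag-setting instruction only records what it did; the PSR
     flags are derived later, and only if a conditional branch (or an
     explicit flag read) actually needs them. */
  enum cc_op { CC_OP_ADD, CC_OP_SUB };  /* one entry per flag-setting op */

  struct lazy_cc {
      enum cc_op op;
      unsigned long src1, src2, dst;
  };

  /* subcc: O(1) bookkeeping instead of computing N, Z, V, C eagerly */
  static void do_subcc(struct lazy_cc *cc, unsigned long a, unsigned long b)
  {
      cc->op   = CC_OP_SUB;
      cc->src1 = a;
      cc->src2 = b;
      cc->dst  = a - b;
  }

  /* be/bne: derive just the Z flag, just in time */
  static int cc_is_zero(const struct lazy_cc *cc)
  {
      return cc->dst == 0;
  }

On an x86 guest this pays off because most recorded flags are
overwritten before anything reads them; if SPARC compilers only emit
subcc when a branch is about to consume the flags, every record gets
forced anyway, which is my point above.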