Hi Shaokun,
On 01/06/18 10:56, Zhangshaokun wrote:
Hi Ramana,
Sorry for the late reply; I was away on a short leave.
On 2018/5/23 18:41, Ramana Radhakrishnan wrote:
On 23/05/2018 03:50, Zhangshaokun wrote:
Hi Ramana,
On 2018/5/22 18:28, Ramana Radhakrishnan wrote:
On Tue, May 22, 2018 at 9:40 AM, Shaokun Zhang
<zhangshao...@hisilicon.com> wrote:
tsv110 is designed by HiSilicon and supports v8_4A. It also optimizes the
L1 Icache so that it can access the L1 Dcache.
Therefore, DC CVAU is not necessary in __aarch64_sync_cache_range for
tsv110. Is there a good way to skip the DC CVAU operation for tsv110?
A solution would be to use an ifunc but on a cpu variant.
Could you explain ifunc further?
As for CPU variants: HiSilicon tsv110 has two versions, with CPU variants
0 and 1. Both are expected to skip the DC CVAU operation when syncing the
icache and dcache.
Since DC CVAU is not needed to sync the icache and dcache on tsv110, it is
beneficial for performance to skip the redundant DC CVAU and do IC IVAU only.
The JVM calls __clear_cache many times.
Thanks for some more detail as to where you think you want to use this. Have
you investigated whether the jvm can actually elide such a call rather than
trying to fix this in the toolchain ?
In fact, we (HiSilicon) want to optimize away DC CVAU not only in the GNU
toolchain, but also in LLVM and others.
If you really need to think about solutions in the toolchain -
The simplest first step would be to implement the changes hinted at by the
comment in aarch64.h .
If you read the comment above CLEAR_INSN_CACHE in aarch64.h you would see that
/* This definition should be relocated to aarch64-elf-raw.h. This macro
should be undefined in aarch64-linux.h and a clear_cache pattern
implmented to emit either the call to __aarch64_sync_cache_range()
directly or preferably the appropriate sycall or cache clear
instructions inline. */
#define CLEAR_INSN_CACHE(beg, end) \
extern void __aarch64_sync_cache_range (void *, void *); \
__aarch64_sync_cache_range (beg, end)
Thus I would expect that by implementing the clear_cache pattern and deciding
whether to put out the call to the __aarch64_sync_cache_range function or not
depending on whether you had the tsv110 chosen on the command line would allow
you to have an idea of what the performance gain actually is by compiling the
jvm with -mcpu=tsv110 vs -march=armv8-a. You probably also want to clean up the
trampoline_init code while you are here.
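For context, here is a minimal sketch of how a JIT-style caller (such as a JVM) reaches this path. The function name publish_code is hypothetical and used only for illustration; __builtin___clear_cache is the GCC/Clang builtin, which on AArch64 GNU/Linux is expected to end up in __aarch64_sync_cache_range through libgcc's __clear_cache.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy freshly generated machine code into a buffer
   and make it visible to instruction fetch.  A real JIT would also need
   the buffer to be mapped executable; that part is omitted here.  */
size_t
publish_code (unsigned char *dst, const unsigned char *src, size_t len)
{
  /* The new instructions are written through the data side.  */
  memcpy (dst, src, len);

  /* Ask the toolchain to synchronize the instruction and data caches
     for [dst, dst + len).  */
  __builtin___clear_cache ((char *) dst, (char *) dst + len);
  return len;
}
```

Every such call pays for both the DC CVAU and the IC IVAU loops today, which is why a caller that does this very often would notice the difference.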
Thanks for your nice explanation and guidance.
For our next-generation CPU core, tsv200, we will also optimize IC IVAU so
that there is no need to flush the Icache, and clear_cache can be kept as a
NOP function. Should I take this into account? Or maybe I have missed
something in what you said.
I've had a look at the __clear_cache implementation and investigated these
cache coherency bits.
If clearing the instruction cache means you don't need to explicitly clear the
data cache then the IDC bit of the CTR_EL0 register will be set to 1. This is
how you can identify that you can avoid the explicit "DC CVAU" in
__clear_cache.
Have a look at the D10.2.33 section in the Arm Architecture Reference Manual
Issue C.a [1] for more documentation.
To implement this elision in libgcc you'd need to extend
__aarch64_sync_cache_range in config/aarch64/sync_cache.c to read the IDC bit
from CTR_EL0.
The code there already reads CTR_EL0 and caches its value, so you just need to
extract that bit and use it to decide whether to perform the "DC CVAU" loop.
But that should be a patch on its own.
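A rough sketch of what that check could look like follows. It is not the actual libgcc code: the names read_ctr_el0, ctr_needs_dc_cvau, ctr_needs_ic_ivau and sketch_sync_cache_range are hypothetical, and the value is not cached as libgcc does. The bit positions assume the CTR_EL0 layout in the Arm ARM (IDC is bit 28, DminLine is bits [19:16], IminLine is bits [3:0]). The DIC check (bit 29) is an addition not discussed above; it covers the case where the Icache invalidate is not needed either.

```c
#include <stdint.h>

/* Read CTR_EL0 on AArch64; elsewhere return 0 so the sketch still builds.  */
static inline uint64_t
read_ctr_el0 (void)
{
#if defined (__aarch64__)
  uint64_t ctr;
  __asm__ volatile ("mrs %0, ctr_el0" : "=r" (ctr));
  return ctr;
#else
  return 0;
#endif
}

/* IDC (bit 28) set: no Dcache clean to PoU needed for I/D coherence.  */
static inline int ctr_needs_dc_cvau (uint64_t ctr) { return ((ctr >> 28) & 1) == 0; }
/* DIC (bit 29) set: no Icache invalidate to PoU needed.  */
static inline int ctr_needs_ic_ivau (uint64_t ctr) { return ((ctr >> 29) & 1) == 0; }
/* Line sizes are encoded as log2 of the number of 4-byte words.  */
static inline unsigned ctr_dcache_line_bytes (uint64_t ctr) { return 4u << ((ctr >> 16) & 0xf); }
static inline unsigned ctr_icache_line_bytes (uint64_t ctr) { return 4u << (ctr & 0xf); }

void
sketch_sync_cache_range (void *beg, void *end)
{
#if defined (__aarch64__)
  uint64_t ctr = read_ctr_el0 ();
  uintptr_t a;

  if (ctr_needs_dc_cvau (ctr))
    {
      unsigned d = ctr_dcache_line_bytes (ctr);
      for (a = (uintptr_t) beg & ~(uintptr_t) (d - 1); a < (uintptr_t) end; a += d)
        __asm__ volatile ("dc cvau, %0" : : "r" (a) : "memory");
    }
  __asm__ volatile ("dsb ish" : : : "memory");

  if (ctr_needs_ic_ivau (ctr))
    {
      unsigned i = ctr_icache_line_bytes (ctr);
      for (a = (uintptr_t) beg & ~(uintptr_t) (i - 1); a < (uintptr_t) end; a += i)
        __asm__ volatile ("ic ivau, %0" : : "r" (a) : "memory");
      __asm__ volatile ("dsb ish" : : : "memory");
    }
  __asm__ volatile ("isb" : : : "memory");
#else
  (void) beg;
  (void) end;
#endif
}
```

Because the decision comes from an architectural register rather than from the CPU name, the same binary would skip the loop on any core that reports IDC, not only on tsv110.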
Your patch to add a tsv110 entry into aarch64-cores.def can be respun and
reviewed separately.
Thanks,
Kyrill
[1]
https://developer.arm.com/products/architecture/a-profile/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
Thanks,
Shaokun
I do think that's something that should be easy enough to do and the subject of
a patch series in its own right. If your users can rebuild the world for tsv110
then this is sufficient.
If you want to have a single jvm binary without any run time checks, then you
need to investigate the use of ifuncs, which are a mechanism in the GNU
toolchain for this kind of dispatch. We tend not to use ifuncs on a per-CPU
basis unless there is a very good reason and the performance improvement is
worth it (they are more likely on a per-architecture basis), and you will
need to make the case for it, including what sort of performance benefits it
gives. An introduction to this feature can be found here.
https://sourceware.org/glibc/wiki/GNU_IFUNC
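To make the ifunc idea concrete, here is a hedged sketch of a per-CPU resolver. All function names are hypothetical and the two variants have placeholder bodies. It assumes that on AArch64 GNU/Linux the resolver receives AT_HWCAP as its first argument, that HWCAP_CPUID (assumed to be bit 11) means the kernel emulates MIDR_EL1 reads from user space, and that implementer 0x48 with part number 0xd01 identifies tsv110. None of this comes from the patch under discussion.

```c
#include <stdint.h>

typedef void sync_fn (void *, void *);

/* Hypothetical variants; real ones would contain the cache maintenance
   loops with and without the DC CVAU pass.  */
static void sync_dc_and_ic (void *beg, void *end) { (void) beg; (void) end; }
static void sync_ic_only (void *beg, void *end) { (void) beg; (void) end; }

/* MIDR_EL1 layout: implementer in bits [31:24], part number in [15:4].  */
#define MIDR_IMPLEMENTER(midr) (((midr) >> 24) & 0xff)
#define MIDR_PARTNUM(midr) (((midr) >> 4) & 0xfff)

/* ASSUMPTION: 0x48 / 0xd01 is HiSilicon tsv110.  The variant field is
   ignored, so variants 0 and 1 are treated alike.  */
static sync_fn *
select_sync (uint64_t midr)
{
  if (MIDR_IMPLEMENTER (midr) == 0x48 && MIDR_PARTNUM (midr) == 0xd01)
    return sync_ic_only;
  return sync_dc_and_ic;
}

#if defined (__aarch64__) && defined (__linux__)
#ifndef HWCAP_CPUID
#define HWCAP_CPUID (1UL << 11)   /* assumed value, see lead-in */
#endif

/* Runs once at load time; the dynamic linker binds the symbol below to
   whichever function this returns.  */
static sync_fn *
resolve_sync (uint64_t hwcap)
{
  uint64_t midr = 0;
  if (hwcap & HWCAP_CPUID)
    __asm__ ("mrs %0, midr_el1" : "=r" (midr));
  return select_sync (midr);
}

void demo_sync_cache_range (void *, void *)
  __attribute__ ((ifunc ("resolve_sync")));
#endif
```

The selection is made once when the symbol is bound, so later calls to demo_sync_cache_range carry no per-call check. The cost is a CPU-specific table to maintain, which is the concern raised above.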
regards
Ramana
Hi ARM guys,
would you be happy to share your ideas about this?
Is this really that important for performance and on what workloads ?
Since DC CVAU is not needed to sync the icache and dcache on tsv110, it is
beneficial for performance to skip the redundant DC CVAU and do IC IVAU only.
The JVM calls __clear_cache many times.
Thanks,
Shaokun
regards
Ramana
Any thoughts and ideas are welcome.
Shaokun Zhang (1):
[aarch64] Add HiSilicon tsv110 CPU support.
gcc/ChangeLog | 9 +++
gcc/config/aarch64/aarch64-cores.def | 5 ++
gcc/config/aarch64/aarch64-cost-tables.h | 103 +++++++++++++++++++++++++++++++
gcc/config/aarch64/aarch64-tune.md | 2 +-
gcc/config/aarch64/aarch64.c | 79 ++++++++++++++++++++++++
gcc/doc/invoke.texi | 2 +-
6 files changed, 198 insertions(+), 2 deletions(-)
--
2.7.4