On Mon, May 2, 2016 at 5:55 AM, Kumar, Venkataramanan
<venkataramanan.ku...@amd.com> wrote:
>> If I compile on a k8 Opteron 248 with -march=native, I do not see -mprfchw
>> listed in the options in -fverbose-asm. In the assembly, I see this:
>>
>> prefetcht0 (%rax) # ivtmp.1160
>> prefetcht0 304(%rcx) #
>> prefetcht0 (%rax) # ivtmp.1160
>
> In AMD processors -mprfchw flag is used to enable "3dnowprefetch" ISA
> support.
>
> (Snip)
> CPUID Fn8000_0001_ECX Feature Identifiers
> Bit 8
> 3DNowPrefetch: PREFETCH and PREFETCHW instruction support. See “PREFETCH” and
> “PREFETCHW” in APM3
> Ref: http://support.amd.com/TechDocs/25481.pdf
> (Snip)
>
> Can you please confirm what this CPUID flag returns on your k8 machine ?.
> I believe this ISA is not available on k8 machine so when -march=native is
> added you don’t see -mprfchw in verbose.

Looks like zero? This was generated with the cpuid program from
http://www.etallen.com/cpuid.html

CPU 0:
   0x00000000 0x00: eax=0x00000001 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
   0x00000001 0x00: eax=0x00000f58 ebx=0x00000800 ecx=0x00000000 edx=0x078bfbff
   0x80000000 0x00: eax=0x80000018 ebx=0x68747541 ecx=0x444d4163 edx=0x69746e65
   0x80000001 0x00: eax=0x00000f58 ebx=0x00000405 ecx=0x00000000 edx=0xe1d3fbff
   0x80000002 0x00: eax=0x20444d41 ebx=0x6574704f ecx=0x286e6f72 edx=0x20296d74
   0x80000003 0x00: eax=0x636f7250 ebx=0x6f737365 ecx=0x34322072 edx=0x00000038
   0x80000004 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000005 0x00: eax=0xff08ff08 ebx=0xff20ff20 ecx=0x40020140 edx=0x40020140
   0x80000006 0x00: eax=0x00000000 ebx=0x42004200 ecx=0x04008140 edx=0x00000000
   0x80000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000009
   0x80000008 0x00: eax=0x00003028 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000009 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000b 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000c 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000d 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000e 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x8000000f 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000010 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000011 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000012 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000013 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000014 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000015 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000016 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000017 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000018 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80860000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0xc0000000 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000

CPU:
   vendor_id = "AuthenticAMD"
   version information (1/eax):
      processor type = primary processor (0)
      family = Intel Pentium 4/Pentium D/Pentium Extreme Edition/Celeron/Xeon/Xeon MP/Itanium2, AMD Athlon 64/Athlon XP-M/Opteron/Sempron/Turion (15)
      model = 0x5 (5)
      stepping id = 0x8 (8)
      extended family = 0x0 (0)
      extended model = 0x0 (0)
      (simple synth) = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon 64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
   miscellaneous (1/ebx):
      process local APIC physical ID = 0x0 (0)
      cpu count = 0x0 (0)
      CLFLUSH line size = 0x8 (8)
      brand index = 0x0 (0)
   brand id = 0x00 (0): unknown
   feature information (1/edx):
      x87 FPU on chip = true
      virtual-8086 mode enhancement = true
      debugging extensions = true
      page size extensions = true
      time stamp counter = true
      RDMSR and WRMSR support = true
      physical address extensions = true
      machine check exception = true
      CMPXCHG8B inst. = true
      APIC on chip = true
      SYSENTER and SYSEXIT = true
      memory type range registers = true
      PTE global bit = true
      machine check architecture = true
      conditional move/compare instruction = true
      page attribute table = true
      page size extension = true
      processor serial number = false
      CLFLUSH instruction = true
      debug store = false
      thermal monitor and clock ctrl = false
      MMX Technology = true
      FXSAVE/FXRSTOR = true
      SSE extensions = true
      SSE2 extensions = true
      self snoop = false
      hyper-threading / multi-core supported = false
      therm. monitor = false
      IA64 = false
      pending break event = false
   feature information (1/ecx):
      PNI/SSE3: Prescott New Instructions = false
      PCLMULDQ instruction = false
      64-bit debug store = false
      MONITOR/MWAIT = false
      CPL-qualified debug store = false
      VMX: virtual machine extensions = false
      SMX: safer mode extensions = false
      Enhanced Intel SpeedStep Technology = false
      thermal monitor 2 = false
      SSSE3 extensions = false
      context ID: adaptive or shared L1 data = false
      FMA instruction = false
      CMPXCHG16B instruction = false
      xTPR disable = false
      perfmon and debug = false
      process context identifiers = false
      direct cache access = false
      SSE4.1 extensions = false
      SSE4.2 extensions = false
      extended xAPIC support = false
      MOVBE instruction = false
      POPCNT instruction = false
      time stamp counter deadline = false
      AES instruction = false
      XSAVE/XSTOR states = false
      OS-enabled XSAVE/XSTOR = false
      AVX: advanced vector extensions = false
      F16C half-precision convert instruction = false
      RDRAND instruction = false
      hypervisor guest status = false
   extended processor signature (0x80000001/eax):
      family/generation = AMD Athlon 64/Opteron/Sempron/Turion (15)
      model = 0x5 (5)
      stepping id = 0x8 (8)
      extended family = 0x0 (0)
      extended model = 0x0 (0)
      (simple synth) = AMD Opteron (DP SledgeHammer SH7-C0) / Athlon 64 FX (DP SledgeHammer SH7-C0), 940-pin, .13um
   extended feature flags (0x80000001/edx):
      x87 FPU on chip = true
      virtual-8086 mode enhancement = true
      debugging extensions = true
      page size extensions = true
      time stamp counter = true
      RDMSR and WRMSR support = true
      physical address extensions = true
      machine check exception = true
      CMPXCHG8B inst. = true
      APIC on chip = true
      SYSCALL and SYSRET instructions = true
      memory type range registers = true
      global paging extension = true
      machine check architecture = true
      conditional move/compare instruction = true
      page attribute table = true
      page size extension = true
      multiprocessing capable = false
      no-execute page protection = true
      AMD multimedia instruction extensions = true
      MMX Technology = true
      FXSAVE/FXRSTOR = true
      SSE extensions = false
      1-GB large page support = false
      RDTSCP = false
      long mode (AA-64) = true
      3DNow! instruction extensions = true
      3DNow! instructions = true
   extended brand id (0x80000001/ebx):
      raw = 0x405 (1029)
      BrandId = 0x405 (1029)
      BrandTableIndex = 0x10 (16)
      NN = 0x5 (5)
   AMD feature flags (0x80000001/ecx):
      LAHF/SAHF supported in 64-bit mode = false
      CMP Legacy = false
      SVM: secure virtual machine = false
      extended APIC space = false
      AltMovCr8 = false
      LZCNT advanced bit manipulation = false
      SSE4A support = false
      misaligned SSE mode = false
      3DNow! PREFETCH/PREFETCHW instructions = false
      OS visible workaround = false
      instruction based sampling = false
      XOP support = false
      SKINIT/STGI support = false
      watchdog timer support = false
      lightweight profiling support = false
      4-operand FMA instruction = false
      NodeId MSR C001100C = false
      TBM support = false
      topology extensions = false
   brand = "AMD Opteron(tm) Processor 248"
   L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
      instruction # entries = 0x8 (8)
      instruction associativity = 0xff (255)
      data # entries = 0x8 (8)
      data associativity = 0xff (255)
   L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
      instruction # entries = 0x20 (32)
      instruction associativity = 0xff (255)
      data # entries = 0x20 (32)
      data associativity = 0xff (255)
   L1 data cache information (0x80000005/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag = 0x1 (1)
      associativity = 0x2 (2)
      size (Kb) = 0x40 (64)
   L1 instruction cache information (0x80000005/edx):
      line size (bytes) = 0x40 (64)
      lines per tag = 0x1 (1)
      associativity = 0x2 (2)
      size (Kb) = 0x40 (64)
   L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
      instruction # entries = 0x0 (0)
      instruction associativity = L2 off (0)
      data # entries = 0x0 (0)
      data associativity = L2 off (0)
   L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
      instruction # entries = 0x200 (512)
      instruction associativity = 4-way (4)
      data # entries = 0x200 (512)
      data associativity = 4-way (4)
   L2 unified cache information (0x80000006/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag = 0x1 (1)
      associativity = 16-way (8)
      size (Kb) = 0x400 (1024)
   L3 cache information (0x80000006/edx):
      line size (bytes) = 0x0 (0)
      lines per tag = 0x0 (0)
      associativity = L2 off (0)
      size (in 512Kb units) = 0x0 (0)
   Advanced Power Management Features (0x80000007/edx):
      temperature sensing diode = true
      frequency ID (FID) control = false
      voltage ID (VID) control = false
      thermal trip (TTP) = true
      thermal monitor (TM) = false
      software thermal control (STC) = false
      100 MHz multiplier control = false
      hardware P-State control = false
      TscInvariant = false
   Physical Address and Linear Address Size (0x80000008/eax):
      maximum physical address bits = 0x28 (40)
      maximum linear (virtual) address bits = 0x30 (48)
      maximum guest physical address bits = 0x0 (0)
   Logical CPU cores (0x80000008/ecx):
      number of CPU cores - 1 = 0x0 (0)
      ApicIdCoreIdSize = 0x0 (0)
   SVM Secure Virtual Machine (0x8000000a/eax):
      SvmRev: SVM revision = 0x0 (0)
   SVM Secure Virtual Machine (0x8000000a/edx):
      nested paging = false
      LBR virtualization = false
      SVM lock = false
      NRIP save = false
      MSR based TSC rate control = false
      VMCB clean bits support = false
      flush by ASID = false
      decode assists = false
      SSSE3/SSE5 opcode set disable = false
      pause intercept filter = false
      pause filter threshold = false
   NASID: number of address space identifiers = 0x0 (0):
   (instruction supported synth):
      CMPXCHG8B = true
      conditional move/compare = true
      PREFETCH/PREFETCHW = true
   (multi-processing synth): none
   (multi-processing method): AMD
   (synth) = AMD Opteron (DP SledgeHammer SH7-C0), 940-pin, .13um Processor 248

>>
>> If I compile on a bdver2 Opteron 6386 SE with -march=k8 (thus trying to
>> target the older system), I do see it listed in the options in
>> -fverbose-asm. In the assembly, I see this:
>
> K8 has 3dnow support and there is a patch that replaced 3dnow with prefetchw
> (3DNowPrefetch).
> https://gcc.gnu.org/ml/gcc-patches/2013-05/msg00866.html
> So when you add -march=k8 you see -mprfchw getting listed in verbose.
>
>>
>> prefetcht0 (%rax) # ivtmp.1160
>> prefetcht0 304(%rcx) #
>> prefetchw (%rax) # ivtmp.1160
>>
>> (The third line is the only difference)
>>
>
> This is my guess without seeing the test case, when write prefetching is
> requested "prefetchw" is generated.
> 3dnow (TARGET_3DNOW) ISA has support for it.
>
> (Snip)
> Support for the PREFETCH and PREFETCHW instructions is indicated by CPUID
> Fn8000_0001_ECX[3DNowPrefetch] OR Fn8000_0001_EDX[LM] OR
> Fn8000_0001_EDX[3DNow] = 1.
> (Snip)
> Ref: http://developer.amd.com/wordpress/media/2008/10/24594_APM_v3.pdf
>
>> In both cases, I'm using gcc 4.9.3. Which is correct for a k8 Opteron 248?
>>
>> Also, FWIW:
>>
>> 1) The march=native version that uses prefetcht0 is very repeatably faster by
>> about 15% in the particular test case I'm looking at.
>>
>> 2) The compilers in both instances are not just the same version, they are
>> the same compiler binary installed on an NFS mount and shared to both
>> computers.
>
> As per GCC4.9.3 source.
>
> (Snip)
> (define_expand "prefetch"
>   [(prefetch (match_operand 0 "address_operand")
>              (match_operand:SI 1 "const_int_operand")
>              (match_operand:SI 2 "const_int_operand"))]
>   "TARGET_PREFETCH_SSE || TARGET_PRFCHW || TARGET_PREFETCHWT1"
> {
>   bool write = INTVAL (operands[1]) != 0;
>   int locality = INTVAL (operands[2]);
>
>   gcc_assert (IN_RANGE (locality, 0, 3));
>
>   /* Use 3dNOW prefetch in case we are asking for write prefetch not
>      supported by SSE counterpart or the SSE prefetch is not available
>      (K6 machines). Otherwise use SSE prefetch as it allows specifying
>      of locality. */
>   if (TARGET_PREFETCHWT1 && write && locality <= 2)
>     operands[2] = const2_rtx;
>   else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
>     operands[2] = GEN_INT (3);
>   else
>     operands[1] = const0_rtx;
> })
> (Snip)
>
> Write prefetch may be requested (either by auto prefetcher or builtins) but
> on -march=native, the below check could have become false.
> else if (TARGET_PRFCHW && (write || !TARGET_PREFETCH_SSE))
> TARGET_PRFCHW is off on native.
>
> So there are two issues here.
>
> (1) ISA flags enabled with -march=k8 is different from -march=native on k8
> machine.
> (2) Need to check why GCC middle end requested write prefetch for the test
> case with -march=k8 .
>
> Regards,
> Venkat.