Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
Am 30.12.21 um 08:23 schrieb Alexey Tor. via lazarus: New unit test, with Martin's integrated. If I play with godbolt, Ryzen zen3 (ryzen 5x00X) is nearly twice as fast in cycles as my Ivy Bridge, so I would like to see some benchmarks from various processors. Also from very old ones (P4 and Clawhammers) to test instruction sets. Project utf8lentest raised exception class 'External: SIGSEGV'. In file 'utf8lentest.lpr' at line 89: movdqu xmm0, [rcx] OS: Linux x64. CPU: Linux uses different calling conventions, please check with the patch below. vendor_id = "GenuineIntel" (simple synth) = Intel Core (unknown type) (Sandy Bridge D2/J1/Q0) {Sandy Bridge}, 32nm 15c15 < {define asmdebug} --- > { $define asmdebug} 46c46 < function asmutf8length(const s : pchar;len:integer):int64; --- > function asmutf8length(const s : pchar;len:int64):int64;assembler;nostackframe; 49d48 < begin 52c51 < mov r8,rdx --- > mov r8,len 89c88 < movdqu xmm0, [rcx] --- > movdqu xmm0, oword ptr [s] 95c94 < add rcx,16 --- > add s,16 128c127 < movzx r8d, byte [rcx]// unaligned bytes after sse loop --- > movzx r8d, byte [s] // unaligned bytes after sse loop 135c134 < inc rcx --- > inc s 140c139 < end['xmm5','xmm6']; // volatile registers used. --- > ret -- ___ lazarus mailing list lazarus@lists.lazarus-ide.org https://lists.lazarus-ide.org/listinfo/lazarus
Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 30-12-2021 10:15, Florian Klämpfl via lazarus wrote: Linux uses different calling conventions, please check with the patch below. Linux is quite generous with the volatile registers, so luckily it matches quite closely. I first tried the approach of your patch, but [s] has problems on windows, so would require ifdef on every "s"use, so I simply move [s] to rcx {$ifndef Windows} // we can't use [s] as an alias for the pointer parameter, because the non assembler procedure on Windows // changes that into a stack reference. FPC doesn't support non volatile frame management for assembler procs like Delphi does. mov rcx,s // rdi mov edx,len // rsi {$endif} and the ifdeffing of the assembler procedure on linux vs inline asm block on Windows. Then it works on Linux x86_64. Funnily, our server AMD Athlon 200GE (Zen1, 3.2GHz?) nearly the exact same timings as my i7-3770 3.4GHz I did some other minor work after last post, so here is now the entire program:// // (C) 2021 Martin Friebe and Marco van de Voort. // attempt to accelerate utf8lengthfast which is a length(s) in utf8 codepoints without integrity checking // // 4 versions. // - Original, // - with popcount and // - the "add" variant that accumulates 127 iterations of ptrints and only adds // the intermeidates outside that loop // - a SSE2 version loosely inspired by the add variant combined with //the core of an existing (branchless) binarization routine for the main loop. {$mode objfpc}{$H+} {$asmmode intel} {$coperators on} {define asmdebug} uses SysUtils,StrUtils; const mask3 : array[0..15] of byte = ( $C0,$C0,$C0,$C0, $C0,$C0,$C0,$C0, $C0,$C0,$C0,$C0, $C0,$C0,$C0,$C0); mask4 : array[0..15] of byte = ( $80,$80,$80,$80, $80,$80,$80,$80, $80,$80,$80,$80, $80,$80,$80,$80); mask2 : array[0..15] of byte = ( $1,$1,$1,$1, $1,$1,$1,$1, $1,$1,$1,$1, $1,$1,$1,$1); // Integer arguments are passed in registers RCX, RDX, R8, and R9. // Floating point arguments are passed in XMM0L, XMM1L, XMM2L, and XMM3L. // volatile: RAX, RCX, RDX, R8, R9, R10, R11 // nonvolatile RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are considered nonvolatile // volatile xmm0-xmm3 (params) en xmm4,5 // https://msdn.microsoft.com/en-us/library/ms235286.aspx {$ifdef asmdebug} function asmutf8length(const s : pchar;len:integer,res:pbyte):int64; {$else} function asmutf8length(const s : pchar;len:integer):int64;{$ifndef Windows}assembler; nostackframe;{$endif} {$endif} {$ifdef Windows} begin {$endif} asm // tuning for short strings: // -- {$ifndef Windows} // we can't use [s] as an alias for the pointer parameter, because the non assembler procedure on Windows // changes that into a stack reference. FPC doesn't support non volatile frame management for assembler procs like Delphi does. mov rcx,s // rdi mov edx,len // rsi {$endif} test rax,rax je @theend cmp rdx,128 // threshold between long and short. jl @restbytes mov rax,rdx mov r10,rcx and r10,15 mov r9,16 sub r9,r10 and r9,15 test r9,r9 je @nopreloop sub rdx,r9 @preloop: // roughly 2 cycles per iteration on ivy bridge movzx r11d, byte [rcx]// unaligned bytes after sse loop mov r10,r11 shr r10,7 not r11 shr r11,6 and r10,r11 sub rax,r10 inc rcx dec r9 jne @preloop @nopreloop: mov r9,rdx and r9,15 shr rdx,4 pxor xmm5,xmm5 // always zero pxor xmm6,xmm6 // dword counts // using broadcast etc raises requirements? -> use constant loads. movdqu xmm1,[rip+mask3] movdqu xmm2,[rip+mask4] movdqu xmm3,[rip+mask2] test rdx,rdx je @restbytes @outer: mov r10,127 // max iterations per inner loop cmp r10,rdx // more or less left? jl @last // more mov r10,rdx // less @last: sub rdx,r10// iterations left - iterations to do pxor xmm4,xmm4 // process 127 iterations (limit of signed int8) @inner:// +/- 2.2 cycles per iteration for 16 bytes on ivy bridge movdqu xmm0, [rcx] pand xmm0,xmm1 // mask out top 2 bits pcmpeqb xmm0,xmm2// compare with $80. pand xmm0,xmm3 // change to $1 per byte. paddb xmm4,xmm0 // add to cumulative add rcx,16 dec r10 jne @inner // SSSE3 vertical adds might help this, but increase CPU reqs. movdqa xmm0,xmm4 PUNPCKLBW xmm0,xmm5 // zero extend to words PUNPCKHBW xmm4,xmm5 paddw xmm0,xmm4
Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
Perhaps usefui test information from my PC: ** [john1@manjaro sdb2]$ ./utf8lentest 234526968 fst:128406168 pop:128406168 add:128406168 asm:128406168 29315871 fst 1365 fst 1367 fst 1366 fst 1366 pop 9990 pop 9990 pop 9997 pop 9981 add 1386 add 1382 add 1386 add 1390 asm 346 asm 346 asm 346 asm 349 fst 1357 fst 1368 fst 1372 fst 1371 pop 10681 pop 6886 pop 6895 pop 6916 add 1247 add 1248 add 1250 add 1248 asm 295 asm 291 asm 291 asm 293 [john1@manjaro sdb2]$ [john1@manjaro sdb2]$ inxi -F System: Host: manjaro Kernel: 5.10.84-1-MANJARO x86_64 bits: 64 Desktop: Xfce 4.16.0 Distro: Manjaro Linux Machine: Type: Laptop System: LENOVO product: 81RS v: Lenovo Yoga S740-14IIL serial: Mobo: LENOVO model: LNVNB161216 v: SDK0J40709 WIN serial: UEFI: LENOVO v: BYCN39WW date: 05/28/2021 Battery: ID-1: BAT0 charge: 62.4 Wh (95.6%) condition: 65.3/62.0 Wh (105.3%) CPU: Info: quad core model: Intel Core i7-1065G7 bits: 64 type: MT MCP cache: L2: 2 MiB Speed (MHz): avg: 3520 min/max: 400/3900 cores: 1: 3543 2: 3890 3: 2319 4: 3513 5: 3709 6: 3650 7: 3792 8: 3749 Graphics: Device-1: Intel Iris Plus Graphics G7 driver: i915 v: kernel Device-2: NVIDIA GP108M [GeForce MX250] driver: nvidia v: 495.44 Device-3: Chicony Integrated Camera type: USB driver: uvcvideo Display: x11 server: X.Org 1.21.1.2 driver: loaded: modesetting,nvidia unloaded: nouveau resolution: 1: 1920x1080~60Hz 2: 1920x1080~60Hz Message: Unable to show advanced data. Required tool glxinfo missing. Audio: Device-1: Intel Ice Lake-LP Smart Sound Audio driver: sof-audio-pci Sound Server-1: ALSA v: k5.10.84-1-MANJARO running: yes Sound Server-2: PipeWire v: 0.3.40 running: yes Network: Device-1: Intel Ice Lake-LP PCH CNVi WiFi driver: iwlwifi IF: wlp0s20f3 state: up mac: 04:33:c2:02:de:51 Device-2: Realtek RTL8153 Gigabit Ethernet Adapter type: USB driver: r8152 IF: enp0s13f0u1u4 state: up speed: 1000 Mbps duplex: full mac: 4c:e1:73:42:1f:6b IF-ID-1: pan1 state: down mac: 7a:5c:6a:f4:06:56 Bluetooth: Device-1: Intel AX201 Bluetooth type: USB driver: btusb Report: rfkill ID: hci0 state: up address: see --recommends Drives: Local Storage: total: 1.86 TiB used: 317.16 GiB (16.7%) ID-1: /dev/nvme0n1 vendor: Micron model: MTFDHBA1T0TCK size: 953.87 GiB ID-2: /dev/sda type: USB vendor: Western Digital model: WD10EARX-00N0YB0 size: 931.51 GiB ID-3: /dev/sdb type: USB vendor: Kingston model: DataTraveler 2.0 size: 14.54 GiB Partition: ID-1: / size: 57.9 GiB used: 35.88 GiB (62.0%) fs: ext4 dev: /dev/nvme0n1p8 ID-2: /boot/efi size: 259.5 MiB used: 114.1 MiB (44.0%) fs: vfat dev: /dev/nvme0n1p1 Swap: ID-1: swap-1 type: partition size: 16.67 GiB used: 0 KiB (0.0%) dev: /dev/nvme0n1p9 Sensors: System Temperatures: cpu: 58.0 C mobo: N/A Fan Speeds (RPM): N/A Info: Processes: 289 Uptime: 9m Memory: 15.2 GiB used: 2.19 GiB (14.4%) Shell: Bash inxi: 3.3.11 * Am 30.12.21 um 13:58 schrieb Marco van de Voort via lazarus: On 30-12-2021 10:15, Florian Klämpfl via lazarus wrote: Linux uses different calling conventions, please check with the patch below. Linux is quite generous with the volatile registers, so luckily it matches quite closely. I first tried the approach of your patch, but [s] has problems on windows, so would require ifdef on every "s"use, so I simply move [s] to rcx {$ifndef Windows} // we can't use [s] as an alias for the pointer parameter, because the non assembler procedure on Windows // changes that into a stack reference. FPC doesn't support non volatile frame management for assembler procs like Delphi does. mov rcx,s // rdi mov edx,len // rsi {$endif} and the ifdeffing of the assembler procedure on linux vs inline asm block on Windows. Then it works on Linux x86_64. Funnily, our server AMD Athlon 200GE (Zen1, 3.2GHz?) nearly the exact same timings as my i7-3770 3.4GHz I did some other minor work after last post, so here is now the entire program: -- ___ lazarus mailing list lazarus@lists.lazarus-ide.org https://lists.lazarus-ide.org/listinfo/lazarus
Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 30-12-2021 14:17, John Landmesser via lazarus wrote: Perhaps usefui test information from my PC: 77 Compile with -O4 -Cpcoreavx2 , the others (non asm) will become faster, my guess is "add" will be about double of asm. Also, on windows "high performance" as power scheme. On non windows try to disable power saving for a short while another way. If the first and second runs deviate it is probably the CPU throttling up. -- ___ lazarus mailing list lazarus@lists.lazarus-ide.org https://lists.lazarus-ide.org/listinfo/lazarus
Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]
On 30/12/2021 14:43, Marco van de Voort via lazarus wrote: Compile with -O4 -Cpcoreavx2 , the others (non asm) will become faster, my guess is "add" will be about double of asm. Core I7 8700K 3.3.1 from Dec 10th 3.2.3 from Dec 9th With fpc 3.3.1: - fst is worse? - add gets better -O4 -Cpcoreavx2 fpc 3.2.3 / fpc 3.3.1 fst 594 fst 688 fst 578 fst 703 fst 578 fst 687 fst 562 fst 688 pop 485 pop 485 pop 500 pop 500 pop 500 pop 484 pop 484 pop 500 add 594 add 422 add 578 add 438 add 578 add 437 add 594 add 453 asm 250 asm 250 asm 250 asm 250 asm 250 asm 250 asm 250 asm 266 fpc 3.2.3 -O4 -Cpcoreavx -O4 -CpCOREI fst 594 fst 593 fst 578 fst 579 fst 578 fst 562 fst 594 fst 578 pop 500 pop 500 pop 515 pop 500 pop 500 pop 500 pop 485 pop 485 add 593 add 593 add 579 add 578 add 578 add 594 add 593 add 594 asm 250 asm 250 asm 250 asm 250 asm 235 asm 250 asm 250 asm 250 -- ___ lazarus mailing list lazarus@lists.lazarus-ide.org https://lists.lazarus-ide.org/listinfo/lazarus
[Lazarus] More Webassembly...
Hello, I've just committed to the FPC git repo the wasmtime unit. This unit can be used to load and use the wasmtime library. Using this library you can compile, load and run Webassembly files embedded in your FPC program on all FPC & webassembly supported native platforms. There are 4 demo programs that show how this can be done. This means that you can now have a FPC-compiled webassembly module and run it in the browser or in a native binary. Additionally, all webassembly modules out there should also be callable from both native FPC and Pas2JS programs. Enjoy ! Michael. -- ___ lazarus mailing list lazarus@lists.lazarus-ide.org https://lists.lazarus-ide.org/listinfo/lazarus