Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-28 Thread Borislav Petkov
On Tue, Aug 28, 2012 at 12:17:43PM +0300, Jussi Kivilinna wrote: > With this patch twofish-avx is faster than twofish-3way for 256, 1k > and 8k tests. > > size old-vs-new new-vs-3way old-vs-3way > ecb-enc ecb-dec ecb-enc ecb-dec ecb-enc ecb-dec > 256 1.10x 1.11x 1.01x

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-28 Thread Jussi Kivilinna
Quoting Borislav Petkov: On Wed, Aug 22, 2012 at 10:20:03PM +0300, Jussi Kivilinna wrote: Actually it does look better, at least for encryption. Decryption had different ordering for test, which appears to be bad on bulldozer as it is on sandy-bridge. So, yet another patch then :) Here yo

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-23 Thread Borislav Petkov
On Wed, Aug 22, 2012 at 10:20:03PM +0300, Jussi Kivilinna wrote: > Actually it does look better, at least for encryption. Decryption had > different > ordering for test, which appears to be bad on bulldozer as it is on > sandy-bridge. > > So, yet another patch then :) Here you go: [ 153.736745

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-23 Thread Jussi Kivilinna
Quoting Jason Garrett-Glaser: On Wed, Aug 22, 2012 at 12:20 PM, Jussi Kivilinna wrote: Quoting Borislav Petkov: On Wed, Aug 22, 2012 at 07:35:12AM +0300, Jussi Kivilinna wrote: Looks that encryption lost ~0.4% while decryption gained ~1.8%. For 256 byte test, it's still slightly slower t

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-22 Thread Jason Garrett-Glaser
On Wed, Aug 22, 2012 at 12:20 PM, Jussi Kivilinna wrote: > Quoting Borislav Petkov: > >> On Wed, Aug 22, 2012 at 07:35:12AM +0300, Jussi Kivilinna wrote: >>> Looks that encryption lost ~0.4% while decryption gained ~1.8%. >>> >>> For 256 byte test, it's still slightly slower than twofish-3way >>>

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-22 Thread Jussi Kivilinna
Quoting Borislav Petkov: > On Wed, Aug 22, 2012 at 07:35:12AM +0300, Jussi Kivilinna wrote: >> Looks that encryption lost ~0.4% while decryption gained ~1.8%. >> >> For 256 byte test, it's still slightly slower than twofish-3way >> (~3%). For 1k >> and 8k tests, it's ~5% faster. >> >> Here's very

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-22 Thread Borislav Petkov
On Wed, Aug 22, 2012 at 07:35:12AM +0300, Jussi Kivilinna wrote: > Looks that encryption lost ~0.4% while decryption gained ~1.8%. > > For 256 byte test, it's still slightly slower than twofish-3way (~3%). For 1k > and 8k tests, it's ~5% faster. > > Here's very last test-patch, testing different

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-21 Thread Jussi Kivilinna
Quoting Borislav Petkov: > > Here you go: > > [ 52.282208] > [ 52.282208] testing speed of async ecb(twofish) encryption Thanks! Looks that encryption lost ~0.4% while decryption gained ~1.8%. For 256 byte test, it's still slightly slower than twofish-3way (~3%). For 1k and 8k tests, it'

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-20 Thread Borislav Petkov
On Fri, Aug 17, 2012 at 10:37:10AM +0300, Jussi Kivilinna wrote: > I made few further changes, mainly moving/interleaving 'vmovq/vpextrq' > ahead so they should be completed before those target registers are > needed. This only gave 0.5% increase on Sandy-bridge, but might help > more on Bulldozer.

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-17 Thread Jussi Kivilinna
Quoting Borislav Petkov: > > Yep, looks better than the previous run and also a bit better or on par > with the initial run I did. > I made few further changes, mainly moving/interleaving 'vmovq/vpextrq' ahead so they should be completed before those target registers are needed. This only gave 0

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-16 Thread Jussi Kivilinna
Quoting Borislav Petkov: On Wed, Aug 15, 2012 at 08:34:25PM +0300, Jussi Kivilinna wrote: About ~5% slower, probably because I was tuning for sandy-bridge and introduced more FPU<=>CPU register moves. Here's new version of patch, with FPU<=>CPU moves from original implementation. (Note: also

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-16 Thread Borislav Petkov
On Wed, Aug 15, 2012 at 08:34:25PM +0300, Jussi Kivilinna wrote: > About ~5% slower, probably because I was tuning for sandy-bridge and > introduced more FPU<=>CPU register moves. > > Here's new version of patch, with FPU<=>CPU moves from original > implementation. > > (Note: also changes encryptio

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
Quoting Borislav Petkov: > On Wed, Aug 15, 2012 at 05:22:03PM +0300, Jussi Kivilinna wrote: > >> Patch replaces 'movb' instructions with 'movzbl' to break false >> register dependencies and interleaves instructions better for >> out-of-order scheduling. >> >> Also move common round code to separa

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Borislav Petkov
On Wed, Aug 15, 2012 at 05:22:03PM +0300, Jussi Kivilinna wrote: > Patch replaces 'movb' instructions with 'movzbl' to break false > register dependencies and interleaves instructions better for > out-of-order scheduling. > > Also move common round code to separate function to reduce object > size

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
> On Wed, Aug 15, 2012 at 04:48:54PM +0300, Jussi Kivilinna wrote: > > I posted patch that optimize twofish-avx few weeks ago: > > http://marc.info/?l=linux-crypto-vger&m=134364845024825&w=2 > > > > I'd be interested to know, if this patch helps on Bulldozer. > > Sure, can you inline it here to

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Borislav Petkov
On Wed, Aug 15, 2012 at 04:48:54PM +0300, Jussi Kivilinna wrote: > I posted patch that optimize twofish-avx few weeks ago: > http://marc.info/?l=linux-crypto-vger&m=134364845024825&w=2 > > I'd be interested to know, if this patch helps on Bulldozer. Sure, can you inline it here too please. The

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
Quoting Borislav Petkov: Ok, here we go. Raw data below. Thanks a lot! Twofish-avx appears somewhat slower than 3way, ~9% slower with 256byte blocks to ~3% slower with 8kb blocks. Let me know if you need more tests. I posted patch that optimize twofish-avx few weeks ago: http:

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Borislav Petkov
Ok, here we go. Raw data below. On Wed, Aug 15, 2012 at 02:00:16PM +0300, Jussi Kivilinna wrote: > >And if you tell me exactly how to run the tests and on what kernel, > >I'll try to do so. Ok, the box is a single-socket Bulldozer: "AMD FX(tm)-8100 Eight-Core Processor stepping 02"; kernel is 3.6

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
Quoting Borislav Petkov: On Wed, Aug 15, 2012 at 11:42:16AM +0300, Jussi Kivilinna wrote: I started thinking about the performance on AMD Bulldozer. vmovq/vmovd/vpextr*/vpinsr* between FPU and general purpose registers on AMD CPU is a lot slower (latencies from 8 to 12 cycles) than on Intel san

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Borislav Petkov
On Wed, Aug 15, 2012 at 11:42:16AM +0300, Jussi Kivilinna wrote: > I started thinking about the performance on AMD Bulldozer. > vmovq/vmovd/vpextr*/vpinsr* between FPU and general purpose registers > on AMD CPU is a lot slower (latencies from 8 to 12 cycles) than on > Intel sandy-bridge (where instr

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-15 Thread Jussi Kivilinna
Quoting Johannes Goetzfried: This patch adds a x86_64/avx assembler implementation of the Twofish block cipher. The implementation processes eight blocks in parallel (two 4 block chunk AVX operations). The table-lookups are done in general-purpose registers. For small blocksizes the 3way-p