My code is not downloadable anywhere that I am aware of. I used to use
it in my quadratic sieve before I switched to Jason P's block Lanczos
code (BL solves a more restricted problem but is fine for the QS). My
old code is some of the earliest mathematical C code that I wrote, so
can certainly be
On Tuesday 20 May 2008, Bill Hart wrote:
> Martin,
>
> On the M4RI website you say that M4R inversion is asymptotically fast
> with time complexity n^3/log_2(n), but Strassen's method has
> complexity n^log2(7), which I would call asymptotically fast.
>
> Do you just mean asymptotically faster tha
Martin,
On the M4RI website you say that M4R inversion is asymptotically fast
with time complexity n^3/log_2(n), but Strassen's method has
complexity n^log2(7), which I would call asymptotically fast.
Do you just mean asymptotically faster than the classical algorithm?
By the way, I wrote some c
On Monday 19 May 2008, Bill Hart wrote:
> You seemed to be getting up to 8% at points there. That's definitely
> worth it. I'll be interested to see this evening how it comes out,
> though I recommend optimising my combine3 function (which I suppose
> should now be combine8), even including it inl
You seemed to be getting up to 8% at points there. That's definitely
worth it. I'll be interested to see this evening how it comes out,
though I recommend optimising my combine3 function (which I suppose
should now be combine8), even including it inline rather than have it
in a separate file.
Of
Bill,
I do get a small speed-up on the Core2Duo for SSE2 but I'm not sure it is
worth the trouble (I agree that it makes the otherwise pretty-looking code
ugly).
I have some timings (for an old implementation) here:
http://trac.sagemath.org/sage_trac/ticket/3204#comment:2
My guess is tha
I had a look at gf2x. The reason they can make use of sse is because
of the shl and shr capabilities. Doing that 128 bits at a time is more
efficient since one operation can be done instead of quite a few using
general purpose regs.
Bill.
On 19 May, 13:52, Bill Hart <[EMAIL PROTECTED]> wrote:
>
Martin,
does your machine speed up at all with any of the SSE code? I spent
considerable time yesterday trying to optimise the combine3 function I
wrote (note: I didn't submit this improved code). Whilst I did speed
it up considerably by removing %'s and /'s and changing the function
prototype to
I figured out why 8 tables is optimal on this machine. The L1 cache is
128KB, and at 2048x2048, 8 Gray tables with k = 5 fit exactly in half
the cache, which is probably optimal.
Bill.
On 19 May, 02:00, Martin Albrecht <[EMAIL PROTECTED]>
wrote:
> On Monday 19 May 2008, Bill Hart wrote:
On Monday 19 May 2008, Bill Hart wrote:
> Well, apparently there are speed gains right up to 8 Gray tables,
> though I really have no idea why that is.
>
> Here are the new times:
>
> Magma Old New
>
> 10000x10000:
> 2.940s 3.13s 2.32s
>
> 16384x16384:
> 9.250s 12.96s 9.17s
>
> 20000x20000:
Well, apparently there are speed gains right up to 8 Gray tables,
though I really have no idea why that is.
Here are the new times:

               Magma    Old      New
10000x10000:   2.940s   3.13s    2.32s
16384x16384:   9.250s   12.96s   9.17s
20000x20000:   16.57s   22.43s   16.49s
32000x32000:   59.1s    90.2s    81.6s   65.7s
I tried copying out the input matrices to the M4RM routine, but only
when the rows aren't all contiguous in memory. This didn't speed
anything up. Of course the reason for that is the second matrix is
only read a handful of times, to construct the Gray tables which are
then used extensively. The f
The copying out makes a 50% difference (it's better with copying) to the
speed of 16384x16384 but no difference to 10000x10000 or 20000x20000.
That's weird.
Bill.
On 18 May, 17:36, Martin Albrecht <[EMAIL PROTECTED]>
wrote:
> Hi,
>
> first, I recorded the different speed-ups in a small table for a
If two Gray tables is better than one, then you can't have enough of a
good thing right? So I made 3 Gray tables now.
The files I modified are in:
http://sage.math.washington.edu/home/wbhart/m4ri3gray/
There are three main modifications:
1) Make all matrices SSE aligned if we have SSE2, and ma
Hi,
first, I recorded the different speed-ups in a small table for an overview in
the attachment (I think we've come a long way :-)) To disable the copying out
one needs to edit
/* we copy the matrix first since it is only constant memory
overhead and improves data locality, if you r
I sorted out what was wrong. In my combine code I was combining
operands in the wrong order. This meant that a whole lot of zeroes
ended up where they shouldn't be, hence the if (x) thing working. So
my times were unfortunately completely screwed up. I went back to a
fresh tarball with the origina
I managed to get the modified version from the spkg. Nice code!!
Unfortunately it is not as fast on my opteron. So more work tomorrow
for me to try and get it down to the same times as I have with my
version.
Here are the times all on my opteron. Note your CTD version was
optimal at a cutoff of
On 18 May, 00:40, Martin Albrecht <[EMAIL PROTECTED]>
wrote:
> My version is here:
>
> http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg
>
> (this needs an updated patch for Sage)
>
> and here:
>
> http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz
>
> (w
On May 17, 2008, at 8:38 PM, Bill Hart wrote:
> Of course one can go too crazy with optimisation.
No surely that never happens around here.
david
To post to this group, send email to sage-devel@googlegroups.com
Here are the times I get with the different cutoffs.

               Magma    M4RI:7200   M4RI:2048
10000x10000:   2.940s   3.442s      4.132s
16384x16384:   9.250s   11.47s      11.80s
20000x20000:   16.57s   19.3s       26.05s
32000x32000:   59.05s   71.9s       71.8s
So it seems when there is not an exact cut, the higher cutoff is
sub
On Sunday 18 May 2008, Bill Hart wrote:
> I don't have the two Gray code tables, so it would be good to get your
> version. Also my code is currently a mess, so it would be good to
> clean it up by merging with a cleaner version (yours). Tomorrow I'll
> check carefully what I've changed and try an
P.S: yes all my times are on a 2.8Ghz Opteron. Cpuinfo says:
[EMAIL PROTECTED]:~/m4ri-20080514/testsuite> cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 2220
stepping: 3
c
I don't have the two Gray code tables, so it would be good to get your
version. Also my code is currently a mess, so it would be good to
clean it up by merging with a cleaner version (yours). Tomorrow I'll
check carefully what I've changed and try and merge the ideas if there
are any you don't hav
> I suppose that this might be due to the ends of rows all being zero as
> they aren't a multiple of 64 bits long. But I checked for 16384x16384
> and we are nearly down to the speed of Magma there too. I just don't
> get it. The coinflip has to be broken I think.
If one uses M4RI with the new pa
Woot!!
On 17 May, 23:46, Martin Albrecht <[EMAIL PROTECTED]>
wrote:
> > Old: 64-bit Debian/GNU Linux, 2.33Ghz Core2Duo
> > Matrix Dimension Magma 2.14-13 (64-bit) M4RI-20080517 (64-bit)
> > 10,000 x 10,000 2.920 3.610
> > 16,384 x 16,384 11.140
> Old: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo
> Matrix Dimension    Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
> 10,000 x 10,000     2.920                    3.610
> 16,384 x 16,384     11.140                   12.120
> 20,000 x 20,000     20.370
I checked the coinflip and it is definitely fine. There is no greater
probability of 6 zeroes in a row than there ought to be. So the
speedup I just reported is quite a mystery.
Bill.
On 17 May, 22:57, Bill Hart <[EMAIL PROTECTED]> wrote:
> I suppose that this might be due to the ends of rows al
I suppose that this might be due to the ends of rows all being zero as
they aren't a multiple of 64 bits long. But I checked for 16384x16384
and we are nearly down to the speed of Magma there too. I just don't
get it. The coinflip has to be broken I think.
Bill.
On 17 May, 22:40, Bill Hart <[EMA
Martin,
Here's a really unusual thing. Perhaps you can confirm this. I get a
20% improvement if I add:
if (x)
{
}
in the three obvious places in the function _mzd_mul_m4rm_impl. This
stops it mzd_combining the zero row.
But I don't understand why this works. The time should be only 1.5%
better
Yet another idea.
Suppose we do not combine entire rows in the Gray table, but only half
rows. Once half a row is bigger than a single cache line (512 bits on
the Opteron) we may as well work with half rows. This allows us to
work with twice as many rows at once in the Gray tables (each of half
t
That's looking good. Would you like me to run it on an unburdened
opteron to see how it goes? If you like you can send me a tarball and
I'll try it out.
I think our best bet for a significant improvement now is the idea of
using two Gray tables of half the size simultaneously. I also realised
it
On Saturday 17 May 2008, Martin Albrecht wrote:
> > I think a better idea would be to explicitly force all matrices and
> > all rows to be 128 bit aligned if the matrices are wide enough to
> > benefit from SSE2, Then the combine function can always use SSE2 and
> > there will be no need to check
> I think a better idea would be to explicitly force all matrices and
> all rows to be 128 bit aligned if the matrices are wide enough to
> benefit from SSE2, Then the combine function can always use SSE2 and
> there will be no need to check for alignment.
That doesn't seem to make a noticeable d
On Saturday 17 May 2008, Bill Hart wrote:
> Martin,
>
> The test code still passes if you change RADIX to 128. I've no idea
> how it passes, but it does. Shame the results are not correct, because
> this speeds the code up by a factor of 2.
Since all routines use the RADIX and I only check if the
Here's another idea which should speed things up a bit.
For 10000x10000 we currently use k = 6. Instead of this, we could use
k = 5 and make two Gray tables simultaneously. This will still fit in
cache.
Instead of doing 6 bits at a time, we can then do 10 bits at a time.
We'd load the appropriate
Martin,
The test code still passes if you change RADIX to 128. I've no idea
how it passes, but it does. Shame the results are not correct, because
this speeds the code up by a factor of 2.
I notice that in the SSE code, you check to see if alignment can be
achieved, otherwise it doesn't use SSE.
Hi Martin,
Here is another 10% improvement. In the loop at the bottom of
mzd_combine you can explicitly unroll by a factor of 8:
word * end = b1_ptr + wide;
register word * end8 = end - 8;
while (b1_ptr < end8)
{
*(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
*(b1_ptr
Yes, of course, that is it. The Opteron can perform an MMX instruction
at the same time as an integer instruction (even an SSE instruction if
need be). We just need to set it up so that instructions get
interleaved between the two units.
Probably the reason Magma jumps nearly a factor of 2 at tha
On Saturday 17 May 2008, Bill Hart wrote:
> In going from 5000x5000 to 10000x10000 Magma's time increases by a
> factor of less than 4. That is impossible. Strassen will never help us
> there. They must be doing something else. Probably something clever.
>
> Bill.
I was stuck there too yesterday
In going from 5000x5000 to 10000x10000 Magma's time increases by a
factor of less than 4. That is impossible. Strassen will never help us
there. They must be doing something else. Probably something clever.
Bill.
On 17 May, 02:08, Bill Hart <[EMAIL PROTECTED]> wrote:
> That may not be necessary.
That may not be necessary. It may only be necessary to know the size
of the L1 and L2 caches and then it is probably possible to figure out
the optimal values. Probably it is something like 2^k rows of B must
fit in half the L1 cache and BLOCK rows of A must fit in half the L1
cache.
I'm not sure
On 16 May, 23:05, Martin Albrecht <[EMAIL PROTECTED]>
wrote:
> On Friday 16 May 2008, Bill Hart wrote:
> > The other possibility is that Magma combines the two algorithms so
> > that there is even greater usage of the Gray code tables. This would
> > be an ugly hack, but could work.
>
> Are you
On May 17, 2:48 am, Bill Hart <[EMAIL PROTECTED]> wrote:
> Hi Martin,
>
> That spike is weird. Basically I got closer to 2x the time of Magma
> for 16384x16384, but you need different parameters than for
> 10000x10000 or 20000x20000 since the size of the M4R matrices will be
> different than in
Hi Martin,
That spike is weird. Basically I got closer to 2x the time of Magma
for 16384x16384, but you need different parameters than for
10000x10000 or 20000x20000 since the size of the M4R matrices will be
different than in either of those cases.
For 16384x16384 my notes say that I used k = 6
On Friday 16 May 2008, Bill Hart wrote:
> Here are the times I get for Magma vs M4RI now. Note the crossover
> between the two programs is now above about 5000 and M4RI beats Magma
> below that point. This suggests the remaining factor of 2 is in the
> Strassen-Winograd function. Probably Winograd
Here are the times I get for Magma vs M4RI now. Note the crossover
between the two programs is now above about 5000 and M4RI beats Magma
below that point. This suggests the remaining factor of 2 is in the
Strassen-Winograd function. Probably Winograd-Strassen is falling out
of L2 cache (the previo
On Friday 16 May 2008, Bill Hart wrote:
> P.S: I also tried cache hints, but no luck. They just slow it down.
>
> Bill.
Bill, thanks so much for looking into this! It is very much appreciated. I'm
going to read/try your patch right away!
Martin
PS: I have a slight delay in my replies because m
P.S: I also tried cache hints, but no luck. They just slow it down.
Bill.
On 16 May, 15:59, Bill Hart <[EMAIL PROTECTED]> wrote:
> I tried cache blocking matrix B, but the times are exactly the same. I
> think the processor is happy to keep loading B one row at a time
> sequentially, and since i
I tried cache blocking matrix B, but the times are exactly the same. I
think the processor is happy to keep loading B one row at a time
sequentially, and since it is only done a handful of times in the
algorithm, it accounts for little of the runtime.
It might go faster in combination with storin
Here is the modified _mzd_mul_m4rm_impl function which gives this
speedup:
http://sage.math.washington.edu/home/wbhart/m4rmul.c
I used a crossover of 3600 and I indicate at the top of this file how
constants should be changed to get the speedups for the various values
of n. I didn't put any code
10000x10000 is down to 4.5s now, so nearly a 2x speedup.
20000x20000 is down to 32.0s, so again nearly a 2x speedup.
Bill.
On 16 May, 13:53, Bill Hart <[EMAIL PROTECTED]> wrote:
> I made some changes to the original code so it would use the cache a
> bit better. The test code seems to pass, so
I made some changes to the original code so it would use the cache a
bit better. The test code seems to pass, so I don't think I've screwed
anything up.
The time for 16384x16384 on my machine is now 20s, so a factor of
2.15x faster. The time for 2000x2000 also seems to be the same time as
Magma n
I think it might just be possible to get down to the speed of Magma
with a highly optimised classical multiplication routine. At 3600x3600
one clearly has to do 3600x3600 scalar products of a row by a column.
We'll assume one of the matrices has been transposed to facilitate
this.
If we use SSE2
On Thursday 15 May 2008, Bill Hart wrote:
> Here is the graph of Magma times:
>
> http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png
>
> The crossover is not clear. The change from a smooth curve to a
> squiggly line is about 3600. So presumably that is it, but the graph
> al
Here is the graph of Magma times:
http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png
The crossover is not clear. The change from a smooth curve to a
squiggly line is about 3600. So presumably that is it, but the graph
also seems to change character at about 6200 or 7000 as
Oh, actually I have no idea where Magma's crossover is. I'll wager it
is somewhere between 4000x4000 and 6000x6000, but let's not speculate.
I'll try and work it out with some timings.
Bill.
On 15 May, 22:23, Martin Albrecht <[EMAIL PROTECTED]>
wrote:
> On Thursday 15 May 2008, Bill Hart wrote:
On Thursday 15 May 2008, Bill Hart wrote:
> Martin,
>
> Do you think Magma uses naive multiplication for its base case? This
> can be ridiculously fast, especially over GF2. I note for example
> that Magma's base case is about 6 times faster than M4RI at 1000x1000.
> Is it possible that the naive
Martin,
Do you think Magma uses naive multiplication for its base case? This
can be ridiculously fast, especially over GF2. I note for example
that Magma's base case is about 6 times faster than M4RI at 1000x1000.
Is it possible that the naive multiplication can just be optimised
with a far bett
On Thursday 15 May 2008, Bill Hart wrote:
> Hi Martin,
>
> Here is a run that illustrates the problem. Am I doing something
> wrong?
No, I was stupid. The cpucycles are printed as %u but they should be printed
as %llu since they are longer than an int. I've attached the fixed C file
(since it is
Hi Martin,
Here is a run that illustrates the problem. Am I doing something
wrong?
[EMAIL PROTECTED]:~/m4ri-20080514/testsuite> time ./bench_multiplication 5000 2048
n: 5000, cutoff: 2048, cpu cycles: 2670143764

real    0m2.768s
user    0m2.760s
sys     0m0.008s
[EMAIL PROTECTED]:~/m4ri-2008
> > It'll run the experiment 17 times (17 for no particular reason) and
> > return the median runtime. -n is the size and -c the cutoff.
> Median? Shouldn't you take the minimum? Are there any good papers
> on benchmarking?
I'm using cpucycles from:
http://www.ecrypt.eu.org/ebats/cpucycles.ht
On Thu, May 15, 2008 at 10:42 AM, Martin Albrecht
<[EMAIL PROTECTED]> wrote:
>
> On Thursday 15 May 2008, Bill Hart wrote:
>> Hi Martin,
>>
>> The cycle counter in your bench code seems to give random values on
>> the 2.4GHz Opteron with SUSE 9 linux that I have access to, which has
>> Magma 2-14.
On Thursday 15 May 2008, Bill Hart wrote:
> Hi Martin,
>
> The cycle counter in your bench code seems to give random values on
> the 2.4GHz Opteron with SUSE 9 linux that I have access to, which has
> Magma 2-14.10 on it.
>
> Anyhow, here are the Magma times:
> > A := RandomMatrix(GF(2),10^4,10^4)
> Maybe you're doing something wrong?
This bug is fixed in:
http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080515.p0.spkg
Cheers,
Martin
--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_
Hi Martin,
The cycle counter in your bench code seems to give random values on
the 2.4GHz Opteron with SUSE 9 linux that I have access to, which has
Magma 2-14.10 on it.
Anyhow, here are the Magma times:
> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> t := Cputime(
On Thursday 15 May 2008, William Stein wrote:
> > Btw. I don't have access to Magma 2.14 which I believe to be the fastest
> > in linear algebra over GF(2). In version 2.14 they added SSE2 support
> > too. So if anybody could compare the new M4RI with Magma 2.14 I'd be
> > happy to hear about the
> Btw. I don't have access to Magma 2.14 which I believe to be the fastest in
> linear algebra over GF(2). In version 2.14 they added SSE2 support too.
> So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear
> about the results. I don't know what else to compare with except