Re: [Python-Dev] Optionally using GMP to implement long if available

2008-11-10 Thread Nick Craig-Wood
Tim Peters <[EMAIL PROTECTED]> wrote:
> > And for 32x32 -> 64, can't this simply be replaced by "(uint64_t)i * j",
> > where uint64_t is as in C99?  I'd hope that most compilers would
> > manage to turn this into the appropriate 32x32-bit hardware multiply.
> 
>  1. That's C99, not C89, and therefore less portable.
> 
>  2. On platforms that support it, this is at least 64x64->64 multiplication,
> potentially much more expensive than the 32x32->64 (or 31x31->62?)
> flavor you /intend/ to move to.
> 
>  3. There's no way to know exactly what compilers will do with this short of
> staring at generated code.

I've relied in the past for the compiler generating a 32*32->64 bit
multiply for this code

#include 

uint64_t mul(uint32_t a, uint32_t b)
{
return a*b;
}

Looking at the assembler it produces (x86)

mul:
pushl   %ebp
xorl%edx, %edx
movl%esp, %ebp
movl12(%ebp), %eax
imull   8(%ebp), %eax
popl%ebp
ret

Which I'm pretty sure is a 32x32->64 bit mul (though my x86 assembler
foo is weak).

I think a compiler would have to be pretty stupid not to take this
optimisation... But then there are some pretty stupid compilers out
there!

-- 
Nick Craig-Wood <[EMAIL PROTECTED]> -- http://www.craig-wood.com/nick
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Optionally using GMP to implement long if available

2008-11-10 Thread Mark Dickinson
On Mon, Nov 10, 2008 at 4:26 PM, Nick Craig-Wood <[EMAIL PROTECTED]> wrote:
>
> Looking at the assembler it produces (x86)
>
> mul:
>pushl   %ebp
>xorl%edx, %edx
>movl%esp, %ebp
>movl12(%ebp), %eax
>imull   8(%ebp), %eax
>popl%ebp
>ret
>
> Which I'm pretty sure is a 32x32->64 bit mul (though my x86 assembler
> foo is weak).

My x86 assembler is also weak (or perhaps I should say nonexistent),
but I think this does exactly what the C standard says it should: that is,
it returns just the low 32-bits of the product.

Looking at the assembler, I think the imull does a 32-bit by
32-bit multiply and puts its (truncated) result back into the 32-bit
register eax.  I'd guess that the 64-bit result is being returned to
the calling routine in registers edx (high 32 bits) and eax (low 32 bits);
this explains why edx has to be zeroed with the 'xorl' instruction.

And if we were really expecting a 64-bit result then there should
be an unsigned multiply (mull) there instead of a signed multiply
(imull);  of course they're the same modulo 2**32, so for a 32-bit
result it doesn't matter which is used.

Mark
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Optionally using GMP to implement long if available

2008-11-10 Thread Mark Dickinson
Here's the test code I was using, modeled on the basic operation
that longobject needs:  multiply two digits together and extract
high and low digits from the result.


typedef uint32_t digit;
typedef uint64_t twodigits;
#define SHIFT 30
#define MASK (((digit)1 << SHIFT) - 1)

/* multiply a and b, and split into high digit (returned)
   and low digit (put into *low) */

extern digit
digit_mul(digit *low, digit a, digit b) {
twodigits prod;
prod = (twodigits)a * b;
*low = (digit)(prod & MASK);
return (digit)(prod >> SHIFT);
}

Using gcc 4.0.1 on 32-bit x86 and compiling with "gcc -O1 -S test.c"
gives, for me, a file test.s that looks like:

.text
.globl _digit_mul
_digit_mul:
pushl   %ebp
movl%esp, %ebp
pushl   %esi
subl$4, %esp
movl16(%ebp), %eax
mull12(%ebp)
movl%eax, %esi
andl$1073741823, %esi
movl8(%ebp), %ecx
movl%esi, (%ecx)
shrdl   $30, %edx, %eax
addl$4, %esp
popl%esi
leave
ret
.subsections_via_symbols

There's only a single mull instruction there, showing that gcc
is doing the right thing, at least when optimization is turned on.
Without optimization, gcc produces three separate
multiplications, two of which are multiplications
by zero.

But if I compile with the -m64 flag to force 64-bit then the
multiply becomes:

imulq   %rsi, %rdx

which looks a lot like a 64 x 64 -> 64 multiply to me.  This
seems inefficient, when a 32 x 32 -> 64 bit multiply ought
to be good enough.  But maybe there isn't a significant
performance difference on x86_64?

Mark
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com