On Tue, Mar 13, 2012 at 3:07 PM, Stroller
<strol...@stellar.eclipse.co.uk> wrote:
>
> On 13 March 2012, at 18:18, Michael Mol wrote:
>> ...
>>> So I assume the i586
>>> version is better for you --- unless GCC suddenly got a lot better at
>>> optimizing code.
>>
>> Since when, exactly? GCC isn't the best compiler at optimization, but
>> I fully expect current versions to produce better code for x86-64 than
>> hand-tuned i586. Wider registers, more registers, crypto acceleration
>> instructions and SIMD instructions are all very nice to have. I don't
>> know the specifics of AES, though, or what kind of crypto algorithm it
>> is, so it's entirely possible that one can't effectively parallelize
>> it except in some relatively unique circumstances.
>
> Do you have much experience of writing assembler?
>
> I don't, and I'm not an expert on this, but I've read the odd blog article on 
> this subject over the years.

Similar level of experience here. I can read it, even debug it from
time to time. Posts from a few regular bloggers on the subject are
like candy to me. I used to have pagetable.org, Ars's Technopaedia
and the spec sheets for early x86 and Motorola processors memorized.
For the past couple of years, I've been focusing on reading the blogs
of language and compiler authors, academics involved in proving,
testing and improving them, etc.

>
> What I've read often has the programmer looking at the assembly gcc generates 
> and examining what it does. The compiler might not care how many registers it 
> uses, and thus a variable might find itself frequently spilled back to RAM; 
> the programmer does not have any control over the compiler, and IIRC some 
> flags reserve a register for debugging (IIRC -fomit-frame-pointer disables 
> this). I think it's possible to use registers more efficiently by swapping 
> them (??) or by using bitwise comparisons and other tricks.

Sure; it's cheaper to null out a register by XORing it with itself
than setting it to 0.
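For instance, gcc already does this on its own; a minimal sketch
(exact output depends on gcc version and flags):

/* Compiled with gcc -O2 on x86-64, this typically becomes just
 *     xor eax, eax
 *     ret
 * i.e. the 2-byte xor idiom instead of the 5-byte "mov eax, 0". */
int always_zero(void)
{
    return 0;
}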

>
> Assembler optimisation is only used on sections of code that are at the core 
> of a loop - that are called hundreds or thousands (even millions?) of times 
> during the program's execution. It's not for code, such as reading the 
> .config file or initialisation, which is only called once. Because the code 
> in the core of the loop is called so often, you don't have to achieve much of 
> an optimisation for the aggregate saving to be considerable.

Sure; optimize the hell out of the code where you spend most of your
time. I wasn't aware that gcc passed up on safe optimization
opportunities, though.
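If you want to check what it's doing to your hot loop, the simplest
thing is to dump the assembly for that one file and read it; a quick
sketch (sum_bytes() is just an illustrative kernel, nothing real):

/* Inspect the generated code with e.g.:
 *   gcc -O2 -march=native -S -fverbose-asm sum.c -o sum.s */
#include <stddef.h>
#include <stdint.h>

uint64_t sum_bytes(const uint8_t *p, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)   /* the core of the loop */
        s += p[i];
    return s;
}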

>
> The operations in question may constitute only a few lines of C, or a 
> handful of machine operations, so it boils down to an algorithm that a human 
> programmer is capable of getting a grip on and comprehending. Whilst 
> compilers are clearly more efficient for large programs, on this micro scale, 
> humans are more clever and creative than machines.

I disagree. With defined semantics for the source and target, a
computer's cleverness is limited only by the computational and memory
expense of its search algorithms. Humans get through this by making a
habit of various optimizations, but those habits become less useful as
additional paths and instructions are added. As system complexity
increases, humans operate on personally cached techniques derived from
simpler systems. I would expect very, very few people to be intimately
familiar with the majority of optimization possibilities present
on an amdfam10 processor or a core2. Compilers aren't necessarily
familiar with them, either; they're just quicker at discovering them,
given knowledge of the individual instructions and the rules of
language semantics.
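The extreme form of "compiler cleverness as search" is
superoptimization: enumerate instruction sequences until one matches
the target function on a set of test inputs. Here's a toy sketch of
the idea; the three-op "ISA" and the multiply-by-5 target are made up
for illustration, and real tools (GNU superopt, for instance) do the
same thing against the actual machine ISA with much better pruning:

#include <stdio.h>
#include <stdint.h>

enum op { OP_ADD_SELF, OP_SHL1, OP_ADD_ORIG, OP_COUNT };
static const char *names[] = { "add r, r", "shl r, 1", "add r, x" };

/* The function we want an instruction sequence for. */
static uint32_t ref(uint32_t x) { return x * 5; }

/* Interpret a candidate program over the toy ISA: one register r,
 * initialized to the input x. */
static uint32_t run(const int *prog, int len, uint32_t x)
{
    uint32_t r = x;
    for (int i = 0; i < len; i++) {
        switch (prog[i]) {
        case OP_ADD_SELF: r += r;  break;
        case OP_SHL1:     r <<= 1; break;
        case OP_ADD_ORIG: r += x;  break;
        }
    }
    return r;
}

static int matches(const int *prog, int len)
{
    static const uint32_t tests[] = { 0, 1, 2, 7, 100, 12345 };
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++)
        if (run(prog, len, tests[i]) != ref(tests[i]))
            return 0;
    return 1;
}

int main(void)
{
    int prog[4];
    /* Try length 1, then 2, ... so the first hit is also the shortest. */
    for (int len = 1; len <= 4; len++) {
        long total = 1;
        for (int i = 0; i < len; i++) total *= OP_COUNT;
        for (long n = 0; n < total; n++) {
            long m = n;
            for (int i = 0; i < len; i++) { prog[i] = m % OP_COUNT; m /= OP_COUNT; }
            if (matches(prog, len)) {
                printf("found a %d-instruction sequence for x*5:\n", len);
                for (int i = 0; i < len; i++) printf("  %s\n", names[prog[i]]);
                return 0;
            }
        }
    }
    puts("no sequence found within the length bound");
    return 0;
}

The cost explodes combinatorially with sequence length, which is
exactly the computational-expense limit I mean above.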

>
> Encryption / decryption is an example of code that lends itself to this kind 
> of optimisation. In particular AES was designed, I believe, to be amenable to 
> implementation in this way. The reason for that was that it was desirable to 
> have it run on embedded devices and on dedicated chips. So it boils down to a 
> simple bitswap operation (??) - the plaintext is modified by the encryption 
> key, input and output as a fast stream. Each byte goes in, each byte goes 
> out, the same function performed on each one.

I'd be willing to posit that you're right here, though if there isn't
a per-byte feedback mechanism, SIMD instructions would come into
serious play. But I expect there's a per-byte feedback mechanism, so
parallelization would likely come in the form of processing
simultaneous streams.
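For what it's worth, AES itself works on 16-byte blocks rather than
single bytes; whether one block feeds into the next depends on the
chaining mode, and that's what decides how much parallelism you can
get. A rough sketch of the difference (block_encrypt() below is a
do-nothing stand-in, not real AES; the structure around it is the
point):

#include <stdint.h>
#include <string.h>

#define BLK 16

/* Stand-in for one AES-128 block encryption -- NOT real AES, just a
 * placeholder so the sketch compiles. */
static void block_encrypt(const uint8_t key[BLK], const uint8_t in[BLK],
                          uint8_t out[BLK])
{
    for (int j = 0; j < BLK; j++)
        out[j] = in[j] ^ key[j];
}

/* CBC encryption: block i is XORed with ciphertext block i-1 before
 * being encrypted, so blocks within one stream can't be processed in
 * parallel. */
static void cbc_encrypt(const uint8_t key[BLK], const uint8_t iv[BLK],
                        const uint8_t *pt, uint8_t *ct, size_t nblocks)
{
    uint8_t chain[BLK];
    memcpy(chain, iv, BLK);
    for (size_t i = 0; i < nblocks; i++) {
        uint8_t tmp[BLK];
        for (int j = 0; j < BLK; j++)
            tmp[j] = pt[i * BLK + j] ^ chain[j];   /* feedback */
        block_encrypt(key, tmp, ct + i * BLK);
        memcpy(chain, ct + i * BLK, BLK);
    }
}

/* CTR mode: each block encrypts an independent counter value, so the
 * iterations below could run in parallel (SIMD, threads, or several
 * streams at once). */
static void ctr_keystream(const uint8_t key[BLK], uint64_t nonce,
                          uint8_t *ks, size_t nblocks)
{
    for (uint64_t i = 0; i < nblocks; i++) {
        uint8_t ctr[BLK] = {0};
        memcpy(ctr, &nonce, sizeof nonce);
        memcpy(ctr + 8, &i, sizeof i);
        block_encrypt(key, ctr, ks + i * BLK);
    }
}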

>
> Another operation that lends itself to assembler optimisation is video 
> decoding - the video is encoded only once, and then may be played back 
> hundreds or millions of times by different people. The same operations must 
> be repeated a number of times on each frame, then circa 25-60 frames are 
> decoded per second, so at least 90,000 frames per hour. Again, the smallest 
> optimisation is worthwhile.

Absolutely. My position, though, is that compilers are quicker and
more capable of discovering optimization possibilities than humans
are, when the target architecture changes. Sure, you've got several
dozen video codecs in, say, ffmpeg, and perhaps it all boils down to
less than a dozen very common cases of inner loop code. With
hand-tuned optimization, you'd need to fork your assembly patch for
each new processor feature that comes out, and then work to find the
most efficient way to execute code on that processor.
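In practice that's why ffmpeg and friends don't fork the code per
CPU: they keep a portable C version next to the hand-tuned ones and
pick an implementation at run time. A rough sketch of that dispatch
pattern (the saturating-byte-add kernel and the names are mine, not
ffmpeg's; __builtin_cpu_supports wants a reasonably recent gcc):

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Portable fallback. */
static void add_sat_u8_c(uint8_t *dst, const uint8_t *a,
                         const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned s = a[i] + b[i];
        dst[i] = s > 255 ? 255 : (uint8_t)s;
    }
}

/* SSE2 version: 16 saturating byte adds per instruction. */
static void add_sat_u8_sse2(uint8_t *dst, const uint8_t *a,
                            const uint8_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
    add_sat_u8_c(dst + i, a + i, b + i, n - i);   /* leftover tail */
}

typedef void (*add_sat_fn)(uint8_t *, const uint8_t *,
                           const uint8_t *, size_t);

/* Choose once, at startup, based on what the CPU actually supports. */
add_sat_fn pick_add_sat(void)
{
    return __builtin_cpu_supports("sse2") ? add_sat_u8_sse2
                                          : add_sat_u8_c;
}

Only the dispatch decision changes when a new instruction set shows
up; you add one more implementation and one more branch in the picker.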

There are also cases where processor features get changed. I don't
remember the name of the instruction (it had something to do with
stack operations) in x86, but Intel switched it from a 0-cycle
instruction to something more expensive. Any code that assumed the
instruction was free suddenly became less efficient. A
compiler (presuming it has knowledge of the target processor's
instruction set properties) would have an easier time coping with that
change than a human would.

I'm not saying humans are useless; this is just one of those areas
which is complex yet deterministic enough that sufficient knowledge
of the source and target environments would give a computer the edge
over a human in finding the optimal sequence of CPU instructions.

-- 
:wq
