[I'm answering these out of order--sorry. Already answered elsewhere bits 
snipped]
At 02:28 PM 9/5/2001 +0200, Paolo Molaro wrote:
>On 09/04/01 Dan Sugalski wrote:
> > > More on this point later in the mail.
> > >There's a reason for that: register virtual machines are more complex
> > >and more difficult to optimize.
> >
> > You'd be surprised there. The core of a register VM and a stack VM are
> > equivalent in complexity. A register VM makes reordering ops easier, tends
> > to reduce the absolute number of memory accesses (there's a *lot* of stack
> > thrash that goes on with a stack machine), and doesn't seem to have much
> > negative impact on things.
>
>When done in software, the main difference is that stack-based machines
>have a linear stack and the logic is simple; with register-based machines
>you need to care for register windows.

This is a bit misleading.

If you assume a contiguous stack for both the register frames and the stack 
machine, then actually switching frames is cheaper on the register system 
since you only do it occasionally, as opposed to once per op.

If you assume a non-contiguous stack for the register frames and the stack 
machine, then switching frames gets proportionately more expensive for the 
stack machine, since the cost per switch goes up and you have far more 
switches.

Non-contiguous stacks make threading easier, which is a not-insignificant 
side benefit. You don't need to preallocate X bytes of stack per thread and 
then either waste the memory or die because you ran out of stack.

Yes, the compiler needs to emit window-change opcodes, but that's a 
long-solved problem. In some ways we have it easier, since in most cases 
the compiler doesn't have to go pushing individual registers around. (We 
essentially have bulk renaming at hand.)
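
To make the bookkeeping difference concrete, here's a minimal C sketch 
(illustrative types and names, not Parrot's actual code). The register 
machine bumps a frame pointer once per call; the stack machine bumps its 
stack pointer on nearly every op:

    /* Register machine: registers live in windows; the only frame
       bookkeeping is a pointer bump at call boundaries. */
    typedef struct {
        long ints[32];              /* float and PMC banks elided */
    } RegWindow;

    RegWindow windows[256];         /* the contiguous case */
    RegWindow *frame = windows;

    void call_enter(void) { frame++; }      /* once per call   */
    void call_exit(void)  { frame--; }      /* once per return */

    /* Stack machine: the stack pointer moves on nearly every op. */
    long stack[1024];
    long *sp = stack;

    void push(long v) { *sp++ = v; }        /* per operand */
    long pop(void)    { return *--sp; }     /* per result  */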

>As for memory accesses, stack-based machines always access
>the top items of the stack (which fit in 1-2 cache lines), while
>
>         add_i N1, N15, N31
>
>may touch a lot more memory (especially if the int registers are int64).
>Also, since the indexes for the add_i ops can be any integer, the compiler
>can't optimize much, while in a stack based machine they are hardcoded
>(0 and 1, for example).

We can ignore register sizes, since for a fair comparison we'll be looking 
at the same-size data on both the stack and register machines. With the 
register system you do hop between cache lines more, but once a line has 
been touched the cost is the same.

It may look like we touch more lines, since we cover a larger spread in one 
go while stack systems tend to jitter on a few, but that's not the case. 
The stack machine tends to do more named_memory<->stack transfers. In the 
code I've rolled by hand, the register model generally accesses a small 
number of cache lines a lot and a large number of lines very rarely. Stack 
machines, on the other hand, access a smaller number of lines even more 
frequently and a large number of lines occasionally.

Locality of reference works out to be about the same. With more than a 
half-dozen cache lines most code works out the same both ways, with a minor 
win towards register systems. (Minor enough that I'm willing to chalk it up 
to personal style) However, the stack system does *far* more bookkeeping 
access (that darned stack pointer gets updated every op) and executes more 
opcodes.
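
To put that in concrete terms, here's something as small as $a = $b + $c 
hand-compiled both ways (a sketch, not real compiler output; the register 
numbers are arbitrary):

    # Stack machine: four op dispatches, four stack-pointer updates
    push $b
    push $c
    add             # pop two, push one
    pop  $a

    # Register machine: one dispatch, no stack-pointer traffic
    add_i N1, N15, N31

Same work, same operands touched; a quarter the dispatches and none of the 
pointer bookkeeping.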

If we were targeting machines with tiny or non-existent caches, I'd say 
stack machines would get the nod. Nobody's built one of those in several 
years, though--even the Pentium II meets our criteria, last I looked. D'you 
really think targeting consumer-grade CPUs made no later than 1997 (or was 
it '96?) is worthwhile? My sweet spot's newer than that. Given an expected 
first release sometime in 2002, I don't see "normal" consumer-grade 
hardware made in 1998 as an unreasonable minimum requirement.

Which is to say, if it runs well on my Celeron at home, I'm fine with it. 
(The machine dates to October 1998 and was one of Compaq's Presario 
consumer-grade boxes; it cost all of $700 at the time, and was *not* 
high-end, or even medium-end, hardware for its day.) Older hardware can cope.

> > What, you mean the x86? The whole world doesn't suck. Alpha, Sparc, MIPS,
> > PA-RISC, and the PPC all have a reasonable number of registers, and by all
> > accounts the IA64 does as well. Besides, you can think of a register
>
>I would expect that from a CS guy! ;-)

I'll find one for 'ya. I'm an engineer. :-P

>My point is: optimize for the common
>usage patterns/environments and that, as much as we may not like it, is
>32 bit x86. Parrot may run super-fast on alpha, but if people have x86
>machines and servers, that won't be any good for them. The engine needs
>to run well on all the archs, but don't optimize for the special cases.

You underestimate 32-bit x86 systems, I think. But between a model that 
runs OK everywhere and one that runs OK in some places and better in 
others, the choice isn't tough.

Also don't underestimate the rate of change. The current mid-grade x86 
hardware is more than up for the model. (And, when you get right down to 
it, the model looks the same--a register system can be easily viewed as a 
stack system with easy back-addressing into the stack)
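
A two-line sketch in C of what I mean by that (the names are purely 
illustrative):

    long frame[32];         /* one stack region / register window */
    int  top = 0;

    /* Stack view: ops can only name the top of the region. */
    long tos(void)   { return frame[top - 1]; }

    /* Register view: ops can name any slot directly. */
    long reg(int n)  { return frame[n]; }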

> > > Literature is mostly concerned
> > >about getting code for real register machines out of trees and DAGs and
> > >the optimizations are mostly of two types:
> > >1) general optimizations that are independent of the actual cpu
> > >2) optimizations specific to the cpu
> > >
> > >[1] can be done in parrot even if the underlying virtual machine is
> > >register or stack based, it doesn't matter.
> >
> > Actually it does. Going to a register machine's generally more
> > straightforward than going to a stack based one. Yes, there are register
>
>What is easier than a post-order walk of a tree?

easy != fast. Yes, we need more than a naive post-order tree walk. So? 
We're not breaking any new ground; there's plenty of literature and code 
around that deals with this.
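
For what it's worth, the register version of that walk is barely harder. A 
minimal sketch, assuming a toy Node type (this is not Parrot's compiler 
interface, and 'set'/'push'/'add' are placeholder op names): the stack 
emitter is the classic stateless post-order walk, and the register emitter 
is the same walk carrying a next-free-register counter:

    #include <stdio.h>

    typedef struct Node {
        struct Node *l, *r;     /* leaf when l is NULL */
        int value;
    } Node;

    /* Stack machine: pure post-order, no state at all. */
    void emit_stack(Node *n) {
        if (!n->l) { printf("push %d\n", n->value); return; }
        emit_stack(n->l);
        emit_stack(n->r);
        printf("add\n");
    }

    /* Register machine: the same walk plus one counter.  Returns
       the register holding the subtree's result. */
    int emit_reg(Node *n, int next) {
        int left, right;
        if (!n->l) { printf("set I%d, %d\n", next, n->value); return next; }
        left  = emit_reg(n->l, next);
        right = emit_reg(n->r, left + 1);
        printf("add_i I%d, I%d, I%d\n", left, left, right);
        return left;
    }

Smarter register allocation (Sethi-Ullman numbering and its descendants) is 
decades-old literature, which is exactly the point.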

>A related point
>is that if parrot is meant to run code from different high level
>languages, it needs also to be simple for those languages to
>generate parrot bytecode.

Most of that will be dealt with in the parser and in the variable code. 
There will likely be little to no change needed in the compilation pass.

We're all just Algol with a different funny hat on, regardless of what many 
people might want to think. There's far less difference between perl, 
python, ruby, and Ada than between any one of us and Forth or Lisp. We all 
do the same stuff; by far the biggest difference is in how we spell things.

> > Why on earth are you assuming we're going to toss the optree, DAGs, or
> > even the source? That's all going to be kept in the bytecode files.
> >
> > Also, even if that stuff is tossed, parrot bytecode makes a reasonable
> > MIR. And some recent research indicates that even without all the
> > intermediate stuff saved you can buy a win. (There are currently
> > executable translators that go from one CPU's machine language to
> > another directly. The resulting executables run as fast or faster in
> > most cases. I've even heard reports of one of the translators taking
> > 386 executables and translating them out and back into 386 executables
> > that ran faster. I have no first-hand proof of that, though.)
>
>That's all well and good, but most of the research _available_ is
>not about binary translation. If we have on board people who have done
>such research, all the better.

And the potential for not having those people is why we're not tossing the 
info. ;-) We're also keeping it around for our own usage, since it makes 
runtime optimization of precompiled code easier.

> > >All the steps are well documented and understood in the research
> > >(and especially open source) community.
> >
> > So what? Lots of things are well understood. That doesn't mean they're
> > particularly ideal. I don't much care if the OS community doesn't have
> > much experience with this sort of thing--if it's faster perl wins. And the
>
>It's the same path perl took a while ago: it was fast and powerful, and
>dug itself into a hole of hacks and unmaintainable code, and now
>it's lagging behind. I care for Perl, I want it to be fast, too, but
>if it's not maintainable it won't be fast for long.

Perl's lack of maintainability is completely separate from its stack-based 
approach. In some ways, some of the stack-based things it does exacerbate 
the maintainability problem, but we'd have the same problems in a different 
guise if we weren't stack-based. All software inherently sucks, and it 
sucks worse the older it gets. (Keeping that one thing in mind will 
probably help more than any other single thing...:)

> > >Another point I'd like to make is: keep things simple.
> >
> > No. Absolutely not. The primary tenet is "Keep things fast". In my
> > experience simple things have no speed benefit, and often have a speed
> > deficit over more complex things. The only time it's a problem is when the
>
>Simple is not necessarily dumb. Merge sort is 'simpler' than qsort, though
>in the end it was chosen as the implementation for the perl sort operator.

Its simplicity wasn't the reason it was chosen. (And I'd argue about 
whether it's simpler.) If we were going for simple, we'd choose bubble sort.

>Oh, this is listed under 'Performance enhancements' in perldelta.
>I'll reword it: keep things simple (but not too simple:-).

That's much better, and appropriate.

> > >A simple design is not necessarily slow: complex stuff adds
> > >dcache and icache pressure, it's harder to debug and optimize
> >
> > Now this is an issue. I&D cache pressure is a problem, yes. But one of
> > code density, not of a register or stack architecture. (And given perl's
> > general execution patterns, individual ops will end up swamping cache far
> > more than the interpreter itself, no matter how you write the interpreter)
>
>32 integer regs + 32 float regs + 32 pmc regs is at least 512 bytes
>of memory that can be randomly accessed. This would not be a problem
>if parrot didn't contain low-level opcodes, but it does.

That's still a fraction of a single page of memory, and I think you'll find 
it fits into your average L1 cache quite nicely. You'll also find that with 
stack-based approaches there are more accesses to non-local memory within 
loops, and more memory accesses in general.

Yes, there is the potential for a larger local working set, but loops 
generally use only a subset of the registers, and making the loops tighter 
is generally of more use than making code that executes once tighter.

I'm willing to double the time that straight-line code executes in if we 
chop 20% off the time that a loop iteration takes.
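
The arithmetic, assuming 32-bit ints and pointers and 64-bit floats (adjust 
to taste):

    struct RegFile {
        int    i[32];       /* 128 bytes */
        double n[32];       /* 256 bytes */
        void  *p[32];       /* 128 bytes of PMC pointers */
    };                      /* 512 bytes total: 16 lines of a
                               32-byte-line L1 cache */

Even doubling the integer size leaves the whole register file well under 
1K, a small slice of even a late-90s L1 data cache.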

>Dum de dum. I'd like to play the 'bash m$ game' myself, but this is not
>on topic here, I guess;-)

Fair enough, and it was a bit uncalled-for on my part.

>'byte' code existed way before m$ came along, there is no point
>calling for an updated Godwin's law:-)
>I gave the example in IL because I'm familiar with it, but any
>other bytecode would have been of roughly the same length.

I'm comparing things to perl 5's code stream, since it's more appropriate. 
We're both smaller and tighter than it is, and I'm pretty sure the same 
goes for other interpreted languages. (I've not dived too deeply into 
python's and ruby's code yet.) Even if we're not, perl 5 is still faster 
than the rest, even with its version-by-version slowdown, so beating perl 5 
means we beat everyone else. And beating perl 5 is the goal.

> > >A 4x-5x explosion in bytecode size will give a pretty big hit, both on
> > >memory usage and cache thrashing: consider using _byte_ codes.
> >
> > We can't use byte codes. Perl right now has more than 255 ops. We'd
> > need to
>
>Most Perl5 ops don't need to be Perl6 ops: most of them are ops because
>calling a subroutine in perl5 is too expensive.

No, they're single ops because calling an *op* in perl is too expensive. 
We're going to have monster ops in perl 6 too, for the same reason. No 
matter how fast our opcode execution is, executing the same amount of 
low-level code in X ops is going to be faster than executing it in 
X*2 ops.
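
The reason is the fixed per-op tax. A sketch of a function-table dispatch 
loop (illustrative, not Parrot's actual inner loop):

    typedef struct Op Op;
    typedef Op *(*op_func)(Op *pc);

    struct Op {
        op_func fn;             /* the op's implementation     */
        int     args[3];        /* register numbers, constants */
    };

    void run(Op *pc) {
        while (pc)
            pc = pc->fn(pc);    /* fetch + indirect call, paid per op
                                   whether the op does one add or a
                                   whole sort */
    }

Halve the op count and you've halved that tax, no matter how fast the loop 
itself is.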

>Optimize for
>the common case and use a prefix for the extra opcodes you may need.

Yech. No thanks. Prefix opcodes swapping function tables is one thing I 
don't want.

And yes, I fully understand this is an irrational dislike on my part, so 
it's on the list of things to occasionally revisit because of that. Page 
switching's never worked well in any of the places I've seen it used, though.

> > go with at least a 16-bit size, and then the cache savings is far, far
> > less. (Any scheme to shoehorn perl's op list into 8 bits will bring in a
> > mess of complexity that makes the register/stack decision you're worrying
> > about pale in comparison) Going with less than 32 bits also brings in
> > problems with branch distances, constants, and a variety of other things.
>
>Care to elaborate on that? What have constants to do with the size of the
>byte codes? Branch distance would be in bytes, so I don't see any problem 
>there.

If the parameters in our code are always X bytes, we're limited to what 
will fit in a word of that size. You're not, I hope, considering having 
variable-length parameters serialized into bytes? Something like:

     branch -12335

is awfully expensive if branch is a single byte and the offset either has a 
length marker or is encoded in two or three bytes. Reassembly of stuff like 
that is pricey. If the offset is 4 bytes, then we run afoul of unaligned 
word access, which costs everywhere. (Even on x86 chips.)
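
Contrast an all-32-bit stream, where the opcode and each parameter are one 
aligned word apiece and decoding a branch is two array reads. A sketch 
(with the offset counted in words rather than bytes here, and assuming the 
loader has already made the stream native-endian):

    #include <stdint.h>

    int32_t *op_branch(int32_t *pc) {
        return pc + pc[1];      /* pc[0] is the opcode, pc[1] the
                                   offset: always aligned, always 4
                                   bytes, no length marker to parse */
    }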

> > We might, *might*, drop down to single bytes for register numbers, but
> > that complicates loading wrong-endian bytecode, doing naive transforms on
>
>Can you explain why a single byte would give more problems than an integer
>you need to byteswap?

Having to byteswap is cheap and a one-off thing if all the words in the 
instruction stream are the same length: load the file, swap all the 32-bit 
words, and execute the now-native-format stream. That's faster than doing 
it in a format where you actually need to interpret the code to know what 
the next parameter is and how big it is.
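
The whole fixup is one dumb loop over the file, something like this (a 
sketch, assuming 32-bit words throughout):

    #include <stddef.h>
    #include <stdint.h>

    void swap_words(uint32_t *w, size_t n) {
        size_t i;
        for (i = 0; i < n; i++)         /* one pass, no decoding */
            w[i] = (w[i] >> 24)
                 | ((w[i] >> 8) & 0x0000FF00)
                 | ((w[i] << 8) & 0x00FF0000)
                 | (w[i] << 24);
    }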

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
