>>>>> "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes:
DS> I don't buy that there's a higher bar on comprehension,
DS> either. Register machines in general aren't anything at all
DS> new. Granted, lots of folks grew up with the abomination that is
DS> x86 assembly, if they even bothered hitting assembler in the first
DS> place, but picking up a new, relatively straightforward,
DS> architecture's not tough. Anyone who can manage a stack machine
DS> can handle a register one and, having programmed in both 68K
DS> assembly and Forth at the same time, I can say that for me a
DS> register system is *far* easier to keep a handle on.
does comprehension really matter that much here? this is not going to be
used by the unwashed masses. a stack machine is easier to describe (hence
all the freshman CS projects :), but as dan has said, there isn't much
mental difference if you have done any serious assembler coding. is the
pdp-8 (or the teradyne 18-bit extended rip-off of it), with its single
register (the accumulator), a register machine or a stack-based one?
>> You say that, since a virtual register machine is closer to the actual hw
>> that will run the program, it's easier to produce the corresponding
>> machine code and execute that instead. The problem is that this is true
>> only on architectures where the virtual machine closely matches the
>> cpu, and I don't see that happening with the current register-starved
>> main archs.
DS> What, you mean the x86? The whole world doesn't suck. Alpha,
DS> Sparc, MIPS, PA-RISC, and the PPC all have a reasonable number of
DS> registers, and by all accounts the IA64 does as well. Besides, you
DS> can think of a register machine as a stack machine where you can
DS> look back in the stack directly when need be, and you don't need
DS> to mess with the stack pointer nearly as often.
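
to make dan's point concrete, here is a tiny c sketch (all the names are
made up, none of it is real parrot code) of what "a stack you can look
back into directly" means: any register slot is addressable at any time,
while a plain operand stack only hands you its top.

    /* A minimal sketch, assuming an invented interpreter core.  The
     * register file is directly indexable; the operand stack has to be
     * shuffled through its top via the stack pointer. */
    typedef struct {
        long regs[32];      /* register file: any slot reachable directly */
        long stack[256];    /* operand stack, for contrast                */
        int  sp;            /* only the stack model needs a pointer       */
    } vm_core;

    /* Register style: name the operands, no stack pointer traffic. */
    static void add_reg(vm_core *vm, int dst, int a, int b)
    {
        vm->regs[dst] = vm->regs[a] + vm->regs[b];
    }

    /* Stack style: the same add must pop both operands, push the result. */
    static void add_stack(vm_core *vm)
    {
        long b = vm->stack[--vm->sp];
        long a = vm->stack[--vm->sp];
        vm->stack[vm->sp++] = a + b;
    }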
but it doesn't matter what the underlying hardware machine is. that is
the realm of the c compiler. there is no need to think about mapping
parrot onto any particular real hardware. each cpu will have its own
benefits and disadvantages running parrot, and we can't control
that. our goal is a VM that is simple to generate code for and to
interpret, while still being powerful. stack-based VMs aren't as
flexible as a register design. dan mentions some of the reasons below.
>> The literature is mostly concerned with getting code for real register
>> machines out of trees and DAGs, and the optimizations are mostly of two
>> types:
>> 1) general optimizations that are independent of the actual cpu
>> 2) optimizations specific to the cpu
>>
>> [1] can be done in parrot whether the underlying virtual machine is
>> register or stack based; it doesn't matter.
DS> Actually it does. Going to a register machine's generally more
DS> straightforward than going to a stack based one. Yes, there are
DS> register usage issues, but they're less of an issue than with a
DS> pure stack machine, because you have less stack snooping that
DS> needs doing, and reordering operations tends to be simpler. You
DS> also tend to fetch data from variables into work space less often,
DS> since you essentially have more than one temp slot handy.
that more-than-one-temp-slot point is a big win IMO. with a stack-based
design you typically have to push/pop all the time to get anything
done. here we have 32 PMC registers, and you can grab a bunch, save
them, and then use them directly (see the sketch below). that makes
coding the internal functions much cleaner. if you have ever programmed
a register cpu vs. a stack one, you will understand. having clean
internal code is a major win for register based. we all know how
critical it is to have easy-to-grok internals. :)
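
a quick illustration of the "grab a bunch and save them" idea. this is a
hedged sketch with invented names, not parrot source; the point is just
that an internal routine can save a block of registers in one shot and
then scribble on the slots directly, with no per-value push/pop dance.

    #include <string.h>

    /* Invented register frame; only the shape matters here. */
    struct reg_frame { void *pmc[32]; };

    /* Save the upper half of the PMC registers in one memcpy ... */
    static void save_high_pmc_regs(struct reg_frame *frame, void *saved[16])
    {
        memcpy(saved, &frame->pmc[16], 16 * sizeof(void *));
    }

    /* ... and put them back the same way when the routine is done. */
    static void restore_high_pmc_regs(struct reg_frame *frame, void *saved[16])
    {
        memcpy(&frame->pmc[16], saved, 16 * sizeof(void *));
    }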
>> [2] will optimize for the virtual machine and not for the
>> underlying arch, so you get optimized bytecode for a virtual
>> arch. At that point, though, when you need to actually execute the
>> code, you won't be able to optimize further for the actual cpu,
>> because most of the useful info (op trees and DAGs) is gone, and
>> there is far less info in the literature about emulating a CPU than
>> about generating machine code from op-trees.
DS> Why on earth are you assuming we're going to toss the optree,
DS> DAGs, or even the source? That's all going to be kept in the
DS> bytecode files.
also we are not directly targeting any real machines. you have to
separate the VM architecture from any cpu underneath. other than for TIL
or related stuff, parrot will never know or care about the cpu it is
running on.
>> Another point I'd like to make is: keep things simple.
DS> No. Absolutely not. The primary tenet is "Keep things fast". In my
DS> experience simple things have no speed benefit, and often have a
DS> speed deficit over more complex things. The only time it's a
DS> problem is when the people doing the actual work on the system
DS> can't keep the relevant bits in their heads. Then you lose, but
DS> not because of complexity per se, but rather because of programmer
DS> inefficiency. We aren't at that point, and there's no reason we
DS> need to be. (Especially because register machines tend to be
DS> simpler to work with than stack machines, and the bits that are
DS> too complex for mere mortals get dealt with by code rather than
DS> people anyway)
amen. i fully agree, register machines are much simpler to code with. i
don't know why paolo thinks a register design would be harder to code to.
>> A simple design is not necessarily slow: complex stuff adds
>> dcache and icache pressure, and it's harder to debug and optimize.
DS> Now this is an issue. I&D cache pressure is a problem, yes. But
DS> one of code density, not of a register or stack architecture. (And
DS> given perl's general execution patterns, individual ops will end
DS> up swamping cache far more than the interpreter itself, no matter
DS> how you write the interpreter)
and again, that is a c issue and not a VM issue. there isn't any easy
way to control cpu caching from such a high-level perspective. we are
trying to make coding ops (writing the c code) and generating ops (from
the compiler) as easy and efficient as possible.
>> A 4x-5x explosion in bytecode size will give a pretty big hit, both on
>> memory usage and cache thrashing: consider using _byte_ codes.
DS> We might, *might*, drop down to single bytes for register numbers,
DS> but that complicates loading wrong-endian bytecode, doing naive
DS> transforms on bytecode (though whether naive anything at this
DS> level is a good idea is a separate question), and has the
DS> potential to shoot performance down badly. Plus, since we'd
DS> probably pad to either a 16 or 32 bit boundary there might not be
DS> much speed win. If any at all.
on some machines, like the older alphas, fetching single bytes is
slower than fetching aligned 32-bit words.
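
to make the trade-off concrete, here is a speculative sketch of the two
instruction layouts being discussed. the structs are invented for
illustration only, not the real parrot bytecode format.

    #include <stdint.h>

    /* Word-per-operand: bigger, but easy to index, endian-swap, and
     * fetch with aligned 32-bit loads. */
    struct add_wide {
        uint32_t opcode;
        uint32_t dst, src1, src2;       /* 16 bytes per instruction */
    };

    /* Byte operands: a quarter of the operand space, but the operand
     * reads become byte loads, which the early alphas (no byte
     * load/store instructions) emulate with a wider fetch plus shift
     * and mask. */
    struct add_packed {
        uint32_t opcode;
        uint8_t  dst, src1, src2, pad;  /* 8 bytes per instruction  */
    };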
DS> This, though, is a spot we should investigate and do some
DS> benchmarking on.
definitely. and it will be easy to change and test, since only two
major places will know about it: the op code issue sub in the compiler
and the op code dispatch loop in the interpreter. the optimizer will
know too, but it can be isolated there as well. so we can play with
various op code and register number sizes without much work (a rough
sketch of how that isolation could look is below), and benchmark them
on as many popular systems as we can lay our hands on.
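
here is a rough sketch (not real parrot code) of that isolation: the
operand cell size lives behind one typedef and one fetch macro, so
flipping it for a benchmark run touches nothing else in the loop.

    #include <stdint.h>

    typedef uint32_t opcell_t;          /* try uint8_t / uint16_t later  */
    #define FETCH(pc) (*(pc)++)         /* the one spot that knows size  */

    enum { OP_END = 0, OP_ADD_I = 1 };

    /* Toy dispatch loop over a stream of opcell_t-sized cells. */
    static void run(const opcell_t *pc, long ireg[32])
    {
        for (;;) {
            switch (FETCH(pc)) {
            case OP_END:
                return;
            case OP_ADD_I: {
                opcell_t dst = FETCH(pc), a = FETCH(pc), b = FETCH(pc);
                ireg[dst] = ireg[a] + ireg[b];
                break;
            }
            }
        }
    }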
DS> I think you'll find that most perl loops will either fit in L2
DS> cache (possibly L1) or blow cache entirely doing data
DS> operations. Either way the difference doesn't much matter.
i tend to agree. we do expect the main op code dispatch loop to stay in
cache as it should be small and tight. the rest will be op code
functions and they will compete for the cache like any other large C
program.
side note: parrot will not have much hardware stack usage, as we do all
the stack stuff in the VM. the op code dispatch loop will call an op
wrapper function, which will grab the args and call the op code
function (via a vtable). that will then return all the way back to the
main loop (a rough sketch of that call chain is below). the op code
itself may use the cpu stack, but i don't expect it to be used deeply
by many ops. that is the extent of parrot's cpu stack usage. there may
be some exceptions with signals and events, but it will (almost) never
be deep. this means that threads can be created with normal-sized
stacks (remember, ithreads means a parrot engine per system thread),
which will reduce disk and cache thrashing too.
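
a hedged sketch of that call chain, with invented names and types, just
to show how few c frames are ever live at once: dispatch loop -> op
wrapper -> vtable op function -> back to the loop.

    typedef struct pmc    PMC;
    typedef struct interp Interp;

    struct pmc_vtable { void (*add)(Interp *, PMC *, PMC *, PMC *); };
    struct pmc        { struct pmc_vtable *vtable; };
    struct interp     { PMC *pmc_reg[32]; };

    /* Op wrapper: decode the register numbers out of the instruction
     * stream, call through the destination PMC's vtable, and hand the
     * next program-counter position back to the dispatch loop. */
    static const long *op_add_p(Interp *interp, const long *args)
    {
        PMC *dst = interp->pmc_reg[args[1]];
        PMC *a   = interp->pmc_reg[args[2]];
        PMC *b   = interp->pmc_reg[args[3]];
        dst->vtable->add(interp, dst, a, b);
        return args + 4;    /* 1 opcode + 3 operands consumed */
    }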
uri
--
Uri Guttman --------- [EMAIL PROTECTED] ---------- http://www.sysarch.com
SYStems ARCHitecture and Stem Development ------ http://www.stemsystems.com
Search or Offer Perl Jobs -------------------------- http://jobs.perl.org