Patches welcome.

=head1 Introduction

This is not a design document; it's a meta-design document - that is, it
tells us what things we need to design and what we need to consider
during the design process of the Perl 6 internals.

It's completely unofficial, it's completely my opinion, it's merely
meant to serve as a map - so we know where we are and where we have to
go - and also as an I<aide memoire> to make sure we've thought of
everything.

It should also help to explain, to those new to this sort of design,
exactly what each "part" of the finished product should do or represent.
It's not a substitute for a good book on compiler design, but it should
help to relate the concepts you'll find in one to what's going on in the
Perl 6 effort.

=head1 General Design Decisions

=head2 Introduction

There are a number of general, high level design decisions we need to
take before writing any code at all. Many of them we haven't even faced
yet, so I don't know what they are, but a few have already cropped up.

=head2 Implementation Language

The most obvious question is: what language are we going to write Perl 6
in? Of the suggestions that have been advanced so far, four are worthy
of more consideration: C, C++, Java and a specially-designed Perl
Implementation Language (PIL).

A brief note on the merits and demerits of each language: PIL would
allow us to write clean code and get away from the macro hell of Perl 5
(as a sort of "super preprocessor"), and could be translated to any of a
number of languages; unfortunately, we'd have to design PIL and write
translators for it - designing one language is fine, designing two seems
like overkill. Java is portable and gives us OO, but it's slow and ugly.
C++ gives us OO and headaches, is wildly non-portable due to a lack of
decent implementations, and we don't have enough experience of it. C's
portable and everyone knows it, but it's a swine for doing OO things.

=head2 Frangibility

It's become clear we want to be able to create multiple flavours of Perl
- compilers, interpreters, virtual machines for tiny platforms, embedded
interpreters, and so on. What I'm calling "frangibility" is a collection
of design decisions: where do we make speed/space tradeoffs? How do we
break off the back-end from the front-end? What other parts of the Perl
kit will we break off and dissociate? Will we ever be able to do it so
they are completely independent? 

Why does string C<eval> have to screw everything up?

=head1 General API

=head2 Introduction

What I'm calling the "general API" is the stuff that Perl is made of -
the utility functions that will form the basis of the compiler and
interpreter.

=head2 Variable Storage

We need to think about the data structures that will replace SVs, HVs
and AVs. I predict that such structures will be extensively used by the
internals themselves, so they need to be fast and efficient. 

We seem to have settled on a vtable model for SVs, which is cool. (A
vtable model means that each SV carries around with it a bunch of
function pointers which tell it how to "do stuff". Think of it like the
interface to L<Math::BigInt> - C<< $number->badd(10) >>.) Now we have to
work out the exact vtable structure that we're going to use, along with
what other flags and things we need to hang off an SV.
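
To make the idea concrete, here's a minimal sketch in C of what a
vtable-based value might look like. The names (C<PerlValue>,
C<PerlValueVtable>, C<get_integer> and friends) are invented for the
example and are not a proposed API:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    typedef struct PerlValue PerlValue;

    /* The "bunch of function pointers" that tells a value how to do stuff. */
    typedef struct {
        long (*get_integer)(PerlValue *v);
        void (*set_integer)(PerlValue *v, long i);
        void (*destroy)(PerlValue *v);
    } PerlValueVtable;

    struct PerlValue {
        PerlValueVtable *vtable;   /* how to do stuff           */
        void            *body;     /* type-specific data        */
        unsigned         flags;    /* whatever flags we hang on */
    };

    /* One flavour of value - a plain integer - wired up to the vtable. */
    static long int_get(PerlValue *v)         { return (long)(intptr_t)v->body; }
    static void int_set(PerlValue *v, long i) { v->body = (void *)(intptr_t)i; }
    static void int_destroy(PerlValue *v)     { free(v); }

    static PerlValueVtable int_vtable = { int_get, int_set, int_destroy };

    int main(void) {
        PerlValue *v = malloc(sizeof *v);
        v->vtable = &int_vtable;
        v->flags  = 0;
        v->vtable->set_integer(v, 42);
        printf("%ld\n", v->vtable->get_integer(v));   /* prints 42 */
        v->vtable->destroy(v);
        return 0;
    }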

How do we represent strings? Is there going to be an internal character
set? How do we do conversion to and from it?

AVs - can we do anything better than an array of pointers? Is there a
way to combine the insertion efficiency of a doubly-linked list with the
access efficiency of a pointer array? Can we steal anything useful -
ideas or code - from the C<glib> implementation of pointer arrays, or
any other similar library? 

HVs - a hash is a hash is a hash, so fundamentally there's little we can
do with HVs. What have we learnt from Perl 5's hash implementation? Is
there a way we can get SV keys with the efficiency of string keys?

Let's also not forget that we need to provide a method of allowing
user-defined data types. How does this fit in with the vtable model? Do
we expect people to provide functions to fit a variable API - if so, we
need to define that API - or do we allow inheritance from a base
data type? Is this the sort of thing that's likely to replace ties?

=head2 Variable Manipulation

When we have our variable structures decided - or even before - we need
to know what we're planning on doing with them. What vtable operations
will we need? We need to look over the Perl 5 code - specifically the
variable operations API - and see which functions we used, which we
didn't, and which we needed but never had.

=head2 Op Structure

An B<op> is a fundamental operation - fetch the value of a variable, add
two values together, match a regular expression, and so on. Ops form
part of an B<op tree>, which is a data structure representing the input
Perl code.

Ops in Perl 5 were horribly scary beasts, with all manner of bizarre
flags and fields, and multiple different ways of connecting the tree
together, and monstrous tentacles reaching hither and thither.
(C<op_first>, C<op_next>, C<op_last>, C<op_sibling>. Yum.)
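
For reference, the linkage looks roughly like this - a heavily
simplified sketch of the arrangement in Perl 5's F<op.h>, with most
fields and most of the op subtypes left out:

    /* Roughly the linkage Perl 5 uses - simplified, not the real op.h. */
    typedef struct op OP;

    struct op {
        OP   *op_next;              /* next op in execution order            */
        OP   *op_sibling;           /* next op at the same level of the tree */
        OP  *(*op_ppaddr)(void);    /* C function implementing the op        */
        unsigned short op_type;     /* which op this is (add, match, ...)    */
        unsigned char  op_flags;
        unsigned char  op_private;
    };

    /* Ops with children get extra link fields in larger structs: */
    typedef struct unop {
        OP  op;                     /* all the fields above */
        OP *op_first;               /* the child            */
    } UNOP;

    typedef struct listop {
        OP  op;
        OP *op_first;               /* first child                     */
        OP *op_last;                /* last child; the children between
                                       them are chained via op_sibling */
    } LISTOP;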

Is there a better way to organise them? What have we learnt that we need
from ops in Perl 5? For instance, we've learnt that we need a generic
area to store C<COP> specific information, such as C<cop_warnings> and
C<cop_arybase>, which will no doubt extend as more pragmatic modules set
block-specific information.

=head2 Op Manipulation

We need an API for constructing, filling and manipulating op structures.
This isn't just for internal use: if Perl is going to be an environment
for language construction - which I shall translate into internal terms
as "we will compile or interpret op trees which the user feeds us, by
whatever means" - then the API will need to work externally too.

Today's scary thought:

    use B; my $main_root = new B::OP;
    ...
    my $int = new B::Interpreter; $int->run($main_root)

=head2 IO Library

We need a good, layered, platform-independent IO implementation. Nick
Ing-Simmons has this licked for Perl 5. What sort of an API should it
provide? Must C<FILE> objects die?

=head2 Memory Allocation

If we're going with C, which I shall assume we are, we have to handle
memory allocation both in the compiler and in the interpreter ourselves.
Perl 5 currently has its own C<malloc> implementation, which can be 
more efficient than the system's C<malloc>. But it could be improved.
Memory allocation's pretty much a solved problem, so this shouldn't
require much thought.
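
For illustration only: one reason a custom allocator can beat the
system's C<malloc> is that Perl allocates enormous numbers of small,
same-sized structures, which suit a simple free-list-over-arenas scheme.
This sketch just shows the shape of the idea; the names and sizes are
made up:

    #include <stdlib.h>

    /* Hand out fixed-size chunks from big arenas; free puts them on a list. */
    #define CHUNK_SIZE        64      /* e.g. sizeof our value struct */
    #define CHUNKS_PER_ARENA  1024

    typedef union chunk { union chunk *next; char space[CHUNK_SIZE]; } Chunk;
    static Chunk *free_list = NULL;

    static void grow_arena(void) {
        Chunk *arena = malloc(CHUNKS_PER_ARENA * sizeof(Chunk));
        if (!arena) abort();                 /* Perl 5's malloc croaks here */
        for (int i = 0; i < CHUNKS_PER_ARENA; i++) {
            arena[i].next = free_list;       /* thread chunks onto the list */
            free_list = &arena[i];
        }
    }

    void *chunk_alloc(void) {
        if (!free_list) grow_arena();
        Chunk *c = free_list;
        free_list = c->next;
        return c;
    }

    void chunk_free(void *p) {
        Chunk *c = p;
        c->next = free_list;                 /* O(1), no system call */
        free_list = c;
    }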

=head1 Front-end

=head2 Introduction

The B<front-end> to a compiler is the part which turns the input code
into some kind of internal representation, usually an op tree as
mentioned earlier; this is called "parsing", although parsing is also
used to refer to just one of the stages in this process. Unfortunately,
Perl code is really very hard to parse, so we face some issues that
other languages don't.

For information, the usual stages that a front-end goes through are:

=over 3

=item tokenising

Splits the input code into meaningful chunks called "tokens" - for
instance C<$a + $b> would be tokenised into C<$a>, C<+> and C<$b>.
(White space is not meaningful, except when it is, if you see what I
mean.)

=item lexical analysis

Tells you what "part of speech" all of the tokens are: identifier,
binary operator, string, and so on. In many compilers - and indeed in
Perl 5 - tokenising and lexical analysis are combined, so people
generally say "tokenising" to mean both.

=item parsing

Arranges the tokens in terms of meaning. For instance, C<$a = $b + $c>
will be tokenised into 
C<identifier($a) ASSIGN identifier($b) PLUS identifier($c)>; the parser
will then arrange that into a data structure in a tree format, to
recognise that, for instance, you're not setting C<$a> equal to the
value of C<$b> and then adding C<$c> to the result. 

=back

=head2 Design Decisions

The major decision we have to take here is what level of integration we
want between the three stages - whether to combine them all into a large
"turn Perl into op tree" component, or to enforce clear lines of
demarcation between them. 

The problem with splitting them up is that they pretty much have to act
in concert anyway due to the nature of the Perl language; the problem
with munging them all together is that we end up with a horrific mush
that nobody understands. (At the moment, Perl 5 has two separate
horrific mushes that nobody understands.)

=head2 Parser

There are two major sets of techniques for parsing - top-down and
bottom-up. Top-down parsing says "we need to make a statement out of
this piece of code - out of the various ways of making a statement,
which does this code look most like?"; bottom-up parsing starts with the
code and tries to categorise it in ever-higher structures until you've
got a statement at the end. Bottom-up parsing is much more conducive to
a three-in-one parser, because you essentially treat every single
character as a different token.
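
To make "top-down" concrete, here's a toy recursive-descent sketch in C
which turns a pre-tokenised C<$a = $b + $c> into a tree. Every name and
type in it is invented for the example, and it bears no relation to how
a real Perl parser would have to work:

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { T_IDENT, T_ASSIGN, T_PLUS, T_EOF } TokType;
    typedef struct { TokType type; const char *text; } Token;

    typedef struct Node {
        const char  *what;          /* "assign", "add" or an identifier name */
        struct Node *kids[2];
    } Node;

    static Token *toks;             /* token stream; pos is the cursor */
    static int    pos;

    static Node *mk(const char *what, Node *l, Node *r) {
        Node *n = malloc(sizeof *n);
        n->what = what; n->kids[0] = l; n->kids[1] = r;
        return n;
    }

    /* Each grammar rule becomes a function: classic top-down parsing. */
    static Node *parse_ident(void) {
        return mk(toks[pos++].text, NULL, NULL);
    }

    static Node *parse_expr(void) {           /* expr := ident ( '+' ident )* */
        Node *left = parse_ident();
        while (toks[pos].type == T_PLUS) {
            pos++;
            left = mk("add", left, parse_ident());
        }
        return left;
    }

    static Node *parse_stmt(void) {           /* stmt := ident '=' expr */
        Node *target = parse_ident();
        pos++;                                /* skip '=' (no error checks) */
        return mk("assign", target, parse_expr());
    }

    static void dump(Node *n, int depth) {
        printf("%*s%s\n", depth * 2, "", n->what);
        for (int i = 0; i < 2; i++)
            if (n->kids[i]) dump(n->kids[i], depth + 1);
    }

    int main(void) {
        Token input[] = {                     /* pre-tokenised "$a = $b + $c" */
            { T_IDENT, "$a" }, { T_ASSIGN, "=" },
            { T_IDENT, "$b" }, { T_PLUS, "+" }, { T_IDENT, "$c" },
            { T_EOF, "" }
        };
        toks = input;
        dump(parse_stmt(), 0);                /* assign($a, add($b, $c)) */
        return 0;
    }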

There are several other things we need to think about here, but I can't
remember them.

=head2 Parser Interface

Again, if we're going to make Perl a platform for building languages,
we need to make it easy to replace Perl's parser with a parser for other
languages, and the way to do this is to have a clearly defined interface
by which programmers can write their replacement. Exactly how this is
done depends to an extent on how we divide up the stages of the parser -
whether people have to replace the entire thing themselves, being fed
code and being expected to produce an op tree (see L</Op Manipulation>),
or whether they can provide, say, a "drop-in" tokeniser.

=head1 Back-end(s)

=head2 Introduction

If a front-end turns code into a data structure, a B<back-end> turns -
you're not going to like this - a data structure into code. Different
code, usually, else it's no fun. The idea is that the front-end produces
an intermediate representation that's then easy for us to convert to
assembler or Java or whatever, or even run - that is, interpret. The
back-end does that conversion.

=head2 Optimizer

Before we do this, though, we need to look over the op tree and see if
we can make it any faster. There are a number of ways we do this:
peephole optimization looks for sequences of ops that can be replaced by
a single op; checking looks at each op and determines whether or not
it's necessary; constant folding works out the value of anything that can
be calculated at compile time to save doing it at run time.
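
As an illustration of what constant folding means at the op tree level,
here's a toy sketch in C; the node layout is invented for the example,
not a proposal:

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { OP_CONST, OP_ADD } OpType;

    typedef struct Op {
        OpType      type;
        long        value;          /* only meaningful for OP_CONST */
        struct Op  *left, *right;   /* children, for OP_ADD         */
    } Op;

    static Op *mk_const(long v) {
        Op *o = calloc(1, sizeof *o);
        o->type = OP_CONST; o->value = v;
        return o;
    }

    static Op *mk_add(Op *l, Op *r) {
        Op *o = calloc(1, sizeof *o);
        o->type = OP_ADD; o->left = l; o->right = r;
        return o;
    }

    /* Fold bottom-up: if both children of an add are constants,
     * replace the add with a single constant op at compile time. */
    static Op *fold(Op *o) {
        if (o->type == OP_ADD) {
            o->left  = fold(o->left);
            o->right = fold(o->right);
            if (o->left->type == OP_CONST && o->right->type == OP_CONST) {
                Op *c = mk_const(o->left->value + o->right->value);
                free(o->left); free(o->right); free(o);
                return c;
            }
        }
        return o;
    }

    int main(void) {
        Op *tree = mk_add(mk_const(2), mk_add(mk_const(3), mk_const(4)));
        tree = fold(tree);                    /* 2 + (3 + 4) becomes 9 */
        printf("type %d, value %ld\n", tree->type, tree->value);
        return 0;
    }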

One optimization that we might want to think about is type inferencing.
It B<might> - and I repeat, might - be possible to infer optimal
variable types in certain circumstances. Similarly, we might want to
look into liveness analysis, depending on the GC model we choose.
However, as usual, the possibility of on-the-fly compilation and
interpreting interferes with this horribly. Are there any further
optimizations we can do which won't get clobbered by this? If we're
having user-defined ops, how can we possibly hope to optimize?

=head2 Garbage Collection

Data doesn't hang around in a program for ever, and eventually, you'll
end up with data that you're not using any more just sitting around in
memory. A garbage collector frees memory that is taken up by data which
is no longer useful.

There are several different garbage collection algorithms, with
different efficiency tradeoffs - Perl 5 uses reference counting, which
is the simplest system: Perl adds one to a number when something uses a
piece of data, and decreases the number when that thing stops using the
data. When the number hits zero, the data gets removed. This has
problems with circular references. There's also the mark-and-sweep
method, where data gets "marked" if it's in use and eventually, a
"sweep" is made over the whole storage space and anything that isn't
marked gets removed. This means, however, that you have to stop
processing periodically and sweep the storage space. (There are subtle
ways to minimise the sweeping by having "layers" of storage.)
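
The reference counting half of that looks something like the following
sketch (the type and function names are invented for the example):

    #include <stdlib.h>

    typedef struct RefCounted {
        unsigned refcount;
        void   (*destroy)(struct RefCounted *self);
        /* ... the actual data lives here ... */
    } RefCounted;

    /* Something starts using the data: add one to the count. */
    static void rc_incref(RefCounted *p) {
        p->refcount++;
    }

    /* Something stops using the data: drop the count, and free the
     * data the moment nothing refers to it any more.  Two structures
     * pointing at each other never reach zero - the circular
     * reference problem mentioned above.                           */
    static void rc_decref(RefCounted *p) {
        if (--p->refcount == 0)
            p->destroy(p);
    }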

Decisions of GC models rapidly become religious arguments.

=head2 Interpreter

Do we go for a register-based or a stack-based interpreter? Does it matter?
What op despatch models are there, and which suits us best? What about
signals? What interpreter-specific data do we need? How do we deal with
the C<IMPLICIT_CONTEXT> problem, and also the namespace issue?
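
For reference, the simplest despatch model - essentially what Perl 5's
C<runops> loop does - is just a loop over function pointers, where each
op does its work and hands back the next op to run. A sketch, with
illustrative names rather than Perl 5's real ones:

    #include <stdio.h>

    typedef struct Op Op;
    struct Op {
        Op *(*run)(Op *self);   /* do the work, return the next op  */
        Op  *next;              /* next op in execution order       */
        long arg;
    };

    static long accumulator;

    static Op *op_add(Op *self)   { accumulator += self->arg; return self->next; }
    static Op *op_print(Op *self) { printf("%ld\n", accumulator); return self->next; }

    /* The whole interpreter core: keep calling ops until one returns NULL. */
    static void run_ops(Op *op) {
        while (op)
            op = op->run(op);
    }

    int main(void) {
        Op print = { op_print, NULL,   0 };
        Op add2  = { op_add,   &print, 2 };
        Op add40 = { op_add,   &add2,  40 };
        run_ops(&add40);        /* prints 42 */
        return 0;
    }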

Basically, all the ugly hacks in Perl 5 are there to work around real
problems - now we have the chance to see if we can come up with real
solutions.

=head2 Code Generator

The code generator back-ends will spit out assembler or Java bytecode or
Perl bytecode or anything else. We really only need to produce one code
generator implementation immediately, and the rest can be produced
later. The questions in L</Interpreter> apply equally here, but there
are other ones, such as: how do we arrange the representation of the ops
for maximal read-execute parallelisation? What address-code model do we
use? How do we make it easy for the user to create their own code
generators? Do we want to compile to native binaries? Can we do this
without embedding a Perl interpreter?

=head1 Run Time Library

=head2 Introduction

The run time library is the set of functionality needed to run a Perl
program, whether it's compiled to native code or whether it's being
interpreted. Everything that isn't pure assembler has a run time
library, because the run time library provides the functions which
carry out the ops. The thing that sets Perl apart from C is that the
Perl run-time library has to contain the ability to compile code on the
fly: something like C<require> means that Perl has to open a file,
compile it, turn it into an op tree, and then run that op tree. (When
we complain that compiled Perl requires you to "embed an interpreter",
we actually mean "embed a compiler" - although "the interpreter" also
implies a bunch of state and global variables that form the environment
that it all has to execute in.)

=head2 Symbol Table Handler

One thing that needs careful thought is the way Perl-side variables are
stored; global variables can be easily dealt with by putting them into a
hash, which is exactly what happens right now. Lexical variables are
more tricky, and dynamic variables with C<local> make the situation very
difficult indeed. Is there a better symbol handling model for lexical
and dynamic variables? 

=head2 Regular Expression Engine

How do we make a regular expression engine as fast as Perl 5's but
maintainable?

=head2 Opcodes

Finally, and most importantly, what op codes does the core need? What op
codes I<doesn't> the core need? What level should we be targeting - the
very high level as we have at the moment, or a lower level like Java, to
allow better machine code generation? Is it, in fact, better to have
higher level ops, because that means we can tweak the code that
implements them much more?

When we say we want the user to be able to provide their own ops, what
does this mean? What interface do we need them to provide? Does this
mean that I<all> back-ends will be tied to either a register or a stack
based model for consistency of interface? (Or otherwise the user would
have to reimplement their custom op for each back-end, which sucks hard
- how do we evade that anyway?)

=cut

-- 
Those who do not understand Unix are condemned to reinvent it, poorly.
- Henry Spencer, University of Toronto Unix hack
