Assembler PDD

Simon Cozens Fri, 08 Mar 2002 14:52:44 -0800

This is a draft of the assembly/assembler PDD; it ought to make sense
without the bytecode PDD which it often refers to. I'll write that next,
and follow up with an assembler, assuming vague agreement to this and that
my bloody flu clears up soon.


Comments welcome!

=head1 TITLE

The Parrot Assembly Language

=head1 VERSION

1

=head2 CURRENT

    Maintainer: Simon Cozens <[EMAIL PROTECTED]>
    PDD Number: 
    Version: 1
    Status: Proposed
    Last Modified: Fri Mar  1 15:50:00 GMT 2002
    PDD Format: 1
    Language:

=head2 HISTORY

=over 4

=item version 1

None. First version

=back

=head1 CHANGES

=over 4

=item Version 1.0

None. First version

=back

=head1 ABSTRACT

This PDD defines the specifications for the Parrot assembly language;
the implementation section will describe the implementation of a
reference model Parrot assembler.

=head1 DESCRIPTION

The Parrot assembly language is a textual representation of a program to
be run on the Parrot interpreter. It is not the only possible
representation, and indeed it is hoped that most uses of Parrot will
directly produce bytecode through a compiler interface. However, Parrot
will ship with both an assembler and disassembler for human modification
of the bytecode.

=head2 Standard Format

A typical line of assembler consists of an optional label, an optional
operation followed by optional comma-delimited arguments, followed by an
optional comment.

Labels are indicated as alphanumeric characters followed immediately by
a colon. The colon is not taken to be part of the label. If the same
label is defined more than once, the assembly is invalid, and assemblers
may reject it.

Comments begin with a hash sign (#); the hash sign and everything
following it on the line are ignored.
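
As a rough illustration of the line format described above, here is a minimal Python sketch that splits one line into its label, operation and arguments. The helper names C<split_outside_quotes> and C<parse_line> are hypothetical, and the handling of hash signs and commas inside double-quoted string constants is an assumption: the text above does not say how quoting interacts with comments.

```python
import re

def split_outside_quotes(text, sep):
    """Split text on sep, ignoring separators inside double-quoted strings."""
    parts, current = [], []
    in_string = escaped = False
    for ch in text:
        if in_string:
            current.append(ch)
            if escaped:
                escaped = False          # this character was escaped
            elif ch == '\\':
                escaped = True           # next character is escaped
            elif ch == '"':
                in_string = False        # closing quote
        elif ch == '"':
            in_string = True
            current.append(ch)
        elif ch == sep:
            parts.append(''.join(current))
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current))
    return parts

def parse_line(line):
    """Return (label, op, args); label and op are None when absent."""
    # Drop the comment: everything from the first '#' outside a string.
    code = split_outside_quotes(line, '#')[0].strip()
    label = None
    # A label is alphanumeric characters followed immediately by a colon.
    m = re.match(r'([A-Za-z0-9]+):', code)
    if m:
        label = m.group(1)               # the colon is not part of the label
        code = code[m.end():].strip()
    if not code:
        return (label, None, [])
    pieces = code.split(None, 1)
    op = pieces[0]
    args = []
    if len(pieces) > 1:
        args = [a.strip() for a in split_outside_quotes(pieces[1], ',')]
    return (label, op, args)
```

For example, C<parse_line('SPIN: inc I1  # bump')> yields the label C<SPIN>, the op C<inc> and the single argument C<I1>.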

=head2 Operations

Operations are specified as alphanumeric characters, and are taken
from the loaded oplibs. The form of operations is the same as it is
in the oplibs: C<name_t_t_t> where C<t> represents the type of the
given parameter. (See the next section.)

Operations may optionally be specified in an abbreviated format, just
the C<name> above; the assembler must derive the full name of the
operation from the parameters specified. For instance, 

    set I1, 1

is equivalent to

    set_i_ic I1, 1

The conversion between C<set> and C<set_i_ic> is a required phase of
the assembler's implementation.

=head2 Parameters

Parameters are of the following types:

=over 3

=item B<i>nteger register

=item B<i>nteger B<c>onstant

=item B<n>umeric register

=item B<n>umeric B<c>onstant

=item B<s>tring register

=item B<s>tring B<c>onstant

=item B<k>ey B<c>onstant

=item B<r>egister specifier

=back

The letters in bold face show the abbreviation for the parameter which
makes up the name of the operation as described above. In accordance
with the bytecode PDD, constant parameters other than integer constants
are represented in the bytecode as an integer index into the appropriate
constant table; registers are represented as an integer, and register
specifiers are emitted by encoding the register type in the top two bits
and the register number in the bottom six bits. See the bytecode PDD for
more details on this.
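
The register specifier layout described above (type in the top two bits, number in the bottom six) can be sketched in Python as follows. The numeric type codes in C<REGISTER_TYPE_CODES> are purely hypothetical placeholders; the real values belong to the bytecode PDD and are not given here.

```python
# Hypothetical type codes, for illustration only: the actual values are
# defined by the bytecode PDD, not by this sketch.
REGISTER_TYPE_CODES = {'I': 0, 'N': 1, 'S': 2, 'P': 3}
REGISTER_TYPE_NAMES = {code: name for name, code in REGISTER_TYPE_CODES.items()}

def encode_register_specifier(name):
    """Pack a register name such as 'S1' into a single byte value."""
    reg_type, number = name[0], int(name[1:])
    if reg_type not in REGISTER_TYPE_CODES:
        raise ValueError("unknown register type: %r" % reg_type)
    if not 0 <= number < 64:
        raise ValueError("register number does not fit in six bits: %d" % number)
    # Type goes in the top two bits, number in the bottom six bits.
    return (REGISTER_TYPE_CODES[reg_type] << 6) | number

def decode_register_specifier(value):
    """Inverse of encode_register_specifier."""
    return "%s%d" % (REGISTER_TYPE_NAMES[value >> 6], value & 0x3F)
```

Under these assumed type codes, C<encode_register_specifier('S1')> gives 129, that is, binary C<10 000001>.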

A label may be used as a parameter, in which case the label is to be
taken as an integer constant representing the number of tokens
between the I<beginning> of the current op and the site of the label's
definition.

=head2 Keyed parameters

As well as the parameter types mentioned above, there are compound
types based on keyed parameters, as described in the Keys PDD.

=head2 Pragmata

Lines beginning with a period (.) followed by some text are B<pragmata>
and are reserved. No pragmata are currently defined.

=head2 Macros

Parrot assembly has no I<native> macro capability. Implementations are free 
to add their own macro capabilities, although they must B<not> do so using
the pragmata syntax.

=head1 IMPLEMENTATION

A reference implementation of the Parrot assembler is provided. This
section briefly discusses the operation of this assembler.

The assembler goes through the following phases in converting assembly source
text to bytecode:

=over 3

=item Detach keys

=item Resolve labels

=item Expand ops

=item Separate constants

=item Emit bytecode

=back

=head2 Detach Keys

The first stage is to replace key accesses such as

    P1[S1]

with something like this:

    P1, S1

It must be remembered that in this case C<S1> is acting as a register
specifier, so the function prototype would be C<..._p_r_...>. Similarly,
the assembler must mark the fact that C<P1["foo"]> refers to a key
constant: simply translating this to C<P1, "foo"> would cause C<"foo">
to be erroneously detected as a string constant, leading to the
incorrect function prototype C<p_sc> rather than the correct C<p_kc>.
The pseudo-assembly C<P1, [k:"foo"]>, C<P1, [k:S1]> is encouraged to
make this distinction; however, Parrot assemblers need not accept this
format as valid input.

This is done first for several reasons. Firstly, it means that from this
stage on every token in the assembly corresponds to exactly one "token"
(C<sizeof(opcode_t)> bytes) in the bytecode, so relative offsets for
branches can be calculated next. It also means that constants inside key
indices can be shared. Finally, this step can be done as a preprocessing
stage without affecting further stages of assembly.
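
A minimal Python sketch of this preprocessing step is shown below. It rewrites key accesses into the C<[k:...]> pseudo-assembly suggested above. The function name C<detach_keys> and the regular expression are illustrative assumptions, and the sketch deliberately ignores awkward cases such as a closing bracket inside a string key or key-like text inside a string constant.

```python
import re

# A register name (one capital letter plus digits) followed by a bracketed key.
# Assumption: keys contain no ']' characters.
KEY_ACCESS = re.compile(r'\b([A-Z]\d+)\[([^\]]*)\]')

def detach_keys(line):
    """Rewrite REG[key] as 'REG, [k:key]' so each piece becomes its own token."""
    return KEY_ACCESS.sub(r'\1, [k:\2]', line)
```

So C<detach_keys('set S1, P1["foo"]')> produces C<set S1, P1, [k:"foo"]>, and a line without key accesses is returned unchanged.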

=head2 Resolve Labels

The next pass over the assembly turns label references into integer
constants. It does this by counting the number of tokens between the
beginning of the current op and the definition of the label. For
instance:

    SPIN: inc I1
    lt I1, 100, SPIN

When assembling the second line, the occurrence of C<SPIN> refers to
the label which is two tokens previous to the current op, C<lt>. Hence,
C<SPIN> is replaced by C<-2>.

    lt I1, 1, ERROR
    gt I1, 100, ERROR
    branch OK
    ERROR: print "Register out of bounds"
    end
    OK:

In this case, the definition of C<ERROR> is 10 tokens after the start of
the first line and 6 tokens after the start of the second line, and the
definition of C<OK> is 5 tokens after the start of the C<branch>; after
label resolution, this code would be equivalent to:

    lt I1, 1, 10
    gt I1, 100, 6
    branch 5
    print "Register out of bounds"
    end

Again, this stage simply translates a more human-friendly version of
the assembly to a more machine-readable variant, and can be performed
as a separate stage without altering the "assemblability" of the code.
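
The token counting can be sketched as a two-pass routine in Python. This is only an illustration under stated assumptions: the input is a list of hypothetical C<(label, op, args)> tuples produced by some earlier parsing step, keys have already been detached so that the op and each argument count as one token, and any argument whose text equals a defined label is treated as a label reference.

```python
def resolve_labels(lines):
    """Replace label arguments with token offsets relative to the op's start.

    lines is a list of (label, op, args) tuples; label and op may be None.
    Returns a list of (op, args) tuples with the labels removed.
    """
    # First pass: record the token position at which each label is defined.
    positions = {}
    offset = 0
    for label, op, args in lines:
        if label is not None:
            if label in positions:
                raise ValueError("label %r defined more than once" % label)
            positions[label] = offset
        if op is not None:
            offset += 1 + len(args)      # the op itself plus one token per argument
    # Second pass: rewrite each label reference as (definition - start of op).
    resolved = []
    offset = 0
    for label, op, args in lines:
        if op is None:
            continue                     # a line holding only a label emits nothing
        new_args = [str(positions[a] - offset) if a in positions else a
                    for a in args]
        resolved.append((op, new_args))
        offset += 1 + len(args)
    return resolved
```

Fed the C<SPIN> loop above as C<[('SPIN', 'inc', ['I1']), (None, 'lt', ['I1', '100', 'SPIN'])]>, the routine rewrites the last argument of C<lt> to C<-2>.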

=head2 Expand Ops

Now we must replace the short names of ops with their parameter-encoded
variants. As described above,

    set S1, "Hello, Parrot\n"

is here replaced by

    set_s_sc S1, "Hello, Parrot\n"

The original plan was to perform the next step at this point, but then I
realised that if we do it this way around, we don't have to retain
information about what type the constant is, and we get rid of the whole
silly C<set_s_sc S1, [s:1]> thing.
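
One way to derive the full op name is to classify each argument and append the type abbreviations to the short name, as in the Python sketch below. The names C<param_type> and C<expand_op> are hypothetical, the treatment of C<[k:...]> arguments is one reading of the Detach Keys section, and a real assembler would also check the result against the loaded oplibs, which this sketch does not do.

```python
import re

STRING = r'"(?:[^"\\]|\\.)*"'            # double-quoted string with backslash escapes
REGISTER = r'[INSP]\d+'

def param_type(arg):
    """Return the type abbreviation used in full op names for one argument."""
    if re.fullmatch(REGISTER, arg):
        return arg[0].lower()            # i, n, s or p register
    if re.fullmatch(r'[+-]?\d+', arg):
        return 'ic'                      # integer constant
    if re.fullmatch(r'[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?', arg):
        return 'nc'                      # numeric constant
    if re.fullmatch(STRING, arg):
        return 'sc'                      # string constant
    if re.fullmatch(r'\[k:' + STRING + r'\]', arg):
        return 'kc'                      # key constant (assumed [k:"..."] form)
    if re.fullmatch(r'\[k:' + REGISTER + r'\]', arg):
        return 'r'                       # register specifier used as a key
    raise ValueError("cannot classify argument: %r" % arg)

def expand_op(op, args):
    """Turn a short op name into its parameter-encoded variant."""
    if not args:
        return op
    suffix = '_' + '_'.join(param_type(a) for a in args)
    # Leave names that are already fully specified untouched.
    return op if op.endswith(suffix) else op + suffix
```

With this sketch, C<expand_op('set', ['I1', '1'])> gives C<set_i_ic>, and passing C<set_i_ic> itself with the same arguments leaves the name as it is.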

=head2 Separate Constants

Unlike the previous stages, this step does alter the validity of the
code as assembly source: all non-integer constants, including key
constants and string or floating-point constants inside key constants,
are stored in a constant table and replaced by their index into that
table. For instance, the following code:

    set_s_sc S1, "Hello, Parrot\n"
    print_s S1
    end

is replaced with 

    set_s_sc S1, 1
    print_s S1
    end

and the string C<"Hello, Parrot\n"> is saved in slot one of the string
constant table. Assemblers B<must> share constants, so that:

    set_s_sc S1, "Hello, Parrot\n"
    print_sc "Hello, Parrot\n"

is replaced with

    set_s_sc S1, 1
    print_sc 1

and a second entry in the constant table is not created.

As a reminder, non-integer constants inside key accesses are also
detached at this point. Hence, in

    set_s_p_kc S1, P1, [k:"Hello world"]

the string constant C<"Hello world"> is detached and stored in the
string constant table; at the same time, the key constant itself is
detached and stored in the key constant table, leaving:

    set_s_p_kc S1, P1, 1

with C<"Hello world"> as entry 1 in the string constant table, and
the two bytes representing "string" and "1" stored as entry 1 in the
key constant table. See the bytecode PDD for information about how to
represent key constants.
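
The constant sharing requirement can be sketched in Python with a small interning table. Everything named here (C<ConstantTable>, C<separate_constants>, the C<('string', index)> tuple standing in for a key constant) is a hypothetical illustration; the real layout of the tables is a matter for the bytecode PDD. The sketch assumes ops already carry their full names, reads the parameter types off the end of the op name, numbers slots from one as in the examples above, and only handles keys whose content is a string constant.

```python
class ConstantTable:
    """Append-only table handing out one slot per distinct constant."""

    def __init__(self):
        self.entries = []
        self._slots = {}

    def intern(self, value):
        """Return the slot for value, adding it only if it is not present."""
        if value not in self._slots:
            self.entries.append(value)
            self._slots[value] = len(self.entries)   # slots count from 1
        return self._slots[value]

def separate_constants(ops):
    """Replace non-integer constants in (op, args) pairs with table indices."""
    strings, numerics, keys = ConstantTable(), ConstantTable(), ConstantTable()
    out = []
    for op, args in ops:
        # The last len(args) components of the full op name are the types.
        kinds = op.split('_')[-len(args):] if args else []
        new_args = []
        for kind, arg in zip(kinds, args):
            if kind == 'sc':
                arg = str(strings.intern(arg))
            elif kind == 'nc':
                arg = str(numerics.intern(float(arg)))
            elif kind == 'kc':
                inner = arg[len('[k:'):-1]           # text between '[k:' and ']'
                # Assumption: the key holds a string constant.
                arg = str(keys.intern(('string', strings.intern(inner))))
            new_args.append(arg)
        out.append((op, new_args))
    return out, strings, numerics, keys
```

Running this over the two-op C<set_s_sc> / C<print_sc> example leaves a single entry in the string table, with both ops referring to slot 1.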

=head2 Emit Bytecode

Finally, with the source in its most machine-readable format and the
constant tables already populated, bytecode can be emitted, as specified
in the bytecode PDD.

=head1 ATTACHMENTS

=head1 FOOTNOTES

=head1 REFERENCES

PDD 13: Bytecode

-- 
The Blit is a nice terminal, but it runs emacs.
