This is a draft of the assembly/assembler PDD; it ought to make sense without the bytecode PDD which it often refers to. I'll write that next, and follow up with an assembler, assuming vague agreement to this and that my bloody flu clears up soon.
Comments welcome!

=head1 TITLE

The Parrot Assembly Language

=head1 VERSION

1

=head2 CURRENT

    Maintainer: Simon Cozens <[EMAIL PROTECTED]>
    PDD Number:
    Version: 1
    Status: Proposed
    Last Modified: Fri Mar 1 15:50:00 GMT 2002
    PDD Format: 1
    Language:

=head2 HISTORY

=over 4

=item version 1

None. First version

=back

=head1 CHANGES

=over 4

=item Version 1.0

None. First version

=back

=head1 ABSTRACT

This PDD defines the specifications for the Parrot assembly language; the implementation section will describe the implementation of a reference model Parrot assembler.

=head1 DESCRIPTION

The Parrot assembly language is a textual representation of a program to be run on the Parrot interpreter. It is not the only possible representation, and indeed it is hoped that most uses of Parrot will directly produce bytecode through a compiler interface. However, Parrot will ship with both an assembler and disassembler for human modification of the bytecode.

=head2 Standard Format

A typical line of assembler consists of an optional label, an optional operation followed by optional comma-delimited arguments, followed by an optional comment.

Labels are indicated as alphanumeric characters followed immediately by a colon. The colon is not taken to be part of the label. If the same label is defined more than once, the assembly is invalid, and assemblers may reject it.

Comments begin with a hash sign (#) and this marker, together with everything else on the line following the hash sign, is ignored.

=head2 Operations

Operations are specified as alphanumeric characters, and are taken from the loaded oplibs. The form of operations is the same as it is in the oplibs: C<name_t_t_t> where C<t> represents the type of the given parameter. (See the next section.)

Operations may optionally be specified in an abbreviated format, just the C<name> above; the assembler must derive the full name of the operation from the parameters specified.
For instance,

    set I1, 1

is equivalent to

    set_i_ic I1, 1

The conversion between C<set> and C<set_i_ic> is a required phase of the assembler's implementation.

=head2 Parameters

Parameters are of the following types:

=over 3

=item B<i>nteger register

=item B<i>nteger B<c>onstant

=item B<n>umeric register

=item B<n>umeric B<c>onstant

=item B<s>tring register

=item B<s>tring B<c>onstant

=item B<k>ey B<c>onstant

=item B<r>egister specifier

=back

The letters in bold face show the abbreviation for the parameter which makes up the name of the operation as described above.

In accordance with the bytecode PDD, constant parameters other than integer constants are represented in the bytecode as an integer index into the appropriate constant table; registers are represented as an integer, and register specifiers are emitted by encoding the register type in the top two bits and the register number in the bottom six bits. See the bytecode PDD for more details on this.

A label may be used as a parameter, in which case the label is to be taken as an integer constant representing the number of tokens between the C<beginning> of the current op and the site of the label's definition.

=head2 Keyed parameters

As well as the parameter types mentioned above, there are compound types based on keyed parameters, as described in the Keys PDD.

=head2 Pragmata

Lines beginning with a period (.) followed by some text are B<pragmata> and are reserved. No pragmata are currently defined.

=head2 Macros

Parrot assembly has no I<native> macro capability. Implementations are free to add their own macro capabilities, although they must B<not> do so using the pragmata syntax.

=head1 IMPLEMENTATION

A reference implementation of the Parrot assembler is provided. This section briefly discusses the operation of this assembler.
The assembler goes through the following phases in converting assembly source text to bytecode:

=over 3

=item Detach keys

=item Resolve labels

=item Expand ops

=item Separate constants

=item Emit bytecode

=back

=head2 Detach Keys

The first stage is to replace key accesses such as

    P1[S1]

with something like this:

    P1, S1

It must be remembered that in this case C<S1> is acting as a register specifier, so the function prototype would be C<..._p_r_..>. Similarly, the assembler must mark the fact that C<P1["foo"]> refers to a key constant: simply translating this to C<P1, "foo"> would cause C<"foo"> to be erroneously detected as a string constant, leading to the incorrect function prototype C<p_sc> rather than the correct C<p_kc>. The pseudo-assembly C<P1, [k:"foo"]>, C<P1, [k:S1]> is encouraged to make this distinction; however, Parrot assemblers need not accept this format as valid input.

This is done immediately for several reasons: firstly, it means that at this stage every token in the assembly represents one "token" (C<sizeof(opcode_t)> bytes) in the bytecode, and therefore calculating relative offsets for branches can be done next. It also means that constants inside key indices can be shared; furthermore, this step can be done as a preprocessing stage without affecting further stages of assembly.

=head2 Resolve Labels

The next pass over the assembly turns label references into integer constants. It does this by counting the number of tokens between the beginning of the current op and the definition of the label. For instance:

    SPIN: inc I1
          lt I1, 100, SPIN

When assembling the second line, the occurrence of C<SPIN> refers to the label which is two tokens previous to the current op, C<lt>. Hence, C<SPIN> is replaced by C<-2>.
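The label-resolution pass described above can be sketched in a few lines of Python. This is only an illustration, not the reference assembler: the names C<TOKEN>, C<parse> and C<resolve_labels> are hypothetical, and the sketch assumes that keys have already been detached (so every token is one bytecode word), that a string constant counts as a single token, and that no label shares a name with a register or op.

```python
import re

# One token is a quoted string, a comment, or a run of characters that
# contains no whitespace, comma, hash sign or quote. (Assumed syntax.)
TOKEN = re.compile(r'"(?:[^"\\]|\\.)*"|#.*|[^\s,#"]+')


def parse(source):
    """Return (ops, labels).

    ops is a list of (start_offset, tokens) pairs, one per op;
    labels maps each label name to the token offset of its definition.
    """
    ops, labels, offset = [], {}, 0
    for line in source.splitlines():
        tokens = []
        for tok in TOKEN.findall(line):
            if tok.startswith("#"):      # the rest of the line is a comment
                break
            tokens.append(tok)
        if tokens and tokens[0].endswith(":"):
            name = tokens.pop(0)[:-1]    # the colon is not part of the label
            if name in labels:
                raise ValueError(f"label {name!r} defined more than once")
            labels[name] = offset
        if tokens:
            ops.append((offset, tokens))
            offset += len(tokens)        # one token per bytecode word
    return ops, labels


def resolve_labels(source):
    """Replace label arguments by offsets relative to the start of their op."""
    ops, labels = parse(source)
    resolved = []
    for start, tokens in ops:
        op, args = tokens[0], tokens[1:]
        args = [str(labels[a] - start) if a in labels else a for a in args]
        resolved.append(op + (" " + ", ".join(args) if args else ""))
    return resolved


if __name__ == "__main__":
    for line in resolve_labels("SPIN: inc I1\n      lt I1, 100, SPIN"):
        print(line)
```

Run on the C<SPIN> example, the sketch prints C<inc I1> followed by C<lt I1, 100, -2>. It makes two passes (collect label positions, then rewrite arguments) so that forward references to labels defined later in the source resolve the same way as backward ones.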
    lt I1, 1, ERROR
    gt I1, 100, ERROR
    branch OK
    ERROR: print "Register out of bounds"
           end
    OK:

In this case, the definition of C<ERROR> is 10 tokens after the start of the first line and 6 tokens after the start of the second line, and the definition of C<OK> is 5 tokens after the start of the third line; after label resolution, this code would be equivalent to:

    lt I1, 1, 10
    gt I1, 100, 6
    branch 5
    print "Register out of bounds"
    end

Again, this stage simply translates a more human-friendly version of the assembly to a more machine-readable variant, and can be performed as a separate stage without altering the "assemblability" of the code.

=head2 Expand Ops

Now we must replace the short names of ops with their parameter-encoded variants. As described above,

    set S1, "Hello, Parrot\n"

is here replaced by

    set_s_sc S1, "Hello, Parrot\n"

The original plan was to perform the next step at this point, but then I realised that if we do it this way around, we don't have to retain information about what type the constant is, and we get rid of the whole silly C<set_s_sc S1, [s:1]> thing.

=head2 Separate Constants

This next step does alter the validity of the code; all non-integer constants, including key constants and string or floating-point constants inside of key constants, are stored in a constant table and replaced by their index into the constant table. For instance, the following code:

    set_s_sc S1, "Hello, Parrot\n"
    print_s S1
    end

is replaced with

    set_s_sc S1, 1
    print_s S1
    end

and the string C<"Hello, Parrot\n"> is saved in slot one of the string constant table.

Assemblers B<must> share constants, so that:

    set_s_sc S1, "Hello, Parrot\n"
    print_sc "Hello, Parrot\n"

is replaced with

    set_s_sc S1, 1
    print_sc 1

and a second entry in the constant table is not created.

As a reminder, non-integer constants inside key accesses are also detached at this point.
Hence, in

    set_s_p_kc S1, P1, [k:"Hello world"]

the string constant C<"Hello world"> is detached and stored in the string constant table; at the same time, the key constant itself is detached and stored in the key constant table, leaving:

    set_s_p_kc S1, P1, 1

with C<"Hello world"> as entry 1 in the string constant table, and the two bytes representing "string" and "1" stored as entry 1 in the key constant table. See the bytecode PDD for information about how to represent key constants.

=head2 Emit Bytecode

Finally, with the source in its most machine-readable format and the constant tables already populated, bytecode can be emitted, as specified in the bytecode PDD.

=head1 ATTACHMENTS

=head1 FOOTNOTES

=head1 REFERENCES

PDD 13: Bytecode

-- 
The Blit is a nice terminal, but it runs emacs.