Hi,
The current format of the debug segment in Parrot packfiles (.pbc files), as
documented in doc/parrotbyte.pod, only allows for a single source file to be
named. This has been insufficient ever since .include directives were
introduced; it also means that there's nothing sensible that pbc_merge can
do with the debug segments it finds in its input files.
WHAT WE HAVE NOW
Currently, we store two things:-
1) The filename of a single source file, as an additional field in the
header
2) The line number in the source file for each bytecode instruction, as the
segment's opcode stream
WHAT SOURCE?
The debug segment as we currently have it relates to PIR and PASM source
files, not to high level language source files. Currently PIR parses a
directive that looks like this:
#line 'filename'
This is for compilers to supply the line numbers and file names of HLL
source files. Currently, nothing is done with these directives after they
are parsed, but the data they provide should go into a separate HLL debug
segment.
As the needs of the PASM/PIR debug segments and the HLL debug segments would
seem to be the same, this proposal will detail a single format that should
work for both of them. If it is determined that the HLL debug segment needs
something more sophisticated, this proposal still stands for the PASM/PIR
debug segment.
SOURCE SEGMENTS
This is currently mentioned in parrotbyte.pod; the idea would seem to be
that this segment can contain source code. I suspect the intention of it
was to store the source code of high level languages rather than PASM or
PIR. I think the doc is correct in stating that this segment is currently
unused. However, it will likely be used in the future, so it makes sense to
take it into account now while redesigning the debug segment(s).
FORMAT PROPOSAL
The aims of the new format, intended for both the PASM/PIR debug segment and
the HLL debug segment, are:
1) Supporting multiple input files
2) Allowing for a reference into the source segment in place of a filename
3) Still being space-efficient on disk
The opcode stream will contain one line number per bytecode instruction. No
information as to what file that line is in will be stored in this stream.
(This is pretty much the same as what we have now.)
The header (after the standard stuff that every header has) will start with
a count of the number of source file to bytecode position mappings that are
in the header.
0 (relative)
+----------+----------+----------+----------+
|   number of source => bytecode mappings   |
+----------+----------+----------+----------+
A source to bytecode position mapping simply states that the bytecode
starting at the specified offset and running up to the offset in the next
mapping (or, if there is no next mapping, up to the end of the bytecode) has
its source in location X.
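To make the range rule concrete, here is a rough Python sketch of how a reader might find the mapping that covers a given bytecode offset. The in-memory form (a sorted list of (offset, source) pairs), the file names and the function name find_source are all made up for illustration; none of them are part of the proposal or of Parrot.

```python
from bisect import bisect_right

# Hypothetical in-memory form of the header: (bytecode_offset, source)
# pairs, already sorted by offset. The names are illustrative only.
mappings = [(0, "main.pir"), (40, "lib.pir"), (95, "main.pir")]


def find_source(mappings, offset):
    """Return the source covering `offset`, or None if no mapping does."""
    starts = [start for start, _ in mappings]
    # Index of the last mapping whose start offset is <= offset.
    i = bisect_right(starts, offset) - 1
    if i < 0:
        return None  # offset lies before the first mapping
    # The mapping stays in force until the next one starts, or until the
    # end of the bytecode if it is the last one.
    return mappings[i][1]
```

Because each mapping implicitly ends where the next one begins, no end offset or length needs to be stored, which helps with the space-efficiency aim.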
A mapping always starts with the offset in the bytecode, followed by the
type of the mapping.
0 (relative)
+----------+----------+----------+----------+
|              bytecode offset              |
+----------+----------+----------+----------+
4
+----------+----------+----------+----------+
|               mapping type                |
+----------+----------+----------+----------+
There are 3 mapping types.
Type 0 means there is no source available for the bytecode starting at the
given offset. No further data is stored with this type of mapping; the next
mapping continues immediately after it.
Type 1 means the source is available in a file. A null-terminated string
containing the filename follows.
Type 2 means the source is available in a source segment. Another integer
follows, which will specify which source file in the source segment to use.
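As a sanity check on the layout, the following Python sketch writes and reads the mapping part of the header. It is only a sketch under several assumptions that the proposal does not settle: integers are taken to be 4 bytes (as the diagrams suggest) and little-endian, filenames are encoded as UTF-8, and no padding follows the filename string. The function and constant names are hypothetical.

```python
import struct

# Mapping types as described above; the constant names are invented here.
NO_SOURCE, SOURCE_FILE, SOURCE_SEGMENT = 0, 1, 2


def pack_mappings(mappings):
    """Pack (offset, type, data) tuples; data is None, a filename or an int."""
    out = [struct.pack("<I", len(mappings))]  # count of mappings
    for offset, mtype, data in mappings:
        out.append(struct.pack("<II", offset, mtype))
        if mtype == SOURCE_FILE:
            # Assumption: UTF-8 filename, one terminating zero byte, no padding.
            out.append(data.encode("utf-8") + b"\0")
        elif mtype == SOURCE_SEGMENT:
            out.append(struct.pack("<I", data))  # index into the source segment
        elif mtype != NO_SOURCE:
            raise ValueError(f"unknown mapping type {mtype}")
    return b"".join(out)


def unpack_mappings(buf):
    """Inverse of pack_mappings for a bytes object."""
    (count,) = struct.unpack_from("<I", buf, 0)
    pos = 4
    result = []
    for _ in range(count):
        offset, mtype = struct.unpack_from("<II", buf, pos)
        pos += 8
        if mtype == NO_SOURCE:
            data = None  # nothing further is stored for this type
        elif mtype == SOURCE_FILE:
            end = buf.index(b"\0", pos)
            data = buf[pos:end].decode("utf-8")
            pos = end + 1
        elif mtype == SOURCE_SEGMENT:
            (data,) = struct.unpack_from("<I", buf, pos)
            pos += 4
        else:
            raise ValueError(f"unknown mapping type {mtype}")
        result.append((offset, mtype, data))
    return result


example = [
    (0, SOURCE_FILE, "foo.pir"),
    (120, NO_SOURCE, None),
    (200, SOURCE_SEGMENT, 3),
]
packed = pack_mappings(example)
```

Since the records are variable-length, a reader has to walk them in order; it cannot jump straight to the Nth mapping without parsing the ones before it.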
Note that the mappings must be stored in ascending order of bytecode offset;
a mapping for offset 100 cannot follow a mapping for offset 200, for
example.
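A tool reading or merging segments could check this constraint cheaply. The sketch below is one possible check, not an agreed rule: the name check_offsets is hypothetical, and rejecting two mappings with the same offset is an extra assumption, since the proposal only rules out offsets that go backwards.

```python
def check_offsets(offsets):
    """Raise ValueError unless the mapping offsets are strictly ascending."""
    for prev, cur in zip(offsets, offsets[1:]):
        # Assumption: equal offsets are rejected too, because the earlier
        # mapping would then cover no bytecode at all.
        if cur <= prev:
            raise ValueError(
                f"mapping for offset {cur} cannot follow mapping for offset {prev}"
            )
```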
COMPATIBILITY
This change is incompatible with the current debug segment format. But
that's OK, we're still in development.
Comments on this would be very welcome, even if it's as simple as "looks OK
to me" or "looks terrible to me". :-)
Thanks,
Jonathan