[RFC] extension guessing, functionally better loader behavior -> working install target

Mike Mattie Sun, 13 May 2007 22:47:59 -0700

Hello,

I have been working on implementing extension guessing consistently in parrot.
These changes make parrot much more usable, robust, flexible, and maintainable.


Usable:

The current parrot implementation requires the extension to be specified.
First, what is an extension? An extension is just a few extra characters
tacked on to a path. All things being right, an extension implies a file
format.

In parrot, however, a file extension is much more. It indicates the stage
of compilation of a module. A module may have multiple stages cached on
disk.

foo.pir  <- source
foo.pbc  <- bytecode compiled

The parrot implementation is completely backwards in that the user of
module "foo" cannot simply use "foo". The user has to explicitly hardwire
which stage of compilation they want along with the module name itself.

In using parrot there is no good reason for the compilation stage to
matter. (I know about the jit issues on web-servers, it is not relevant).

In fact having this information "filter-down" from the request to load
a module has broken the install target. There are several cases where
someone does ".load_bytecode "foo.pir"" because in the working-copy
they have both foo.pbc and foo.pir. In the install tree only
foo.pbc is installed.

So parrot is not able to load code that exists on disk, because parrot
must be explicitly told the exact compilation stage along with the
module, and some compilation stages aren't always useful (intermediate)
or available.

Two behavioral rules can be formulated to solve this problem:

Rule 1. When a user requests a module, parrot will load that module using
        whatever format/loader is available (dlopen, bytecode loaders,
        compilers).

Rule 2. When a module is requested, for performance the most compiled form
        of that module will be chosen.

   This is in fact the behavior of perl5, and I think it should be
   the behavior of perl6. In fact, in discussing this on #perl6 someone
   mentioned that there is already perl5 code that relies on this
   behavior (strange?).

At this point the extension must be *dropped*. That last sentence is critical.
All of the patches submitted so far have not changed parrot's behavior in tree
because these changes do not take effect until the instances of .load_bytecode
are changed to drop the extension.

This is a perfect incremental migration situation. I have been very careful to
first try a path as-given before any extensions are appended. 
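
The "as-given first, then guess" order can be sketched in plain C. This is
only an illustration, not the code in src/library.c: guess_extension,
exists_fn and only_pbc_installed are hypothetical names, and a callback
stands in for the real file-system check so the sketch stays testable.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical predicate: non-zero when `path` exists. A real loader
 * would call stat() or similar here. */
typedef int (*exists_fn)(const char *path);

/* Try `name` exactly as given, then `name` + each extension in order.
 * Writes the first hit into `out` and returns 1; returns 0 on no match. */
static int
guess_extension(const char *name, const char *const *exts, size_t n_exts,
                exists_fn exists, char *out, size_t out_size)
{
    size_t i;

    assert(name && exists && out && out_size > 0);

    /* Step 1: the path as given always wins, so old callers keep working. */
    if (strlen(name) < out_size && exists(name)) {
        strcpy(out, name);
        return 1;
    }

    /* Step 2: append candidate extensions in preference order. */
    for (i = 0; i < n_exts; i++) {
        int len = snprintf(out, out_size, "%s%s", name, exts[i]);
        if (len > 0 && (size_t)len < out_size && exists(out))
            return 1;
    }

    out[0] = '\0';
    return 0;
}

/* Toy "install tree" that only ships the compiled form. */
static int
only_pbc_installed(const char *path)
{
    return strcmp(path, "foo.pbc") == 0;
}
```

With the extension list { ".pbc", ".pir" }, a request for "foo" resolves to
"foo.pbc" in this toy install tree, while a request for "foo.pir" still
fails. That is why the extension has to be dropped at the call site before
the guessing can help.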

If a flag day is desirable to get rid of all the extension crud in the tree,
the .pir code is so syntactically simple that a perl5 program could likely
perform the detection and the alteration of the statements, with a human
review of the changes.

At this point some parrot developers on #parrot expressed some resistance to
having compiled forms preferred over source forms. With the current behavior
they don't have to run "make clean" before re-testing changes to code.

Some could consider this pure laziness. I don't think so myself. Dynamic
languages should be dynamic. I thought this over and decided that an
environment variable was a good solution.

Rule 3:  PARROT_PREFER_SOURCE: when this environment variable is exported,
         parrot will reverse its normal preference for low-level compiled
         forms and prefer high-level source forms.

An environment variable is ideal. It's not a switch that has to be specified or
coded somewhere. It can be set permanently, or it can be exported in a shell.
It can be set before a make, or just in one shell session.

A switch may also be necessary. Environment variables are very powerful and
easy to manage under UNIX(tm)-like environments. Windows makes environment
variables harder, but not impossible. If a windows user objects to the
environment variable part of this proposal, please suggest something that
would work well. I don't want to make windows second-class.

Currently the value, if any, of PARROT_PREFER_SOURCE is ignored. If more
flexibility is required, with a sane use case, it could be enhanced to take
a path specification :foo:bar:baz: of directories where source is preferred
over compiled forms.
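
A minimal sketch of Rule 3 in plain C, assuming only the presence of the
variable matters (its value is ignored, as described above). The names
load_preference, query_load_preference and stage_order are hypothetical and
are not taken from the patches.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { PREFER_COMPILED = 0, PREFER_SOURCE = 1 } load_preference;

/* Only the presence of PARROT_PREFER_SOURCE is checked; any value,
 * including an empty one, switches the preference. */
static load_preference
query_load_preference(void)
{
    return getenv("PARROT_PREFER_SOURCE") != NULL ? PREFER_SOURCE
                                                  : PREFER_COMPILED;
}

/* Fill `order` with indices into a stage list that is sorted from the
 * lowest-level form to the highest-level form. The default walks the list
 * front to back; PREFER_SOURCE walks it back to front. */
static void
stage_order(load_preference pref, size_t n, size_t *order)
{
    size_t i;

    assert(order != NULL || n == 0);

    for (i = 0; i < n; i++)
        order[i] = (pref == PREFER_SOURCE) ? n - 1 - i : i;
}
```

For a stage list such as { ".pbc", ".pir" }, the default order tries index
0 (.pbc) first, and exporting the variable flips that to index 1 (.pir)
first.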

AFAIK this is a unique feature, and it allows parrot to be both very dynamic
and very efficient. The user has the flexibility to do both according to
their requirements.

Review:

There is an implementation available that was designed to be simple and
minimally intrusive, so people could review the features without getting
tangled in a large pile of changes.

18357:18446 src/library.c    : implement extension guessing
18454:18482 src/library.c    : PARROT_PREFER_SOURCE

Note how much is duplicated between the src/library.c implementation
of Parrot_locate_runtime_file_str and src/dynext.c get_path. It's gross.
My more intrusive patch in-progress will address the API problems.

The connection between the extensions and the install targets was written
up with much verbosity in:

http://xrl.us/vr9h

Robust:

The issue with parrot failing to load code because only a .pbc file was
available and .pir was requested was already covered. The rest of the
robustness issues have to do with internals, and I will cover them under
Maintainability.

Flexible:

I am working on making parrot more flexible by allowing languages/compilers
to have a "namespace" within the loader. 

Please do *not* tie this part to the rest. It only exists in my working-tree 
and is easily ripped out of the rest of the proposal.

This is a more speculative feature, but I think a good one. While reading
pdd21 concerning HLL name-spaces and interoperability I decided to try
the time-machine experiment.

Fast-forwarding to a future where parrot rules the earth, I see parrot
having byte-code loaders for a range of languages: java, CLR, python,
perl5, perl6, etc.

Each language has its own runtime, a set of libraries, architecture
objects (machine-code), bytecode objects, and source files. Parrot
can interpret all of these, but there is no reason to re-implement them
all from source.

If each language could have a "namespace" within the loader, then the
java runtime distributed by Sun/whoever could be used by parrot
without any collisions for the wheels that everyone has to re-invent,
like string, file, io, etc.

Rule: when a loader namespace for a language has not been defined
      the default namespace "parrot" is used. If a lookup fails
      within the parrot namespace the load fails.

RFC: I noticed compreg, and quickly scanned through HLLCompiler.
     compiler implies either a translation stage, a sequence of
     translation stages, or a language.

     Have the meanings been refined architecturally somewhere?

Basically the lib_paths global, which is currently built like this:

fixed-array[
  paths,      -> resizable array of strings
  extensions, -> resizable array of strings
                 (note how parrot already implements extension guessing)
]

becomes this:

hash keyed by namespace {

  parrot -> fixed array of loaders [
     ARCH     /* dlopen loader */     -> [ ... ]
     BYTECODE /* bytecode loaders */  -> [ ... ]
     SOURCE   /* source compilers */  -> fixed array [
                  SEARCH_PATH  -> resizable array of strings
                  SEARCH_EXT   -> resizable array of strings
              ]
  ]
}

With this new structure parrot has enough flexibility that it can construct
a search space for any language distribution, and can use them all within
the same parrot instance without collisions in the search space between
languages.
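
The namespaced table and its fallback rule can be sketched with plain C
structs instead of PMCs. This is an illustration only: loader_namespace,
search_space and find_namespace are hypothetical names, and a linear array
stands in for the hash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical loader slots, mirroring ARCH / BYTECODE / SOURCE above. */
enum { LOADER_ARCH, LOADER_BYTECODE, LOADER_SOURCE, LOADER_COUNT };

/* One search space: where to look and which extensions to try. */
typedef struct {
    const char *const *paths;
    size_t             n_paths;
    const char *const *exts;
    size_t             n_exts;
} search_space;

/* One loader namespace: a name plus a fixed array of loaders. */
typedef struct {
    const char  *name;
    search_space loaders[LOADER_COUNT];
} loader_namespace;

/* Look up the namespace for `hll`. When `hll` is NULL or unknown, fall
 * back to the default "parrot" namespace; return NULL if even that is
 * missing from the table. */
static const loader_namespace *
find_namespace(const loader_namespace *table, size_t n, const char *hll)
{
    const loader_namespace *fallback = NULL;
    size_t i;

    assert(table != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (hll != NULL && strcmp(table[i].name, hll) == 0)
            return &table[i];
        if (strcmp(table[i].name, "parrot") == 0)
            fallback = &table[i];
    }
    return fallback;
}
```

Because every namespace carries its own paths and extensions, a "java"
entry and the "parrot" entry can both provide something called "string"
without the two ever being candidates in the same search.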

It could also be used to implement binary compatibility. If "parrot" is
versioned, say as "parrot-pre", "parrot1" etc., then the loader could
support selecting a compatible version from multiple runtime installs.

Maintainability:

This issue will get a bit more involved. The parrot loader is very alpha,
i.e. put together early in the development process. It let people explore
the rest of the design space, but the need for a refactor is apparent
throughout the code and API.

First let's focus on Parrot_locate_runtime_file_str.

Current HEAD has this in library.h:

typedef enum {
    PARROT_RUNTIME_FT_LIBRARY = 0x0001,
    PARROT_RUNTIME_FT_INCLUDE = 0x0002,
    PARROT_RUNTIME_FT_DYNEXT  = 0x0004,
    PARROT_RUNTIME_FT_PBC     = 0x0010,
    PARROT_RUNTIME_FT_PASM    = 0x0100,
    PARROT_RUNTIME_FT_PIR     = 0x0200,
    PARROT_RUNTIME_FT_PAST    = 0x0400,
    PARROT_RUNTIME_FT_SOURCE  = 0x0F00
} enum_runtime_ft;

There is one valuable idea to keep from this enum:

DYNEXT, LIBRARY, INCLUDE, SOURCE

There are four basic loaders for parrot.

ARCH    : the platform loader for machine-code shared objects, aka ld
INCLUDE : macro/include pre-processing, link-editing on a translation-unit
          level
LIBRARY : bytecode loaders. parrot can support multiple bytecode loaders;
          the extension will depend on the language
SOURCE  : something compiled

These are fundamental distinctions of interpretation that are sound across
the current computing landscape. We have link-loaders (machine specific),
byte-code loaders (a link editor internal to the VM), and compilers, which
generate objects for linking. INCLUDE is a special case of SOURCE, but a
necessary one.

My new version looks like this:

/* enum_runtime_ft
 *
 * There are four basic paths for the loader.
 *
 * ARCH      : link-editor for an architecture shared object (machine code)
 * BYTECODE  : link-editor for bytecode linked into the virtual machine's
 *             op lists
 * INCLUDE   : a source form linked by a pre-processor creating
 *             translation-units for compilation
 * SOURCE    : source code compiled by the HLL framework
 *
 * These different paths for the loader are necessary to
 * resolve collisions in the library search space. For example
 * a module may have both a NCI part, and a HLL part:
 *
 * foo.so , foo.pbc
 */

typedef enum  {
    PARROT_RUNTIME_FT_ARCH     = 0x0001,
    PARROT_RUNTIME_FT_BYTECODE = 0x0002,
    PARROT_RUNTIME_FT_INCLUDE  = 0x0004,
    /* one bit per loader, so that SOURCE stays distinct from
     * BYTECODE | INCLUDE when the values are combined into a mask */
    PARROT_RUNTIME_FT_SOURCE   = 0x0008,
    PARROT_RUNTIME_FT_SIZE     = 4
} enum_runtime_ft;


By behavioral rule 1, Parrot should load whatever it can.
Parrot_locate_runtime_file_str is the routine that does the discovery of
what is available. A first cut would eliminate the distinction altogether,
pass the discovery list off to heuristic checks, and then select a loader.

However, it is essential to keep the distinction between loaders at this
level. A simple case would be sqlite or a similar db wrapper. It likely has
an ARCH component that glues the DB API to the language's NCI API. It also
has a language file that will export the interface and provide
convenience/features enhancing the DB API.

In this case loading a library (a higher level concept than .load_bytecode)
would have a collision. This scenario is not one file selected from a set
of candidates, but two.

In the scenario of the best form selected from candidates, multiple loaders
can be selected in the mask (think .pir | .pbc). In the case of more than
one loader/format being needed to completely load a module, a single loader
can be selected, eliminating legitimate collisions that would otherwise
make parts of a multiple-format module unreachable.
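
The two uses of the mask can be sketched as follows. This is a hedged
illustration, not the patch: the FT_* flags and choose_loader are
hypothetical, and the sketch assumes one distinct bit per loader, ordered
from the lowest-level form to the highest-level form.

```c
#include <assert.h>

/* Hypothetical loader flags, one bit each, lowest-level form first. */
enum {
    FT_ARCH     = 0x1,
    FT_BYTECODE = 0x2,
    FT_INCLUDE  = 0x4,
    FT_SOURCE   = 0x8,
    FT_COUNT    = 4
};

/* `available` is the set of forms found on disk, `mask` the set the
 * caller accepts. Returns the single chosen flag, or 0 when no accepted
 * form is available. */
static unsigned
choose_loader(unsigned available, unsigned mask, int prefer_source)
{
    unsigned candidates = available & mask;
    int i;

    assert((mask & ~0xFu) == 0);

    for (i = 0; i < FT_COUNT; i++) {
        /* Walk low-to-high by default, high-to-low when source is preferred. */
        int shift = prefer_source ? FT_COUNT - 1 - i : i;
        unsigned bit = 1u << shift;

        if (candidates & bit)
            return bit;
    }
    return 0;
}
```

For a module shipped as sqlite.so, sqlite.pbc and sqlite.pir, a mask of
FT_BYTECODE | FT_SOURCE picks the best language-level form, while a second
call with FT_ARCH alone reaches the shared object. Neither request can
shadow the other.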

The enumeration of PBC etc. is gone. Heuristics should be abstracted into a
different stage of loading. Each loader should provide header magic for a
common routine to implement. This is punted for now because parrot is
simple enough. I want to fix library.c first without bogging down in a new
layer.

enum_lib_paths:

This chunk below should simply not be in a header. It should be in the .c
file. Other modules need to access the information from
iglobal->lib_paths, but they should do it through functions provided in
library.c. There should also be a library.pir or something like that for
accessing the information on a parrot level.

typedef enum {
    PARROT_LIB_PATH_INCLUDE,            /* .include "foo" */
    PARROT_LIB_PATH_LIBRARY,            /* load_bytecode "bar" */
    PARROT_LIB_PATH_DYNEXT,             /* loadlib "baz" */
    PARROT_LIB_DYN_EXTS,                /* ".so", ".dylib" .. */
    /* must be last: */
    PARROT_LIB_PATH_SIZE
} enum_lib_paths;

I am already feeling the pain from the lack of insulation here. I am doing
a discovery in the rest of the tree for how this is used, more later on this.

This is the main focus of the effort.

PARROT_API STRING* Parrot_locate_runtime_file_str(Interp *, STRING *file_name,
        enum_runtime_ft);

The role is weakly defined.

<proposal>
Parrot_locate_runtime_file_str performs a search to find the best available form
of a code object.

PARROT_API STRING* Parrot_locate_runtime_file_str(Interp *,
                                                  STRING *object_name,
                                                  STRING *hll,
                                                  enum_runtime_ft *loader);

file_name is now object_name. A file name is the result of this function,
not the input.

The hll argument is the key to the HLL name-space. If the HLL name-space
does not exist or is null, the default name-space is used. The default
name-space is "parrot".

loader is passed as a pointer to a modifiable enum_runtime_ft variable. As
an argument it is a bit-mask of loaders to consider when discovering an
object file path. As a return value it is the loader chosen.

The return value is the preferred object's path, or NULL if not found. Note
that the returned path string has a hidden 0, making it suitable for direct
use in C API calls (an artifact of the previous implementation). If NULL is
returned, the value of *loader is semantically NULL: it may have been
modified, and should be reset before subsequent calls.

The object_name is first tried as given, and then by extension guessing.
Further location attempts are influenced by the search path and extension
lists in iglobals[IGLOBALS_LIB_PATHS]. These lists are examined recursively
breadth-first: by loader, by search paths, and then by extensions. The order
of examination is influenced by the PARROT_PREFER_SOURCE environment
variable. When the variable is not set, the lowest level forms of the
object will be tried first, up to the highest level, bounded by the loader
mask. When the environment variable is defined this order is reversed.
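
One possible reading of that search order, sketched in plain C with C
strings instead of STRING and PMC types. Everything here is an assumption
made for illustration: loader_space, locate and toy_exists are hypothetical
names, the as-given attempt is left out for brevity, and on failure this
sketch leaves *loader untouched.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One loader's search space; `bit` is its flag in the loader mask. */
typedef struct {
    unsigned           bit;
    const char *const *paths;
    size_t             n_paths;
    const char *const *exts;
    size_t             n_exts;
} loader_space;

typedef int (*exists_fn)(const char *path);

/* `spaces` is ordered lowest-level loader first. On entry *loader is the
 * mask of loaders to consider; on success it becomes the loader chosen
 * and `out` holds the path that was found. */
static int
locate(const char *name, const loader_space *spaces, size_t n_spaces,
       int prefer_source, unsigned *loader, exists_fn exists,
       char *out, size_t out_size)
{
    size_t step, p, e;

    assert(name && spaces && loader && exists && out && out_size > 0);

    for (step = 0; step < n_spaces; step++) {
        /* Reverse the loader order when source forms are preferred. */
        const loader_space *s =
            &spaces[prefer_source ? n_spaces - 1 - step : step];

        if (!(*loader & s->bit))
            continue;               /* loader not in the caller's mask */

        for (p = 0; p < s->n_paths; p++) {
            for (e = 0; e < s->n_exts; e++) {
                int len = snprintf(out, out_size, "%s%s%s",
                                   s->paths[p], name, s->exts[e]);
                if (len > 0 && (size_t)len < out_size && exists(out)) {
                    *loader = s->bit;
                    return 1;
                }
            }
        }
    }

    out[0] = '\0';
    return 0;
}

/* Toy working copy holding both stages of "foo". */
static int
toy_exists(const char *path)
{
    return strcmp(path, "runtime/parrot/library/foo.pbc") == 0
        || strcmp(path, "runtime/parrot/library/foo.pir") == 0;
}
```

With a mask covering both the bytecode and the source loader, the default
call returns the .pbc path and reports the bytecode loader. The same call
with prefer_source set returns the .pir path instead.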

TODO: the extension, which is actually the stage of interpretation
      contained by the format, is returned in the extension of the file.
      This should be returned as an optimization hint to heuristics.

TODO: instead of a string that is checked by stat(), a handle should be
      returned to close the classic access() race. Additional flags are
      needed for that, such as NO_TTY and other basic cross-platform
      security checks. <-- huge warning. This should be a list within the
      search spaces index.

TODO: OS IO/VM hinting. Some loaders could benefit from IO hinting such as
      mapped/streamed, use-once etc. Depends on returning a handle and open
      flags.

Current parrot behavior can be achieved by passing NULL as the hll argument,
and a mask of
PARROT_RUNTIME_FT_BYTECODE | PARROT_RUNTIME_FT_INCLUDE | PARROT_RUNTIME_FT_SOURCE
for parrot bytecode, and PARROT_RUNTIME_FT_ARCH where previously
PARROT_RUNTIME_FT_{DYNEXT,LIBRARY} was requested.
</proposal>

This change properly defines the role of Parrot_locate_runtime_file_str with
enough flexibility to make it suitable for hiding all the implementation
evil from a variety of higher level interfaces such as .include,
.load_bytecode etc.

Of particular benefit is gutting src/dynext.c:114 (get_path), which is
almost a complete duplication of Parrot_locate_runtime_file_str's algorithm
because extension guessing is implemented there. When get_path is
considered, extension guessing is not new behavior, but rather a re-factor
of existing behavior to build a single API, documented and implemented in
one place, that provides a safe/secure implementation consistent across
loaders. HLL name-spacing is a true feature on top of that re-factor.

Refactoring parrot_init_library_paths:

This re-factor can be implemented independently of the
Parrot_locate_runtime_file_str work. It completes the changes necessary in
parrot internals to get the install target to work the same as the
working-copy.

Currently parrot_init_library_paths instantiates the tables in a very
tedious way:

    paths = pmc_new(interp, enum_class_ResizableStringArray);
    VTABLE_set_pmc_keyed_int(interp, lib_paths,
            PARROT_LIB_PATH_INCLUDE, paths);
    entry = CONST_STRING(interp, "runtime/parrot/include/");
    VTABLE_push_string(interp, paths, entry);
    entry = CONST_STRING(interp, "runtime/parrot/");
    VTABLE_push_string(interp, paths, entry);
                  ...........

It generates a table of paths within the working-copy, and a table for the
install. It also has a hook for vendors to append to the default search
space. This is the crux of the working-tree and the install being the same.
Parrot_locate_runtime_file_str provides a virtual unified search space.
When people request an object such as "PGE" or "PGE/util", the burden of
hiding the difference between the paths in the two trees is hardcoded here
by hand with the parrot internal API in C.

I have ripped this out completely and replaced it with this:

#include "builtin-loader-paths.c"

void
parrot_init_library_paths(Interp *interp)
{
    PMC *iglobals, *lib_paths;

    if( query_load_prefer(interp) )
        load_prefer = PREFER_SOURCE;

    lib_paths = pmc_new(interp, enum_class_Hash);

    populate_builtin_library_paths(interp, lib_paths);

    iglobals = interp->iglobals;
    VTABLE_set_pmc_keyed_int(interp, iglobals,
                             IGLOBALS_LIB_PATHS, lib_paths);
}

I have a function with this signature that performs the traversal
of the new hll namespace'd lib_paths , creating intermediary data
structures as needed, and populating the structure.

static void
populate_search_space(Interp* interp,
                      /* the loader table for the namespace */
                      PMC* load_table,
                      enum_runtime_ft loader,

                      /* search space index */
                      enum_search_space search_space,

                      /* the entry to add */
                      STRING* entry)

The populate_builtin_library_paths function now looks like this:

    loader_table = get_load_table_for_populate(interp, lib_paths, ns);

    entry = CONST_STRING(interp, "runtime/parrot/dynext/");
    populate_search_table(interp, loader_table, PARROT_RUNTIME_FT_ARCH,
                          SEARCH_TABLE_PATH, entry);

and is contained in builtin-loader-paths.c, which is a generated source
created from an input file looking like this:

[parrot]

# note: the search ./ entries can be used to discover who has not
#       migrated to this format. by removing this entry any part
#       of the tree not using a .paths file will break.

#----------------------------------------------------------------------
# shared objects
#----------------------------------------------------------------------

loader arch

install runtime/parrot/dynext/
build lib/parrot/dynext/

dlopen load so

#----------------------------------------------------------------------
# bytecode objects
#----------------------------------------------------------------------

loader bytecode

install runtime/parrot/include/
build   lib/parrot/include/

install runtime/parrot/library/
build   lib/parrot/library/

install runtime/parrot/
build   lib/parrot/

build ./

pbc load pbc

The .ini [\w+] part is the hll name-space to populate.
"loader" specifies the current loader. Paths are marked
install or build, identifying which tree they are for. The
table can be populated from both or filtered.

The "load" lines specify extensions and associate an
interpreter with each extension. In the future it
could be possible to load all sorts of fun things
from a high level API, such as PGE grammars, resume
compilation by loading intermediate stages, and pretty
much anything else, by replacing extensions as information
with Parrot_locate_runtime_file_str returning these three
things:

path -> handle, loader, compiler/ HLL compile phase GUID.
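
The line syntax of the input file is simple enough that a classifier fits
in a few lines of C. This is a sketch only: the actual filter is a perl5
script, and line_kind and classify_line are hypothetical names. For a
"load" line the sketch keeps just the extension and drops the interpreter
name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum {
    LINE_BLANK, LINE_COMMENT, LINE_NAMESPACE, LINE_LOADER,
    LINE_INSTALL, LINE_BUILD, LINE_LOAD, LINE_UNKNOWN
} line_kind;

/* Classify one line of the paths file. The interesting word (namespace,
 * loader name, path or extension) is copied into `arg`, which must hold
 * at least 64 bytes. */
static line_kind
classify_line(const char *line, char *arg, size_t arg_size)
{
    char first[64], second[64], third[64];
    int n;

    assert(line && arg && arg_size >= 64);
    arg[0] = '\0';

    n = sscanf(line, " %63s %63s %63s", first, second, third);
    if (n <= 0)
        return LINE_BLANK;              /* empty or whitespace only */
    if (first[0] == '#')
        return LINE_COMMENT;

    if (first[0] == '[') {              /* [namespace] section header */
        size_t len = strlen(first);
        if (n == 1 && len > 2 && first[len - 1] == ']') {
            memcpy(arg, first + 1, len - 2);
            arg[len - 2] = '\0';
            return LINE_NAMESPACE;
        }
        return LINE_UNKNOWN;
    }

    if (n == 2 && strcmp(first, "loader") == 0) {
        strcpy(arg, second);
        return LINE_LOADER;
    }
    if (n == 2 && strcmp(first, "install") == 0) {
        strcpy(arg, second);
        return LINE_INSTALL;
    }
    if (n == 2 && strcmp(first, "build") == 0) {
        strcpy(arg, second);
        return LINE_BUILD;
    }
    if (n == 3 && strcmp(second, "load") == 0) {
        strcpy(arg, third);             /* "<interpreter> load <ext>" */
        return LINE_LOAD;
    }
    return LINE_UNKNOWN;
}
```

A generator would walk the file line by line, remember the most recent
LINE_NAMESPACE and LINE_LOADER values, and emit one table entry per
install, build or load line under that state.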
 
This is a really primitive perl5 filter at present.
I will convert it to a program that recurses the working-copy
and merges a set of files placed across the tree. The syntax
is a hack while the requirements are refined.

When a part of the parrot tree wants to be in the search path,
a file is created at the root of where a search space is desired.
It can contain lines like the file above. The build lines would not
be needed; they could be calculated from the file's path in the tree
relative to the working-copy root.
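
Calculating a build line from the location of a .paths file could look
roughly like this. It is a sketch under assumptions: build_dir_for is a
hypothetical helper, paths use '/' separators, and the directories in the
usage note are only examples.

```c
#include <assert.h>
#include <string.h>

/* Given the working-copy root and the full path of a .paths file inside
 * it, write the file's directory relative to the root into `out`, with a
 * trailing '/'. Returns 1 on success, 0 when the file is not under the
 * root or the result does not fit. */
static int
build_dir_for(const char *root, const char *paths_file,
              char *out, size_t out_size)
{
    size_t root_len, dir_len;
    const char *rel, *slash;

    assert(root && paths_file && out && out_size > 2);

    root_len = strlen(root);
    if (root_len == 0 || strncmp(paths_file, root, root_len) != 0)
        return 0;

    rel = paths_file + root_len;
    if (root[root_len - 1] != '/') {
        if (*rel != '/')
            return 0;               /* a sibling such as "<root>2/..." */
        rel++;
    }

    slash = strrchr(rel, '/');
    if (slash == NULL) {
        strcpy(out, "./");          /* file sits directly in the root */
        return 1;
    }

    dir_len = (size_t)(slash - rel) + 1;    /* keep the trailing '/' */
    if (dir_len >= out_size)
        return 0;
    memcpy(out, rel, dir_len);
    out[dir_len] = '\0';
    return 1;
}
```

For a root of "/src/parrot", a file at "/src/parrot/compilers/pge/.paths"
yields "compilers/pge/", and a file at "/src/parrot/.paths" yields "./",
which matches the "build ./" entry in the sample above.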

Also, with the information present in both the file's location and contents,
all of the information needed by an install target is present. The
extensions can be copied, sub-trees preserved etc. (TODO)

The extensions and the phase information could later be extended for
processing by other programs to generate HLLCompiler integration, so
that the loader aspect does not get separated. An
HLLCompiler-integration-generator may be a worthy TODO.

The potential for the file is to integrate installation, loading, and maybe
even HLLCompiler integration in a single place that can be edited with
zero knowledge of parrot internals, only architecture.

With this approach I completely ripped out

#ifdef PARROT_PLATFORM_LIB_PATH_INIT_HOOK
    PARROT_PLATFORM_LIB_PATH_INIT_HOOK(interp, lib_paths);
#endif

which was a pretty high bar for a vendor to build a search space. Now they
can edit a simple data file that gets merged when present.

I have a foo.pl currently, but it is a hack to get me tinkering with the
build system. It will probably need someone familiar with that system to
parrot'ize it.

As far as integration goes, this file is currently just #include'd in
src/library.c

A more polished approach would be to build it as a separate object. In that
way a table for the working-copy parrot, and a leaner install parrot could
be made at the linking phase.

The End. 

With the extension-guessing, and the new infrastructure for
populating the path/extension tables, it should be possible to update
the install target and get a functionally equivalent parrot.

Thanks for reading this. Comments are welcome.

A more intrusive patch is in progress. Note that the current patches I have
attached in RT, in conjunction with a deprecation of extensions, can move
things forward while I work on the more in-depth API issues.

Since my patches were going against the trunk, I need to introduce changes
incrementally. I hope this clarifies my goals sufficiently that the merit
of the changes can be fully appreciated, and addressed at a higher level
than patches.

Cheers,
Mike Mattie - [EMAIL PROTECTED]
