Hello, I have been working on implementing extension guessing consistently in parrot. These changes make parrot much more usable, robust, flexible, and maintainable.
Usable: the current parrot implementation requires the extension to be specified. First what is a extension ? An extension is just a few extra characters tacked on to a path. All things being right an extension implies a file format. In parrot however a file extension is much more. It indicates which stage of compilation for a module. A module may have multiple stages cached on disk. foo.pir <- source foo.pbc <- bytecode compiled The parrot implementation is completely backwards in that the user of module "foo" cannot simply use "foo". The user has to explicitly hardwire which stage of compilation they want along with the module name itself. In using parrot there is no good reason for the compilation stage to matter. (I know about the jit issues on web-servers, it is not relevant). In fact having this information "filter-down" from the request to load a module has broken the install target. There are several cases where someone does ".load_bytecode "foo.pir"" because in the working-copy they have both foo.pbc and foo.pir. In the install tree only foo.pbc is installed. So parrot is not able to load code that exists on disk, because parrot must be explicitly told the exact compilation stage along with the module, and some compilation stages aren't always useful (intermediate) or available. Two behavioral rules can be formulated to solve this problem: Rule 1. When a user requests a module, parrot will load that module using whatever format/loader is available. (dlopen, bytecode loaders, compilers) Rule 2. When a module is requested , for performance the most compiled form of that module will be chosen. This is in fact the behavior of perl5 , and I think it should be the behavior of perl6. In fact in discussing this on #perl6 someone mentioned that there is already perl5 code that relies on this behavior (strange?). At this point the extension must be *dropped*. That last sentence is critical. All of the patches submitted so far have not changed parrot's behavior in tree because these changes do not take effect until the instances of .load_bytecode are changed to drop the extension. This is a perfect incremental migration situation. I have been very careful to first try a path as-given before any extensions are appended. If a flag day is desirable to get rid of all the extension crud in the tree, the .pir code is so syntactically simple that a perl5 program could likely perform the detection and the alterations of the statements with a human review of the changes. At this point some parrot developers on #parrot expressed some resistance to having compiled forms preferred over source forms. With the current behavior they don't have to run "make clean" before re-testing changes to code. Some could consider this pure laziness. I don't think so myself. Dynamic languages should be dynamic. I thought this over and decided that an environment variable was a good solution. Rule 3: PARROT_PREFER_SOURCE when this environment variable is exported parrot will reverse it's normal preference for low-level compiled forms , and prefer high level source forms. An environment variable is ideal. It's not a switch that has to be specified or coded somewhere. It can be set permanently, or it can be exported in a shell. It can be set before a make, or just in one shell session. A switch may also be necessary, environment variables are very powerful and easy to manage under UNIX(tm)-like environments. Windows makes environment variables harder, but not impossible. If a windows user objects to this environment variable part of the proposal please suggest something that would work well. I don't want to make windows 2cnd class. Currently the value if any of PARROT_PREFER_SOURCE is ignored. If more flexibility is required with a sane use case it could be enhanced to take a path specification :foo:bar:baz: of directories where source is preferred over compiled forms. AFAIK this is a unique feature, and allows parrot to be both very dynamic and very efficient. The user has the flexibility to do both according to their requirements. Review: There is an implementation available that was designed to be simple, and minimally intrusive so people could review the features without getting tangled in a large pile of changes. 18357:18446 src/library.c : implement extension guessing 18454:18482 src/library.c : PARROT_PREFER_SOURCE Note how much is duplicated between the src/library.c implementation of Parrot_get_runtime_file_str and src/dynext.c get_file. It's gross. My more intrusive patch in-progress will address the API problems. the connection between the extensions and the install targets was written up with much verbosity in: http://xrl.us/vr9h Robust: The issue with parrot failing to load code because only a .pbc file was available, and .pir was requested was already covered. The rest of the robustness issues have to do with internals and I will cover that in maintenance. Flexible: I am working on making parrot more flexible by allowing languages/compilers to have a "namespace" within the loader. Please do *not* tie this part to the rest. It only exists in my working-tree and is easily ripped out of the rest of the proposal. This is a more speculative feature, but I think a good one. While reading pdd21 concerning HLL name-spaces and interoperability I decided to try the time-machine experiment. Fast-fowarding to a future where parrot rules the earth I see parrot having byte-code loaders for a range of languages: java, CLR, python, perl5, perl6, etc. Each language has it's own runtime, a set of libraries, architecture objects (machine-code) , bytecode objects, and source files. Parrot can interpret all of these but there is no reason to re-implement them all from source. If each language could have a "namespace" within the loader then the java runtime distributed by Sun/whoever could be used by parrot without any collisions for the wheels that everyone has to re-invent like string,file,io etc. Rule: when a loader namespace for a language has not been defined the default namespace "parrot" is used. If a lookup fails within the parrot namespace the load fails. RFC: I noticed compreg, and quickly scanned through HLLCompiler. compiler implies either a translation stage, a sequence of translation stages, or a language. Has the meanings been refined architecturally somewhere ? Basically the lib_paths global which is currently built like this fixed-array[ paths, -> resizable array of strings extensions, -> resizable array of strings (note how parrot already implements extension guessing) ] becomes this: hash keyed by namespace { parrot -> fixed array of loaders [ ARCH /*dlopen loader*/ -> [ ... ] BYTECODE /* bytecode loaders */ -> [ ... ] SOURCE /* source compilers */ -> fixed array [ SEARCH_PATH -> resizable array of strings SEARCH_EXT -> resizable array of strings ] } With this new structure parrot has enough flexibility that it can construct a search space for any language distribution, and can use them all within the same parrot instance without collisions in the search space between languages. It could also be used to implement binary compatibility. If "parrot" is versioned , say as "parrot-pre" "parrot1" etc then the loader could support selecting a compatible version of multiple runtime installs. Maintainability: This issue will get a bit more involved. the parrot loader is very alpha, aka put together early in the development process. It let people explore the rest of the design space but a refactor is apparent throughout the code and API. First let's focus on Parrot_locate_runtime_str. current HEAD has this library.h: typedef enum { PARROT_RUNTIME_FT_LIBRARY = 0x0001, PARROT_RUNTIME_FT_INCLUDE = 0x0002, PARROT_RUNTIME_FT_DYNEXT = 0x0004, PARROT_RUNTIME_FT_PBC = 0x0010, PARROT_RUNTIME_FT_PASM = 0x0100, PARROT_RUNTIME_FT_PIR = 0x0200, PARROT_RUNTIME_FT_PAST = 0x0400, PARROT_RUNTIME_FT_SOURCE = 0x0F00 } enum_runtime_ft; There is one valuable idea to keep from this enum: DYNEXT,LIBRARY,INCLUDE,SOURCE, there are four basic loaders for parrot. ARCH : the platform loader for machine-code shared objects. aka ld INCLUDE : macro/include pre-processing, link-editing on a translation unit level. LIBRARY : bytecode loaders. parrot can support multiple bytecode loaders, extension will depend on language. SOURCE : something compiled These are fundamental distinctions of interpretation that are sound across the current computing landscape. We have link-loaders (machine specific), byte-code loaders (link editor internal to VM), and compilers: generates objects for linking. INCLUDE is a special case for SOURCE, but necessary. my new version looks like this: /* enum_runtime_ft * * There are four basic paths for the loader. * * ARCH : link-editor for an architecture shared object (machine code) * BYTECODE : link-editor for bytecode linked into the virtual machine's * op lists * INCLUDE : a source form linked by a pre-processor creating translation-units * for compilation * SOURCE : source code compiled by the HLL framework * * These different paths for the loader are necessary to * resolve collisions in the library search space. For example * a module may have both a NCI part, and a HLL part: * * foo.so , foo.pbc */ typedef enum { PARROT_RUNTIME_FT_ARCH = 0x0001, PARROT_RUNTIME_FT_BYTECODE = 0x0002, PARROT_RUNTIME_FT_INCLUDE = 0x0004 PARROT_RUNTIME_FT_SOURCE = 0x0006, PARROT_RUNTIME_FT_SIZE = 4 } enum_runtime_ft; by behavioral rule 1 Parrot should load whatever it can. Parrot_locate_runtime_file_str is a routine that does the discovery of what is available. First cut would eliminate the distinction altogether, pass of the discovery list to heuristic checks, and then select a loader. However it is essential to keep the distinction between loaders at this level. A simple case would be sqlite or a similar db wrapper. It likely has a ARCH component that glues the DB API to the languages NCI API. It also has a language file that will export the interface and provide convenience/features enhancing the DB API. I this case loading a library ( a higher level concept than .load_bytecode ) would have a collision. This scenario is not one file selected from a set of candidates, but two. In the scenario of best form selected from candidates, multiple loaders can be selected in the mask (think .pir | .pbc ) . In the case of more than one loader/format to completely load a module a single loader can be selected eliminating legitimate collisions that would parts of a multiple-format module unreachable. The enumeration of PBC etc is gone. Heuristics should be abstracted into a different stage of loading. Each loader should provide header magic for a common routine to implement. This is punted because parrot is simple enough. I want to fix library.c first without bogging down in a new layer. enum_lib_paths: This chunk below should simply not be in a header. It should be in the .c file. Other modules need to access the information from iglobal->lib_paths, but they should do it through functions provided in library.c there should be a library.pir or something like that for accessing the information on a parrot level. typedef enum { PARROT_LIB_PATH_INCLUDE, /* .include "foo" */ PARROT_LIB_PATH_LIBRARY, /* load_bytecode "bar" */ PARROT_LIB_PATH_DYNEXT, /* loadlib "baz" */ PARROT_LIB_DYN_EXTS, /* ".so", ".dylib" .. */ /* must be last: */ PARROT_LIB_PATH_SIZE } enum_lib_paths; I am already feeling the pain from the lack of insulation here. I am doing a discovery in the rest of the tree for how this is used, more later on this. This is the main focus of the effort. PARROT_API STRING* Parrot_locate_runtime_file_str(Interp *, STRING *file_name, enum_runtime_ft); The role is weakly defined. <proposal> Parrot_locate_runtime_file_str performs a search to find the best available form of a code object. PARROT_API STRING* Parrot_locate_runtime_file_str(Interp *, STRING *object_name, STRING *hll, enum_runtime_ft *loader); file_name is now object_name. A file name is the result of this function, not the input. The hll argument is the key to the HLL name-space. If the HLL name-space does not exist or is null the default name-space is used. The default name-space is "parrot". loader is passed as a pointer to a modifiable enum_ft_loader variable. As an argument it is a bit-mask of loaders to consider when discovering a object file path. As a return value it is the loader chosen. The return value is the preferred object's path, or NULL if not found. Note that the returned path string has a hidden 0 making it suitable for direct use in C API calls (artifact of previous implementation). If NULL is returned the value of *loader is semantically NULL, possibly modified, and should be reset by subsequent calls. The object_name is first tried as given, and then by extension guessing. Further location attempts are influenced by the search path and extension lists in iglobal[IGLOBAL_LIB_PATHS]. These lists are examined recursively breadth-first, by loader, by search paths, and then extensions. The order of examination is influenced by the PARROT_PREFER_SOURCE environment variable. When the variable is not set The lowest level forms of the object will be tried up to the highest level bounded by the loader mask. When the environment variable is defined this order is reversed. TODO: the extension , which is actually the stage of interpretation contained by the format is returned in the extension of the file. This should be returned as a optimization hint to heuristics. TODO: instead of a string that is checked by stat() , a handle should be returned instead to close the classic access() race. Additional flags are needed for that such as NO_TTY and other basic cross-platform security checks. <-- huge warning. This should be a list within the search spaces index. TODO: OS IO/VM hinting. some loaders could benefit from IO hinting such as mapped/streamed, use-once etc. depends on returning a handle and open flags. current parrot behavior can be achieved by passing NULL as the hll argument, and a mask of PARROT_RUNTIME_FT_BYTECODE & PARROT_RUNTIME_FT_INCLUDE & PARROT_RUNTIME_FT_SOURCE for parrot bytecode, and PARROT_RUNTIME_FT_ARCH where previously PARROT_RUNTIME_FT_{DYNEXT,LIBRARY} was requested. </proposal> This change properly defines the role of Parrot_locate_runtime_file_str with enough flexibility to make it suitable for hiding all the implementation evil from a variety of higher level interfaces such as .include , .load_bytecode etc. Of particular benefit is gutting src/dynext.c:114 (get_path) which is almost a complete duplication of Parrot_get_runtime_file_str's algorithm because extension guessing is implemented there. When get_path is considered , extension-guessing is not new behavior , rather a re-factor of existing behavior to build a single API, documented/implemented in one place, that provides safe/secure implementation consistent across loaders. HLL name-spacing is a true feature on top of that re-factor. Refactoring parrot_init_library_paths: This re-factor can be implemented independent of the Parrot_locate_runtime_str work. This completes the changes necessary in parrot internals to get the install target to work the same as the working-copy. Currently parrot_init_library instantiates in a very tedious way paths = pmc_new(interp, enum_class_ResizableStringArray); VTABLE_set_pmc_keyed_int(interp, lib_paths, PARROT_LIB_PATH_INCLUDE, paths); entry = CONST_STRING(interp, "runtime/parrot/include/"); VTABLE_push_string(interp, paths, entry); entry = CONST_STRING(interp, "runtime/parrot/"); VTABLE_push_string(interp, paths, entry); ........... It generates a table of paths within the working-copy, and a table for the install. It also has a hook for vendors to append to the default search space. This is the crux of the working-tree and the install being the same. Parrot_locate_runtime_str provides a virtual unified search space. When people request an object such as "PGE" , or "PGE/util" the burden of hiding the difference between the paths in the two trees is hardcoded here by hand with the parrot internal API in C. I have ripped this out completely and replaced it with this: #include "builtin-loader-paths.c" void parrot_init_library_paths(Interp *interp) { PMC *iglobals, *lib_paths; if( query_load_prefer(interp) ) load_prefer = PREFER_SOURCE; lib_paths = pmc_new(interp, enum_class_Hash); populate_builtin_library_paths(interp, lib_paths); iglobals = interp->iglobals; VTABLE_set_pmc_keyed_int(interp, iglobals, IGLOBALS_LIB_PATHS, lib_paths); } I have a function with this signature that performs the traversal of the new hll namespace'd lib_paths , creating intermediary data structures as needed, and populating the structure. static void populate_search_space(Interp* interp, /* the loader table for the namespace */ PMC* load_table, enum_runtime_ft loader, /* search space index */ enum_search_space search_space, /* the entry to add */ STRING* entry) the populate_builtin_library_paths now looks like this: loader_table = get_load_table_for_populate(interp, lib_paths, ns ); entry = CONST_STRING(interp, "runtime/parrot/dynext/"); populate_search_table(interp, loader_table, PARROT_RUNTIME_FT_ARCH, SEARCH_TABLE_PATH, entry ); and is contained in builtin-loader-paths.c which is a generated source created from a input file looking like this: [parrot] # note: the search ./ entries can be used to discover who has not # migrated to this format. by removing this entry any part # of the tree not using a .paths file will break. #---------------------------------------------------------------------- # shared objects #---------------------------------------------------------------------- loader arch install runtime/parrot/dynext/ build lib/parrot/dynext/ dlopen load so #---------------------------------------------------------------------- # bytecode objects #---------------------------------------------------------------------- loader bytecode install runtime/parrot/include/ build lib/parrot/include/ install runtime/parrot/library/ build lib/parrot/library/ install runtime/parrot/ build lib/parrot/ build ./ pbc load pbc the .ini [\w+] part is the hll name-space to populate. loader specifies the current loader. Paths are either install|build identifying which tree they are for. The table can be populated from both or filtered. the "load" lines specify extensions and associates a interpreter with that extension. In the future it could be possible to load all sorts of fun things from a high level API, such as PGE grammars, resume compilation by loading intermediate stages, and pretty much anything else by replacing extensions as information with Parrot_locate_runtime_file_str returning these three things: path -> handle, loader, compiler/ HLL compile phase GUID. This is a really primitive perl5 filter at present. I will convert it to a program that recurses the working-copy and merges a set of files placed across the tree. The syntax is a hack while the requirements are refined. When a part of the parrot tree wants to be in the search path, a file is created at the root of where a search space is desired. it can contain lines the file above. the build lines would not be needed, they could be calculated from the file's path in the tree relative to the working-copy root. Also with the information present in both the file's location and contents all of the information needed by a install target is present. The extensions can be copied, sub-trees preserved etc. (TODO) The extensions and the phase information could later be extended for processing by other programs to generate HLLCompiler integration so the the loader aspect does not get separated. A HLLCompiler-integration-generator may be a worthy TODO. The potential for the file is to integrate installation,loading, and maybe even HLLCompiler integration in a single place that can be edited with zero knowledge of parrot internals, only architecture. With this approach I completely ripped out #ifdef PARROT_PLATFORM_LIB_PATH_INIT_HOOK PARROT_PLATFORM_LIB_PATH_INIT_HOOK(interp, lib_paths); #endif which was a pretty high bar for a vendor to build a search space. Now they can edit a simple data file that gets merged when present. I have a foo.pl currently but it is a hack to get me too tinkering with the build system. It will probably need someone familiar with that system to parrot'ize it. As far as integration goes, this file is currently just #include'd in src/library.c A more polished approach would be to build it as a separate object. In that way a table for the working-copy parrot, and a leaner install parrot could be made at the linking phase. The End. With the extension-guessing, and the new infrastructure for populating the path/extension tables, it should be possible to update the install target and get a functionally equivalent parrot. Thanks for reading this. Comments are welcome. More intrusive patch is in-progress. Note that the current patches I have attached in RT in conjuction with a deprecation of extensions can move things forward while I work on the more in-depth API issues. Since my patches were going against the trunk I need to introduce changes incrementally, I hope this clarifies my goals sufficiently that the merit of the changes can be fully appreciated, and addressed at a higher level than patches. Cheers, Mike Mattie - [EMAIL PROTECTED]
signature.asc
Description: PGP signature