Re: [RFC] extension guessing, functionally better loader behavior -> working install target

Mike Mattie Fri, 25 May 2007 03:12:16 -0700

Here is the first part of my response. I have dropped the parts
about re-factoring and the library.paths parts so we can get
on the same page design wise.


Here goes ....

I wanted to reply to this before you left on vacation, but Thunderbird
crashed taking several unfinished replies with it. (Fresh install, which
I hadn't yet configured to automatically save drafts.)

So, the abbreviated version...


no problem. Right now I am really hobbled with the gmail interface. This
coming Saturday I will be back in the saddle with all my files and a real
mail client.

Mike Mattie wrote:
Hello,

I have been working on implementing extension guessing consistently

in parrot.

These changes make parrot much more usable, robust, flexible, and

maintainable.


Usable:

the current parrot implementation requires the extension to be

specified. First

what is a extension ? An extension is just a few extra characters tacked on
to a path. All things being right an extension implies a file format.

In parrot however a file extension is much more. It indicates which stage
of compilation for a module. A module may have multiple stages cached on
disk.

foo.pir  <- source
foo.pbc  <- bytecode compiled

The parrot implementation is completely backwards in that the user of
module "foo" cannot simply use "foo". The user has to explicitly hardwire
which stage of compilation they want along with the module name itself.

In using parrot there is no good reason for the compilation stage to
matter. (I know about the jit issues on web-servers, it is not relevant).

In fact having this information "filter-down" from the request to load
a module has broken the install target. There are several cases where
someone does ".load_bytecode "foo.pir"" because in the working-copy
they have both foo.pbc and foo.pir. In the install tree only
foo.pbc is installed.

This can be solved by simply referencing the .pbc file and building the
PBC in the make process for a particular subsystem. Which is only to say
that automatic extension selection is an optional refinement, not a core
requirement.


For perl5 this will be a requirement. In fact I think this is where
communication of the design is breaking down (my fault).

When I set out to solve this problem I did not want to build a crutch
to hop over the problem I faced personally ; my proposal is a very
general solution that I hope encompasses the entire spectrum of
languages that parrot will support in the future. I hope that this
will result in substantial forward progress, and be in concordance
with the theme of parrot as a general solution for languages.

When I expanded my problem domain I overloaded the topic. I should
have been more clear before , but this discussion will hammer out the
scope of the proposal.

When it comes to loading source code for interpretation of any sort
there are common issues: security, OS optimization, well
understood/transparent search behavior, and configuration.

For security/optimization platforms have similar solutions with
slightly different flags/calls etc.  The OS optimization needs to
happen in C (open flags), and security (open flags, search algorithm,
input validation) needs to be implemented in one place. Verification
by review is hard enough without having to search all over the source
tree in a variety of languages (C/PIR mix would be bad) to comprehend
the implementation. This leads directly into fragmentation.

When more languages are implemented with 100% compatability with the
original platforms that the process for locating source files and
loading them can become fragemented - losing the essential sense of
parrot "cohesiveness".

This is natural because programmers tend to scan API documentation and
source code looking for what they need but do not want to implement.
Looking at the API for what it is in terms of an abstraction and
amending both their implementation plan and the API itself is far more
exhausting.

I think the implementation of dynext.c supports this. The code is
almost a complete duplication of the primary search loop in
library.c. The fundamental algorithm both share is combinitorial ;
which I have hoisted and named properly (path_concat_permutations)
in my proposed library.c implementation (2cnd posting).

This tendancy towards divergance regardless of motive and the hassle
it causes is the reason the linux kernel has resisted what they
describe as the "balkanization" of the scheduler implementation. This
is parrot, not Linux but the design principle both valuable, and
applicable here IMHO. When configuration is brought into the picture
this balkanization has real problem potential.

I Think of a common parrot behavior for languages on a VM level as a
"look and feel" issue as well. This is addressed in the new API by the
search trace diagnostic for when the search fails. When all of the
related opcodes use this API, and the diverse languages supported by
parrot rely on these opcodes parrot centric expectations of behavior
for all languages is acheivable in the loading part of a language
implementation.

So parrot is not able to load code that exists on disk, because parrot
must be explicitly told the exact compilation stage along with the
module, and some compilation stages aren't always useful (intermediate)
or available.

Two behavioral rules can be formulated to solve this problem:

Rule 1. When a user requests a module, parrot will load that module using
        whatever format/loader is available. (dlopen, bytecode

loaders, compilers)


Rule 2. When a module is requested , for performance the most compiled form
        of that module will be chosen.

   This is in fact the behavior of perl5 , and I think it should be
   the behavior of perl6. In fact in discussing this on #perl6 someone
   mentioned that there is already perl5 code that relies on this

behavior (strange?).

My take on this is that we should have two opcodes. One that tries to
work out the extension for you, and one that is quite literal-minded.
When the "smart loader" isn't sufficiently smart, the code can fall back
to the literal-minded loader.


All of my committed/proposed changes have tried the "literal" version
*first* , then tried extension guessing as a fallback. This is more
reliable since the code can over-ride the heuristic approach when
necessary , which may not possible with the "fall-back to literal"
behavior described above. Bad heuristics that prevent a literal input
from working are a most frustrating software behavior.

The "literal first" approach also preserves existing behavior
perfectly and is the reason why my changes AFAIK have not broken the
parrot tree. I am politely adamant on this particular point.

For the sake of sane migration,
load_bytecode should continue to work as it always has, and we come up
with a new name for the new opcode. (load_bytecode is a misleading name
anyway.)


.load_bytecode: I am not talking about changing the behavior of this
op-code at all for existing code. In fact much of the confusion
probably stems from the strong desire of mine to attempting to preserve
both existing behavior of this opcode and developer habits ; except the
PARROT_PREFER_SOURCE part.

That's ok because .load_bytecode is the first user for the new version
of this function. If it can't support *all* the bytecodes that
implement object-file search-spaces by searching of a list of
directories then it's broken IMHO and I need to fix it.

Eventually I would love to see load_bytecode narrowed in scope or
removed ; it is horribly overloaded. I hope that my new implementation
of Parrot_locate_runtime_str will make this easier by moving a well
abstracted part of the problem out of one op into a routine that is
more easily shared by multiple ops.

Rule 3:  PARROT_PREFER_SOURCE when this environment variable is

exported parrot

         will reverse it's normal preference for low-level compiled

forms , and

         prefer high level source forms.

An environment variable should not be used to select the behavior of
Parrot opcodes. If both behaviors are useful, then provide both as
separate opcodes.


.load_bytecode has been influenced by an environment variable, and has
for ages: PARROT_RUNTIME. I am raising it as a design issue and
properly documenting it. I do not want lapse into a nerdy pendantic
and obtuse response so I will assume you mean "no new environment
variable influences. PERL5LIB and friends is on the horizon though.

The point of PARROT_PREFER_SOURCE was simply to eliminate the
objections that stemmed from people who did not want to run make clean
im-between revisions of the code. I tried to make something useful out
of it instead of immediately rejecting those concerns.

I think this case is situation where the user may want to change the
behavior without re-compiling code: transiently , permenently, and by
"session". Environment variables are process inherited and are AFAIK
the only cross-platform configuration mechanism with this kind of
flexibility. I can start two shells , set the defaults in a shell
configuration file such as ".profile" , change it in one shell
temporarily with a single command issued once , and discard that
change by simply exiting the shell. This is do-able on windows as
well, but not as common a practice due to the weak CLI environment.

I think PARROT_PREFER_SOURCE is very nice and useful by the Larry Wall
laziness principle but I will drop it if no one else sees sufficient
benifit.

Flexible:

I am working on making parrot more flexible by allowing languages/compilers
to have a "namespace" within the loader.

Please do *not* tie this part to the rest. It only exists in my working-tree
and is easily ripped out of the rest of the proposal.

This is a more speculative feature, but I think a good one. While reading
pdd21 concerning HLL name-spaces and interoperability I decided to try
the time-machine experiment.

Fast-fowarding to a future where parrot rules the earth I see parrot
having byte-code loaders for a range of languages: java, CLR, python,
perl5, perl6, etc.

Each language has it's own runtime, a set of libraries, architecture
objects (machine-code) , bytecode objects, and source files. Parrot
can interpret all of these but there is no reason to re-implement them
all from source.

If each language could have a "namespace" within the loader then the
java runtime distributed by Sun/whoever could be used by parrot
without any collisions for the wheels that everyone has to re-invent
like string,file,io etc.

I halfway get the impression that you're working backwards here. You
want to make extensions irrelevant, but once you do that, you need some
way to distinguish between different languages, so you add the
distinctions back in as directory hierarchy.


Extensions: they are an optimization hint/feature

I never take the extension to be anything but a optimization hint.
What a file contains should be determined by inspection. That being
said I think parrot can be very slick here.

I was specifically inspired by "pheme.g": the PGE grammar for pheme. I
thought to myself why does the build system have to generate all this
intermediate junk on disk ? It clutters the build and the tree on disk
because parrot needs hand-holding at the build-system level to walk it
through the translation phases ; completely ignoring the HLL
infrastructure. Why can't parrot just see a ".g", assume until
something goes wrong that it's a grammar, use the HLLcompiler
infrastructure to run through the translation phases , and then link
it in ?

If parrot can do that , then caching the translation phases ie
compilation should just be a matter of stopping translation at a
specific phase and outputting to disk with the right extension. When
the install is performed the most compiled version is copied, and
"pheme.g" is left in the source tree.

With the whole extension guessing thing finding the "preferred
extension" is finding *the optimal first phase of translation*. The
HLLcompiler infrastructure is primed by that vital information to
produce an executable form.

For example: Say I make a directory for each of the major phases of
transation, parse, OST etc. In each directory I have the .g file,
or the .tge file. I have a "super-op" called ".load_it" . In the
core pheme I have something like ".load_it grammar/pheme".

When I am working on the grammar in the source tree I can change the
grammar file and re-run the interpreter core without re-compiling the
interpreter core - it will run through all the translation phases
every time I run it. This is nice for development.

When I am finished developing and do the build/install it will pick up
the compiled version in the install tree and use that, which is
performance optimal for a system-wide install.

This one implementation of Parrot_locate_runtime_file_str does the
discovery of what's available, aka finding the available object-file
forms and selecting the most optimal starting phase of the
translation to an internally executable form.

Per-language search space:

If parrot is going to become "one VM to rule them all" , then it will
need "one loader to load them all". Parrot_locate_runtime_file_str is
not the loader - it does the discovery of what is available to load. In
that way it is "one search implementation to discover them all".

I want to support *all* languages with Parrot_locate_runtime_file_str
- "Do it in one place". Python has it's own tree of modules
distributed along with "/usr/bin/python". So does every other
non-trivial language with an extensive library.

The ideal future I am envisioning uses parrot as a "drop in replacement"
for the interpreter, while using the existing, even compiled libraries
for those languages. That way I don't have to keep current on how the
same issues are solved in different VM's. I can focus on one VM: parrot,
and see the fixes and features propogated through all the languages
I use on a regular basis.

Examples: PERL5LIB , PYTHONLIB, ELISPLIB - they are all search spaces
for specific languages. language interoptability doesn't happen until
the loader can function correctly as defined by an individual language
; what is loaded can then be intergrated at a calling convention
level.

There is some provision to specify a custom library that is loaded when
the HLL is selected in the second argument to .HLL. It's limited, and
not really used AFAIK.


I am not aware of what you are talking about with the custom library.
I skimmed over it ;  I freely admit that I do it too :) .

Rule: when a loader namespace for a language has not been defined
      the default namespace "parrot" is used. If a lookup fails
      within the parrot namespace the load fails.

What's the distinction between loader namespace and Parrot namespace?


Calling it a loader name-space is a garbling of terms that I will
hopefully correct in a very precise way . Having the concept of
loaders disjunct to translation phases is a vital aquiescence to the
current practice of a single module, say sha1 implemented at both a
low-level and a high level or C/byte-code.

Before the relationship between a loader and a language is established
the class of loaders and their kind must be established.

fundamentum divisionis: loaders are a translation phase where linking
can be performed with minimal residual analysis of the higher level
forms.

In simplest terms a loader is a translation phase where the objects
are functionally both interchange-able and opaque to the loader.

Two examples:

The C pre-processor forms a translation unit for compilation via the
#include directive. This recursive combination of seperate files into
a single stream for syntatic anaylsis is done with only the lexing
necessary to identify CPP directives and expand them.

On the level of the link-loader the relocation fix-ups are performed
without any knowledge of the original source except the residual
symbol export/import tables. Only the addresses of load/store/branch
instructions are significant, not what the instruction semantics.

My four kinds of loader:

OS (was ARCH),   : the operating system loader
BYTECODE         : byte-code object-files for various VM's such as python,java
INCLUDE          : .PAST level include processing
SOURCE           : triggers imcc/HLLCompiler driven translation

This is not logically exhaustive - rather de-facto useful.

The OS kind is distinct because it is largely implemented external to
parrot.

The BYTECODE kind is distinct because the translation to a
internally executable form is based on a regular input format,
hence byte-code ; contrast with SOURCE.

The INCLUDE kind is distinct because the processing is deferred to
post-compilation (ie SOURCE) with parrot, and limited to imperative
style macro language, and .include directive processing.

The SOURCE kind is distinct because it triggers syntax driven context
free grammar translation via HLLCompiler? -> imcc -> bytecode.

the old enum_runtime_ft was a ad-hoc division of object/source file
types , with a bit-mask union classification as OS | SOURCE in the svn
HEAD version of library.c .

The analysis above is more clean and useful on a logical level and
more precisely defined IMHO. I hope the application of logical
vocabulary will help clarify rather than obscure the proposal.

Returning to the relationship between loader and language they
are disjunct because the language uses loaders as containers and
a library module as a whole can be implemented at different levels
for a variety of reasons such as performance , access to C level
API's, and the deferring of CPP style macro processing.

Each loader has it's own search-space. This is because the operating
system shared objects can and sometimes should be stored in trees
seperate from byte-code and source objects. Forcing a single
search-space with a common tree root for seperate object-file types
can limit privelage seperation at the operating system level. Sets
of extensions are specific to a kind of loader.

The order in which loaders are tried is important. For the sha1
example the OS level or C implementation needs to be loaded
before the bytecode level NCI wrappers are loaded. Without the
distinction between loaders in the API, and with a first match
search the byte-code part of the library module would not be
reachable.

This is realized in the HEAD version of library.c with the seperate
code paths depending on the classification OS | SOURCE . My goal
was not to remove this , rather to clean it up.

With the my new implementation of library.{ch} the kinds of loaders
are general with well defined distinctions. It is possible for both
dynext and .load_bytecode to use a well insulated
Parrot_locate_runtime_file_str simply by passing an appropriate mask.

RFC: I noticed compreg, and quickly scanned through HLLCompiler.
     compiler implies either a translation stage, a sequence of
     translation stages, or a language.

     Has the meanings been refined architecturally somewhere ?

Basically the lib_paths global which is currently built like this

fixed-array[
  paths,      -> resizable array of strings
  extensions, -> resizable array of strings (note how parrot

already implements extension guessing)

]

becomes this:

hash keyed by namespace {

  parrot -> fixed array of loaders [
     ARCH     /*dlopen loader*/       -> [ ... ]
     BYTECODE /* bytecode loaders */  -> [ ... ]
     SOURCE   /* source compilers */  -> fixed array [
                                         SEARCH_PATH  -> resizable

array of strings

                                         SEARCH_EXT   -> resizable

array of strings

  ]
}

With this new structure parrot has enough flexibility that it can

construct a search space

for any language distribution, and can use them all within the same

parrot instance without

collisions in the search space between languages.

This doesn't quite work because you have to be able to load one
language's libraries from another language. So, you need to be able to
load Python's Mail.Filter and Perl's Mail::Filter (fictional examples)
at the same time and use them both within the same program.


hash keyed by namespace -> hash keyed by language

you did not understand what I meant by "hash keyed by namespace" I
think. I should not use name-space anymore leaving that to the
pdd21 scope issues - my blunder.

I am renaming the argument: STRING* hll -> STRING* language
so things will not be so confused anymore.

But with that example you understand why I want a "per language
search-space". Hopefuly we are on the same page now.

The current implementation supports loading them both, that is a
primary goal. Using them both at the same time is a namespace issue
addressed by pdd21 which is far beyond the scope of the loader.

Another point of confusion is that I had a hard time finding where
parrot encapsulated the concept of a "language". When looking I found
HLLCompiler which seemed to define a language as a sequence of
translation phases. This is the most useful definition of a language
from the perspective of parrot as far as I can see. At this moment
for me HLL = language.

What I want to do is have Parrot_locate_runtime_file_str look
by object_name, by language, and return what is found within
the masked loaders.

That way you can search for a object, in the python language, first by
shared objects, then by bytecode objects etc. Having a mask for the
loaders also allows the loading to be split or combined by opcodes at
a higher level of design ; a level that I am not informed to comment
on yet.

What I really need to know is how to translate this "language" string
into a value that will key the translation/loading machinery for
parrot. All of the SOURCE level loading should be a sequence of translation
phases starting with the best available object-file form and ending with
imcc.

The directories on disk correspond to the Parrot namespace of the
libraries as a convention. You could potentially optimize the loading
operation by having a load of a Python module only search the Python HLL
directory. But, a user-defined module might not follow the convention.


The fact that a user defined module may not follow the convention is
not a problem. With my search algorithm the standard library locations
will be searched first so that trusted implementations are always
preferred.  If nothing is found in the standard library locations the
object_name is then tried as a path relative to the current working
directory. I have mentioned things like PERL5LIB. I can easily
implement that sort of thing but that is not integral to the core
design, merely a mechanism to extend the search-space for a particular
language. Considering parrot's current limitations I don't think this
is an immediate priority if the design is okay.

Similarly, there is a convention (not entirely consistent) that foo.pbc
is the compiled form of foo.pir, but that's not always the case, and
certainly not required.


I agree, that is why I am firm on extensions being an optimization
hint.  In particular the "use v6;" is always present in my
consideration of how loading needs to be designed and implemented.

It could also be used to implement binary compatibility. If

"parrot" is versioned , say

as "parrot-pre" "parrot1" etc then the loader could support

selecting a compatible version

of multiple runtime installs.

What you haven't addressed (and what I consider the most important
problem to solve for library loading), is a mechanism for extending
Parrot's search path.


I think I have extended it quite far :) Joking aside you are right
if you are talking about a .pir wrappers to manipulate the search-space
data structures. Again this is not essential to the core-design and
can be added after a working concencus is reached.

If you are talking about supporting multiple languages This is why
I want:

* per language search spaces , was STRING* hll, now STRING* language to
 clear up the confusion.

* extension guessing ( doesn't matter what language provides the
functions anymore
                      unless you care. but that is handled at a
higher level, at
                      some point it needs to be explicit, and it is for
                      Parrot_locate_runtime_file_str )

* integrate with HLLCompiler infrastructure = intergrate with whatever
 encapsulates the machinery for a language.

If that were defined, then versioning would be a simple matter of
selecting an appropriate search path.


lets call it search-space , and per-language search space so I can
stop confusing you :) Versioning in the language key is simply a cheap
side-benefit I thought of.

Maintainability:

This issue will get a bit more involved. the parrot loader is very

alpha, aka put

together early in the development process. It let people explore

the rest of the design

space but a refactor is apparent throughout the code and API.

This section is a mixture of code refactor ideas and architecture ideas.
Would be simpler to process the two separately, but I'll take a stab.


- Show quoted text -

I think we need to first get on the same page by agreeing on terms
before drilling down much further. I apologize for the instances where
I have confused you since we are clearly articulating the same goals,
but not communicating in architecture terms clearly.

It would really help if there was a place in parrot that encapsulated
the entire scope of a language. I discovered .compreg as the closest
thing I could find to something like that. I also assumed pdd21 was
where I would find the architecture of that encapsulation.

I also wonder if you had the time to look at the second draft of
library.c . I noticed that you are very familiar with the current
implementation. As far as my new implementation I don't think
you had time to review the code yet.

I make a particular point of writing code that is readable. I value
highly a discussion at the design level, but if the code is not the
optimal way to understand the design then I am disappointed in how I
wrote the code. I consider a critique of the code transparency to be
as important as the design. In particular the "tagging" of where the
steps of the algorithm is implemented was an attempt to make review
easier.

There are parts of the code that I can now re-factor with clarity
using a more consistent definition of terms. I will do that within the
next couple of days. This response should be enough to ponder for a bit.

I hope at the least that you see that I am working on the general
loading libraries for multiple languages problem, and that I have a
design that is being refined by this process towards a good
implementation.

Re: [RFC] extension guessing, functionally better loader behavior -> working install target

Reply via email to