Re: To get things started...
In message <[EMAIL PROTECTED]> Dan Sugalski <[EMAIL PROTECTED]> wrote: > At 10:18 AM 11/21/00 -0800, Benjamin Stuhl wrote: > > >Well, it would (IMHO) make more sense to have > >perl6_parse_script (I do tend to follow > >{subsystem,verb,object} naming...) > > Or Perl$parse_script, but that's a matter of taste, I suppose. :) Given that it isn't a valid C identifier, yes... Unless you're using VAXC or DECC of course, which was your point I assume ;-) Tom -- Tom Hughes ([EMAIL PROTECTED]) http://www.compton.nu/ ...Discoveries are made by not following instructions.
Perl 6 paper
This coming Saturday, I'm presenting a paper on Perl 6 (the story so far) at the Australian Open Source Symposium. Is anyone interested in looking over my notes and commenting on them in the next couple of days? K. -- Kirrily 'Skud' Robert - [EMAIL PROTECTED] - http://infotrope.net/ Today is the first day of the rest of your life. Give up now.
Re: Guidelines for internals proposals and documentation
At 02:45 PM 11/17/00 +, David Grove wrote:
>Dan Sugalski <[EMAIL PROTECTED]> wrote:
>
> > At 10:19 AM 11/17/00 -0800, Ken Fox wrote:
> > >However, I don't want to see early (premature) adoption of fundamental
> > >pieces like the VM or parser. It makes sense to me to explore many
> > >possible designs and pick and choose between them. Also, if we can keep
> > >external API design separate from internal design I think we'll have
> > >more wiggle room to experiment. (That's one of the big problems with
> > >perl 5 right now.)
> >
> > That's one of the reasons I'd like to work on the APIs first. I realize
> > that doing even that will have an effect on the design of the pieces
> > behind the APIs, but we have to start somewhere.
>
>But.. but... but... we don't even have a design spec. I mean, we don't
>even know for sure what Perl 6 is going to look like for certain, inside
>or outside. Wouldn't we have to know the outside before we try to put the
>insides together?

No, not really. For the actual code we will, of course, but there's a lot
we can do now. (And a good part of the parser could still be written now,
since most of the changes will likely be reasonably trivial.)

The APIs perl presents to the world are pretty much independent of the
language. For example, we can take a good stab at the extension API
now--regardless of how the language looks, extensions will still need to
get and set scalar, hash, and array values. Perl would have to change a
*lot* for that to be no longer valid.

The API presented to embedding programs can similarly be worked on--the
fact that the language might change doesn't alter the syntax of the
run_perl_code() function (or whatever we call it).

We also do have, generally speaking, a picture of both perl (since Larry
has said we're not gutting the language entirely) and the internal
structure. I've been a bit lax in presenting that internal picture, but
I'll fix that in a little bit.

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
Re: Perl 6 paper
I would, certainly. But I also think that the group as a whole would enjoy the preview. Kirrily "Skud" Robert <[EMAIL PROTECTED]> wrote: > This coming Saturday, I'm presenting a paper on Perl 6 (the story so > far) at the Australian Open Source Symposium. > > Is anyone interested in looking over my notes and commenting on them > in the next couple of days? > > K. > > -- > Kirrily 'Skud' Robert - [EMAIL PROTECTED] - http://infotrope.net/ > Today is the first day of the rest of your life. Give up now. >
SvPV*
(I'm not sure if I've missed all the fun here before I subscribed, but
I can't find anything on the RFC list that mentions the following)

perl5 has a tangle of SvPV macros to allow C code to get a pointer
to the scalar. (or the "private", with or without the length, and
more relating to utf8 that don't even appear to be documented)

Has any thought yet been given to the API to get scalars?

Jarkko posted an idea on p5p of "Virtual Values" which would permit a
scalar to point to another scalar's buffer, rather than its own.
Currently the perl5 API assumes that you get a read-write pointer, and that
the thing it points to is "\0" terminated. This makes it hard to implement
copy on write, or to allow a pointer to a sub-length of the parent
scalar's buffer.

IIRC Ilya mailed p5p bemoaning the fact that perl's SVs use a continuous
buffer. A split-buffer representation (where a hole is allowed in the
middle of the buffer data) permits much faster replacement type operations,
as there is less copying, and you can move the hole around to suit your
needs.

So I was wondering if perl6 was going to replace SvPV* with something that
allows the caller to say whether they'd like

* read only or read write
* buffer all in one block or can cope with a hole (plus tell me where it is)
* null terminated buffer or don't care

and possibly

* data must be in utf8 or tell me what the data is in

although this might be better done as caller specifies 1 or more acceptable
encodings they could cope with, and SvPV* returns data in whatever requires
least work to translate consistent with maintaining accuracy.

In particular specifying read/write versus read only would allow perl to
treat scalars as copy-on-write which would mean things like $a=$b wouldn't
actually copy anything (wasting time and (shared) memory pages) until
either $a or $b got changed.
[I have this feeling that there's a bit of this already in sv.c, but I'm
not sure how much]

Nicholas Clark
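A minimal sketch of the read-only versus read-write distinction asked for
above, and of how it lets $a=$b share a buffer until one side is written.
None of these names (shared_buf, scalar, scalar_get, GET_READ, GET_WRITE)
come from perl5 or any perl6 proposal; they are assumptions made purely for
illustration, and a real API would hand back a const pointer for reads.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted buffer shared between scalars. */
typedef struct shared_buf {
    char  *data;   /* NUL terminated in this sketch only */
    size_t len;
    int    refs;   /* number of scalars pointing at this buffer */
} shared_buf;

typedef struct scalar { shared_buf *buf; } scalar;

enum { GET_READ = 0, GET_WRITE = 1 };   /* hypothetical access modes */

static shared_buf *buf_new(const char *src, size_t len)
{
    shared_buf *b = malloc(sizeof *b);
    if (!b) abort();
    b->data = malloc(len + 1);
    if (!b->data) abort();
    memcpy(b->data, src, len);
    b->data[len] = '\0';
    b->len  = len;
    b->refs = 1;
    return b;
}

static scalar scalar_new(const char *s)
{
    scalar sv = { buf_new(s, strlen(s)) };
    return sv;
}

/* "$a = $b": share the buffer, copy nothing yet. */
static scalar scalar_assign(const scalar *src)
{
    scalar dst = { src->buf };
    src->buf->refs++;
    return dst;
}

/* The caller states its intent; only a write request on a shared
 * buffer pays for a private copy. */
static char *scalar_get(scalar *sv, int mode)
{
    if (mode == GET_WRITE && sv->buf->refs > 1) {
        shared_buf *priv = buf_new(sv->buf->data, sv->buf->len);
        sv->buf->refs--;
        sv->buf = priv;
    }
    return sv->buf->data;
}

static void scalar_free(scalar *sv)
{
    if (--sv->buf->refs == 0) {
        free(sv->buf->data);
        free(sv->buf);
    }
    sv->buf = NULL;
}
```

After scalar_assign, two read requests return the same pointer; the first
GET_WRITE request on either scalar is what triggers the copy.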
Re: SvPV*
On Tue, Nov 21, 2000 at 05:04:32PM +, Nicholas Clark wrote:
> (I'm not sure if I've missed all the fun here before I subscribed, but
> I can't find anything on the RFC list that mentions the following)
>
> perl5 has a tangle of SvPV macros to allow C code to get a pointer
> to the scalar. (or the "private", with or without the length, and
> more relating to utf8 that don't even appear to be documented)
>
> Has any thought yet been given to the API to get scalars?
>
> Jarkko posted an idea on p5p of "Virtual Values" which would permit a
> scalar to point to another scalar's buffer, rather than its own.

That was one half, yes. The other half was that a VV would
point to a 'window' or 'slice' of the other scalar's buffer,
not necessarily the whole buffer.

> Currently the perl5 API assumes that you get a read-write pointer, and that
> the thing it points to is "\0" terminated. This makes it hard to implement
> copy on write, or to allow a pointer to a sub-length of the parent
> scalar's buffer.

What he said.

> IIRC Ilya mailed p5p bemoaning the fact that perl's SVs use a continuous
> buffer. A split-buffer representation (where a hole is allowed in the
> middle of the buffer data) permits much faster replacement type operations,
> as there is less copying, and you can move the hole around to suit your
> needs.

Yet another bummer of the current SVs is that they poorly fit into
'foreign memory' situations where the buffer is managed by something
other than Perl. "No, thank you, Perl, keep your greedy fingers off this
chunk. No, you may not play with it."

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
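The 'window' half of the Virtual Values idea can be sketched roughly as a
non-owning view: a base pointer plus an offset and a length, with no "\0"
terminator assumed. The names here (str_view, view_of, view_slice, view_ptr,
view_eq) are hypothetical and not taken from any perl source. Because the
view never allocates, frees, or writes, it would also fit the 'foreign
memory' case where something other than Perl owns the bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning window onto someone else's buffer. */
typedef struct str_view {
    const char *base;  /* start of the parent buffer (not owned) */
    size_t      off;   /* where this window begins within base */
    size_t      len;   /* window length in bytes; no terminator implied */
} str_view;

/* View covering a whole NUL-terminated C string. */
static str_view view_of(const char *s)
{
    str_view v = { s, 0, strlen(s) };
    return v;
}

/* Narrow a view to a sub-range. Returns 0 on success, -1 if the
 * requested range does not fit inside the parent view. */
static int view_slice(str_view parent, size_t off, size_t len, str_view *out)
{
    if (off > parent.len || len > parent.len - off)
        return -1;
    out->base = parent.base;
    out->off  = parent.off + off;
    out->len  = len;
    return 0;
}

/* First byte of the window; readers must honour v.len. */
static const char *view_ptr(str_view v)
{
    return v.base + v.off;
}

/* Compare a window against a C string literal, length first. */
static int view_eq(str_view v, const char *lit)
{
    size_t n = strlen(lit);
    return n == v.len && memcmp(view_ptr(v), lit, n) == 0;
}
```

Slicing offset 6, length 5 out of "hello world" yields a view whose pointer
is simply the parent's pointer plus 6; nothing is copied.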
Re: SvPV*
At 05:04 PM 11/21/00 +, Nicholas Clark wrote:
>(I'm not sure if I've missed all the fun here before I subscribed, but
>I can't find anything on the RFC list that mentions the following)
>
>perl5 has a tangle of SvPV macros to allow C code to get a pointer
>to the scalar. (or the "private", with or without the length, and
>more relating to utf8 that don't even appear to be documented)
>
>Has any thought yet been given to the API to get scalars?

Yup. The internal details will be hidden from the extension writer--if you
do a get_string(PMC, UTF_8) you get back the UTF-8 encoded version of the
scalar that PMC represents, regardless of any internal format. That way if
some scalar function writer has some need to do odd things they can without
having to worry about telling extension writers. It also means that an
extension doesn't have to care that a PMC represents, say, a complex
number--they ask for it in UTF-8 format and get back "4 + 3i" and that's
fine.

This isolation will also reduce cross-version breakage. While I'd like to
eliminate that, I doubt it's entirely feasible.

It'll be possible to get the gory details if you want them, but then you'll
have to go a step lower in the API and, well, the docs say "Here there be
dragons". Or they will, at least.

One of the things we need to hammer out on the extension API list is
exactly what sorts of things need to be generally exposed to extensions and
what don't.

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
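One way the get_string(PMC, UTF_8) isolation could look, purely as a sketch.
Only the names get_string, PMC and UTF_8 appear in the message above;
everything else here (the pmc struct layout, the to_string function pointer,
the caller-supplied output buffer, ENC_UTF_8, pmc_from_int,
pmc_from_complex) is an assumption made for illustration and not a proposed
design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef enum { ENC_UTF_8 } encoding;   /* only one encoding in this sketch */

/* Hypothetical PMC: extensions see only get_string(); each type
 * supplies its own stringifier behind the function pointer. */
typedef struct pmc pmc;
struct pmc {
    int (*to_string)(const pmc *self, encoding enc, char *out, size_t cap);
    union {
        long ival;
        struct { int re, im; } cplx;
    } u;
};

static int int_to_string(const pmc *self, encoding enc, char *out, size_t cap)
{
    (void)enc;   /* ASCII digits are already valid UTF-8 */
    return snprintf(out, cap, "%ld", self->u.ival);
}

static int complex_to_string(const pmc *self, encoding enc, char *out,
                             size_t cap)
{
    (void)enc;
    return snprintf(out, cap, "%d + %di", self->u.cplx.re, self->u.cplx.im);
}

static pmc pmc_from_int(long v)
{
    pmc p;
    p.to_string = int_to_string;
    p.u.ival = v;
    return p;
}

static pmc pmc_from_complex(int re, int im)
{
    pmc p;
    p.to_string = complex_to_string;
    p.u.cplx.re = re;
    p.u.cplx.im = im;
    return p;
}

/* The single entry point an extension would call. It never inspects
 * the internal representation; returns the snprintf-style length. */
static int get_string(const pmc *p, encoding enc, char *out, size_t cap)
{
    return p->to_string(p, enc, out, cap);
}
```

An extension holding a complex-number PMC calls get_string exactly as it
would for an integer and receives text such as "4 + 3i" without knowing how
the value is stored.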
Re: To get things started...
Simon Cozens <[EMAIL PROTECTED]> wrote:

> On Tue, Nov 21, 2000 at 10:37:23AM +, David Grove wrote:
> > I'm not sure that it's possible to do this, or desirable. If Larry wants
> > Perl to use different modes, creoles, or ways of interpreting or
> > understanding the "perl" language, then we have to let the parser have a
> > bit more information.
>
> Yes, but these don't have to be external level calls.
>
> > As a point of clarification, I am seeing the external parser as that part
> > of perl that sees the user's script directly.
>
> Likewise.
>
> This is why the "creole" rulesets are *not* external calls.
>
> Syntax definitions, in the form of Perl programs or pragmata, will have to
> go through two stages before they can be used. First, they have to be
> parsed as Perl code. This is a call to the external API of the parser.
>
> Once this is done, the resulting op tree must be processed so that it can
> be turned into a data structure (representing the grammar) which can be
> understood by the parser.

Actually, I think I'm getting it. In my model, what you guys are calling
"internal API" is basically what I'm leaving in that intermediate area,
kinda sorta? The parser that I'm talking about is what receives the perl
code.

So:

1. The External API has a pure syntax that has already gone through a
toplevel process to produce something identifiable to the "External API".

2. There remains a separation between the internal API and the external
api.

3. I'm attempting to call that toplevel parser the "internal api", and need
a word for it.

I'd submit that, since the creole parser needs to speak to the internal
API, it should become part of the spec for the entire parser.

Does it make sense that the creole parser be on the top of that chart I
made, and that the External API ends up what's in the middle?

p
Re: To get things started...
At 12:44 PM 11/21/00 +, Simon Cozens wrote:
>On Tue, Nov 21, 2000 at 07:36:11AM -0500, David Grove wrote:
> > > * The parser needs to be reentrant
> > No clue what this means. I need this defined in context.
>
>While parsing text, you should be able to dive into a separate bit of text,
>parse that, ("re-enter" the parser's routines) come out and carry on
>*exactly* where you left off, without your state being lost.

And, even more so with perl, you may call out of the parser into a bit that
then calls right back into it. For example:

   BEGIN {
     eval "\$foo = 12;";
   }

The parser finishes with the parsing of the BEGIN block and, before
continuing (and thus we haven't exited the parser), calls into the
compilation/execution modules, which then get the eval and call right back
into the parser. (Which then calls the compilation/execution bits, which
could potentially call into the parser again...)

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
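A toy illustration of what reentrancy buys, under heavy assumptions: the
"language" here just sums integers, and a {...} block stands in for a BEGIN
block whose contents are handed to an execution step that calls straight
back into the parser. The names parser, parse_text and run_block are
hypothetical. The point is only that every invocation keeps its state in
its own struct, so the outer parse resumes exactly where it left off.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All parse state lives here; there are no globals or statics, so any
 * number of parses can be in flight at once. */
typedef struct parser {
    const char *cur;   /* next unread character */
    const char *end;   /* one past the last character */
    long        sum;   /* result accumulated by this invocation only */
} parser;

static long parse_text(const char *src, size_t len);

/* Stand-in for the compile/execute step: it calls back into the parser
 * while the outer parse is still suspended mid-input. */
static long run_block(const char *src, size_t len)
{
    return parse_text(src, len);
}

static long parse_text(const char *src, size_t len)
{
    parser p = { src, src + len, 0 };

    while (p.cur < p.end) {
        char c = *p.cur;
        if (c >= '0' && c <= '9') {
            long n = 0;
            while (p.cur < p.end && *p.cur >= '0' && *p.cur <= '9')
                n = n * 10 + (*p.cur++ - '0');
            p.sum += n;
        } else if (c == '{') {
            const char *start = ++p.cur;
            int nest = 1;
            /* find the matching close brace, honouring nesting */
            while (p.cur < p.end && nest > 0) {
                if (*p.cur == '{') nest++;
                else if (*p.cur == '}') nest--;
                p.cur++;
            }
            /* exclude the closing brace itself when one was found */
            size_t inner = (size_t)(p.cur - start) - (nest == 0 ? 1 : 0);
            /* re-enter the parser; p is untouched by the inner call */
            p.sum += run_block(start, inner);
        } else {
            p.cur++;   /* skip whitespace and anything unrecognised */
        }
    }
    return p.sum;
}
```

Parsing "1 2 {3 {4} 5} 6" re-enters the parser twice and still picks up the
trailing 6 in the outermost invocation.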
Re: To get things started...
At 12:46 PM 11/21/00 +, David Grove wrote:
>Dan Sugalski <[EMAIL PROTECTED]> wrote:
>
> > At 10:37 AM 11/21/00 +, David Grove wrote:
> > >Thanks for the clarifications, Simon.
> > >
> > >Simon Cozens <[EMAIL PROTECTED]> wrote:
> > >If we were simply feeding it perl with a single syntax, we could get
> > >away with a "one call" scheme. But since we're dealing with almost
> > >certainly mutually exclusive syntax and semantics, it probably needs
> > >more information.
> >
> > But we are. The call is probably going to be something like:
> >
> >    status = parse_perl(perl_interpreter *my_interp,
> >                        char *script,
> >                        struct HIR *end_result,
> >                        long flags);
> >
> > the fact that the script has a "use pythonish;" in it is entirely
> > irrelevant--the program calls into the parser, which returns a status
> > and possibly a parsed representation of the program. The parser gets to
> > deal with all the grotty details.
>
>What form is this intermediary parsed representation in? API, right? Then
>I need to clarify that when I say bytecode, I've meant whatever this
>intermediary parsed representation is, be that pure perl, API, or
>otherwise.

Okay, you're more confused here than I thought.

API = Application Program Interface (More or less. Something like that, at
least). Basically the list of function calls and their parameters, perhaps
with a set of rules around what can be called when. It's perfectly
appropriate to have some things left as magic cookies at this point, like
the syntax tree format, though their general characteristics can be
specified.

It might well be that we want to define the format of the syntax tree now
as well, though I'd have preferred to leave that for a little later.

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
Re: To get things started...
Dan Sugalski <[EMAIL PROTECTED]> wrote:

> At 07:36 AM 11/21/00 -0500, David Grove wrote:
> >However, one thing is seriously lacking in this theory... if the parser is
> >perl, how does the perl parse? (Sort of a woodchuck chucking wood type of
> >thing.) Somehow, the external parser API thingy has to know enough perl
> >(through the chosen language) to be able to handle the parsing.
>
> Nope. We do it in two phases. The end result will not actually parse perl
> code to build the parser (we'll provide bytecode for that) but to start we
> can run the parser through perl 5 to get a syntax tree until the perl 6
> engine's capable of doing it itself.

Hmmm, that sounds familiar...

> >To quote my perl elders, whatever can be done without regexen should be
> >done with index() (within limits, since some regexen can be quite
> >optimized).
>
> No, not really. regexes are generally easier to comprehend than their
> index counterparts, and often faster. (There's a lot of code that needs to
> go into backtracking...) While index might be better sometimes we can't
> force folks to use it. Almost all of perl is up for grabs.

I won't argue the point as long as it works, the point being that we do it
with whatever method is capable of the greatest efficiency.

> >The parser API needs to know both regexen and index() in order to work.
>
> The parser will have a fully-functional interpreter to work with. All of
> perl will likely be there for it. (Modules and threads might not, but
> that's still up in the air)

But that "interpreter" will be in the form of API, right?

> > > * The parser will have an active interpreter structure handy
> >
> >Is this the perl that parses the perl?
>
> Yup. In fact we might have two--the interpreter structure for the
> interpreter running the parser, and the structure for the end-result
> parsed program. Or we might just use one and squirrel all the interpreter
> bits in a private (and deletable) namespace somewhere.

It's pretty clear that we're to purposely put in a distinct separation
between the two, unless I misunderstood Larry on this. I'm cautious about
dual-purposing anything here, since he said that this is a major problem
in Perl 5 today (the lack of flexibility between either end).

I'd like to ask for a clarification of the following terms as they apply
here:

1. External API
2. Internal API
3. Parser
4. Interpreter
5. What seems to be my "toplevel" parser (the creole parser)
6. Bytecode
7. Syntax Tree

And what language they should be in (if Larry's undefined language, just
say C-Larry or something), what they input, and what they output, and what
they input from and output to in terms of the next level of functionality.

I think we're on the same wavelength, but not speaking the same language.

I'd also like to offer an explanation. As I mentioned earlier, I've already
been working on a perlish to perl translator, so the "toplevel creole
parser" is particularly interesting to me as something that I've basically
already worked on, so it's where my head is. The different output modes as
well, because they're just the top turned upside down... forgive my lack of
attention to (and understanding of) the middleparts.

p
Re: To get things started...
> Okay, you're more confused here than I though. I can't deny that, but at least I helped get this group talking. The silence was deafening. Participation feels good though, when I'm not getting yelled at for being technically inarticulate (P5P). Maybe if we can keep up the good attitudes, we can swamp the P5P in terms of active participants... Above all, thanks for your patience, Dan (and Simon). ;-)) p
Re: To get things started...
At 01:04 PM 11/21/00 +, David Grove wrote:
>Dan Sugalski <[EMAIL PROTECTED]> wrote:
>
> > At 07:36 AM 11/21/00 -0500, David Grove wrote:
> > >However, one thing is seriously lacking in this theory... if the
> > >parser is perl, how does the perl parse? (Sort of a woodchuck chucking
> > >wood type of thing.) Somehow, the external parser API thingy has to
> > >know enough perl (through the chosen language) to be able to handle
> > >the parsing.
> >
> > Nope. We do it in two phases. The end result will not actually parse
> > perl code to build the parser (we'll provide bytecode for that) but to
> > start we can run the parser through perl 5 to get a syntax tree until
> > the perl 6 engine's capable of doing it itself.
>
>Hmmm, that sounds familiar...

Sure. Compilers have been doing it for decades.

> > >To quote my perl elders, whatever can be done without regexen should
> > >be done with index() (within limits, since some regexen can be quite
> > >optimized).
> >
> > No, not really. regexes are generally easier to comprehend than their
> > index counterparts, and often faster. (There's a lot of code that needs
> > to go into backtracking...) While index might be better sometimes we
> > can't force folks to use it. Almost all of perl is up for grabs.
>
>I won't argue the point as long as it works, the point being that we do it
>with whatever method is capable of the greatest efficiency.

As long as everyone understands that efficiency doesn't necessarily mean
the code that executes the fastest. While I want the parser fast, it is
generally a one-shot thing, and if it takes an extra millisecond or twelve
that probably doesn't make much difference. Cutting a day or twelve off of
the preliminary development time, though, does matter rather a lot more.

> > >The parser API needs to know both regexen and index() in order to
> > >work.
> >
> > The parser will have a fully-functional interpreter to work with. All
> > of perl will likely be there for it. (Modules and threads might not, but
> > that's still up in the air)
>
>But that "interpreter" will be in the form of API, right?

No. The API is just a set of functions. I mean an interpreter, a real
entity that can do something. Pretty much the same as an interpreter
instance in perl 5.

> > > > * The parser will have an active interpreter structure handy
> > >
> > >Is this the perl that parses the perl?
> >
> > Yup. In fact we might have two--the interpreter structure for the
> > interpreter running the parser, and the structure for the end-result
> > parsed program. Or we might just use one and squirrel all the
> > interpreter bits in a private (and deletable) namespace somewhere.
>
>It's pretty clear that we're to purposely put in a distinct separation
>between the two, unless I misunderstood Larry on this.

You probably misunderstood a little. I don't think Larry really cares how
it works as long as it does. If the parser leaves a lot of cruft in the
_Parser namespace it likely matters not.

>I'm cautious about
>dual-purposing anything here, since he said that this is a major problem
>in Perl 5 today (the lack of flexibility between either end).
>
>I'd like to ask for a clarification of the following terms as they apply
>here:
>
>1. External API

The functions presented to the world at large, including other parts of
perl. (The bytecode compiler, the optimizer, and the interpreter,
specifically)

>2. Internal API

The functions, hooks, and spots for hooks presented to the code inside the
parser.

>3. Parser

The piece of perl that takes a stream of source and emits a syntax tree.

>4. Interpreter

The piece of perl that takes a chunk of bytecode and executes it.

>5. What seems to be my "toplevel" parser (the creole parser)

Got me there.

>6. Bytecode

Perl's machine code. The stuff that gets fed to the interpreter.

>7. Syntax Tree

The parsed, tokenized, and cleaned-up version of the source. See the dragon
book (or any good compiler book) for more details.

>And what language they should be in (if Larry's undefined language, just
>say C-Larry or something)

The parser should be mostly perl. The rest will be in a mix of something
Cish and perl. (The Cish stuff will likely be run through a perl filter to
produce real C, though it'll hopefully have features in it that'll rein in
some of C's more error-prone features)

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
Re: To get things started...
Thanks for the clarifications, Simon.

Simon Cozens <[EMAIL PROTECTED]> wrote:

> On Tue, Nov 21, 2000 at 07:36:11AM -0500, David Grove wrote:
> > > 1) The API presented to the rest of the world. This is likely one
> > > call,
> >
> > These are almost two separate things entirely. (I don't get the "one
> > call" thing. What do you mean?)
>
> A parser does, essentially, one single thing: it takes text and turns it
> into an op tree. That's the only call you need to make from an external
> perspective.

I'm not sure that it's possible to do this, or desirable. If Larry wants
Perl to use different modes, creoles, or ways of interpreting or
understanding the "perl" language, then we have to let the parser have a
bit more information. This includes the ability to tell it what creole it's
currently interpreting (it will probably need a stack for that, since I can
foresee people trying "use tclish" within a "use pythonic", unless one
overrides the other and turns off the previous mode: in which case it just
needs to know its current mode). It also needs to know where to get its
information. If we want a small kernel, then we can't give it information
about how to parse the different modes within the micro-kernel itself. It
would have to be bound to the kernel, or loaded as a file.

If we were simply feeding it perl with a single syntax, we could get away
with a "one call" scheme. But since we're dealing with almost certainly
mutually exclusive syntax and semantics, it probably needs more
information.

As a point of clarification, I am seeing the external parser as that part
of perl that sees the user's script directly.

> > the external API needs to be flexible to handle perl in different
> > writing styles
>
> This doesn't need to be the case; the external API may be
> language-agnostic, with the language rules set by internal calls.

Then I'm misunderstanding the difference between external and internal. If
external touches the user's script, it can't separate itself from whatever
particular syntax is currently in use. The internal portion of the parser
that I suppose I've now proposed is what can be user-syntax independent.

I'm seeing the external api as being the part that receives the user's
script from different modes and turns it into an intermediary, and the
internal as what takes the intermediary and turns it into whatever form of
output that we've chosen. From what I understand of Larry's desires for the
language, we need multiple possible ways to input, and multiple possible
ways to output, but internally we need that language agnostic thing.

Maybe I'm going beyond the purpose of "API". Let me know if this is the
case.

> > > * The parser needs to be reentrant
> > No clue what this means. I need this defined in context.
>
> While parsing text, you should be able to dive into a separate bit of
> text, parse that, ("re-enter" the parser's routines) come out and carry on
> *exactly* where you left off, without your state being lost.

Thanks. That's basically what it sounded like. Can you give an example? I
mean, are we expressing the need for do {BLOCK} with this, or threads, or
multiplicity, or something else?

> >   perl6   perl5   python   tclish
> >       \      \      /      /
> >        \      \    /      /
> >      ---------------------------
> >      READSTDIN and other commons
> >            full tree here
> >      ---------------------------
> >                  |
> >                  | <- required
> >                  |
> >      ---------------------------
> >                OPCODES
> >      ---------------------------
> >        /      /    \      \
> >       /      /      \      \
> >     run    store    exe     a
> >     bc     bc     binary   java thingy
>
> I think you've just invented the compiler! :)

I don't think so. In a compiler I don't believe that the intermediate step
is there, and I've never seen any compiler accept multiple input semantics
and multiple output (meaning binary, bytecode, java, c#) (okay, C++
Builder can accept pascal... but that's an one). However, I've foreseen the
output of compiled code as a part of this. The desire there was to make it
easier to make a compiler, or at least possible to output executable code
as an output mode. The exe-binary is actually, of course, several different
pidgins within the creole, since we're outputting to Linux, Solaris, Win32,
etc. ad nauseam...

Keep in mind that I don't have a clear definition for the intermediate step
except that it is desired to separate the external from the internal as I
understand them. (For now, I'm conceptualizing it as a one-to-one that can
change without hurting the internal or external, solely for the purpose of
the desired flexibility and separation.) But then, I've yet to see whether
I'm understanding them.

I also realize I'm off the topic of the bytecode itself, but I'm not sure
how much bytecode I can apply to an undefined language.

Can somebody let me know if any of what I've said is relevant?
Re: To get things started...
On Tue, 21 Nov 2000, David Grove wrote:

> If we were simply feeding it perl with a single syntax, we could get away
> with a "one call" scheme. But since we're dealing with almost certainly
> mutually exclusive syntax and semantics, it probably needs more
> information.

Perhaps the "one call" can take some arguments? I suppose it would need
to know what kind of syntax to expect.

> Larry's desires for the language, we need multiple possible ways to input,
> and multiple possible ways to output, but internally we need that language
> agnostic thing.

Bytecode, right?

> I don't think so. In a compiler I don't believe that the intermediate step
> is there

It definitely is. Few optimizations are possible without an intermediate
representation of some kind!

> , and I've never seen any compiler accept multiple input semantics

GCC - recently renamed the "GNU Compiler Collection" for a reason!

> and multiple output (meaning binary, bytecode, java, c#) (okay, C++

This is also not uncommon - you can look at cross-compilers as one example.
Java compilers that can produce bytecode and native code are another.

> Can somebody let me know if any of what I've said is relevant?

Highly relevant, but also somewhat "known". I think you would be
interested in reading a good book on compiler design. The dragon-book is
a perennial favorite, although there might be more up-to-date material
available these days. At least, I hope there is!

-sam
Re: To get things started...
On Tue, Nov 21, 2000 at 10:37:23AM +, David Grove wrote:
> I'm not sure that it's possible to do this, or desirable. If Larry wants
> Perl to use different modes, creoles, or ways of interpreting or
> understanding the "perl" language, then we have to let the parser have a
> bit more information.

Yes, but these don't have to be external level calls.

> As a point of clarification, I am seeing the external parser as that part
> of perl that sees the user's script directly.

Likewise.

This is why the "creole" rulesets are *not* external calls.

Syntax definitions, in the form of Perl programs or pragmata, will have to
go through two stages before they can be used. First, they have to be parsed
as Perl code. This is a call to the external API of the parser.

Once this is done, the resulting op tree must be processed so that it can be
turned into a data structure (representing the grammar) which can be
understood by the parser.

So, something's gone into the parser, and the parser has determined that
this is a language definition - the parser then passes it off to the
grammar-processor which constructs a grammar for it, and hands the grammar
back to the parser. At this level, it is not "seeing the user's script
directly", and thus I would say that the communication between the
grammar-processor and the parser was an internal level API.

> I don't think so. In a compiler I don't believe that the intermediate step
> is there,

I really *would* recommend Aho, Sethi, Ullman, "Compilers: Principles,
Techniques and Tools".

> and I've never seen any compiler accept multiple input semantics
> and multiple output (meaning binary, bytecode, java, c#) (okay, C++
> Builder can accept pascal... but that's an one).

gcc is a compiler which can receive C, C++, Objective C, and Fortran input,
and produce output for quite an array of architectures.

--
A successful [software] tool is one that was used to do something
undreamed of by its author.
    -- S. C. Johnson
Re: To get things started...
On Mon, Nov 20, 2000 at 06:01:52PM -0500, Dan Sugalski wrote:
> * The parser will be written mostly in perl, so you have regexes and such
>   to work with
> * It's possible that the whole set of parsing rules may change on the fly,
>   so don't get hung up on constants like "{"--stick to symbolic things like
>   start_scope instead

A thought strikes me.
A few perl constructions ('', "", q(), qq() offhand, possibly others)
can contain embedded newlines.
A regular expression to match "" strings ( /"([^\\"]|\\.)*"/s ) assumes
that it has all the characters needed to match already in memory.
A parser written in C typically sees the opening " and goes into a loop
munching characters from the input until it meets the closing ".
The input may be line buffered (as in current perl) but if the buffer runs
out before the closing " it is refilled with another line as often as
needed.

How is our quoted string matcher going to work in the face of strings
containing embedded literal newlines?

Are we hoping that we can mmap() most scripts, so read isn't hugely a
problem? And slurp the rest in one? [doesn't feel good]
Are we going to have "lazy scalars" which collude with the regexp engine
so that if the regexp engine hits the current end more is read from
the file handle?
Something else?

Or is this not a problem for some reason I've not thought of?

Nicholas Clark
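For reference, the C-style loop described above (munch characters after the
opening ", refilling the line buffer whenever it runs dry) could look
something like the following sketch. It accepts the same strings as
/"([^\\"]|\\.)*"/s. The names line_source, next_line and scan_dq are
hypothetical, and an in-memory array of lines stands in for the file handle
so that the example needs no I/O.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical line-at-a-time input; stands in for a buffered file. */
typedef struct line_source {
    const char **lines;   /* remaining input, one string per line */
    size_t       count;
    size_t       next;    /* index of the next line to hand out */
} line_source;

static const char *next_line(line_source *src)
{
    return src->next < src->count ? src->lines[src->next++] : NULL;
}

/* p points at the opening double quote inside the current line buffer.
 * Copies the string body into out (escape sequences are kept verbatim)
 * and returns its length, or -1 if the input ends before the closing
 * quote or the body does not fit in cap bytes. */
static long scan_dq(line_source *src, const char *p, char *out, size_t cap)
{
    size_t n = 0;
    int escaped = 0;   /* true when the previous char was an active '\' */

    if (*p != '"')
        return -1;
    p++;

    for (;;) {
        if (*p == '\0') {            /* buffer ran dry: refill */
            p = next_line(src);
            if (!p)
                return -1;           /* unterminated string */
            continue;
        }
        if (!escaped && *p == '"')
            break;                   /* closing quote found */
        if (n + 1 >= cap)
            return -1;
        escaped = (!escaped && *p == '\\');
        out[n++] = *p++;
    }
    out[n] = '\0';
    return (long)n;
}
```

Because the escape state is carried in a variable rather than inferred by
looking ahead in the buffer, a refill can happen at any point without
losing track of a pending backslash.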
Re: To get things started...
On Wed, 22 Nov 2000, Nicholas Clark wrote:

> Are we hoping that we can mmap() most scripts, so read isn't hugely a
> problem? And slrp the rest in one? [doesn't feel good]
> Are we going to have "lazy scalars" which collude with the regexp engine
> so that if the regexp engine hits the current end more is read from
> the file handle?
> Something else?

Perhaps we could add a mode to the regex engine like:

   $filehandle =~ /.../;

Where the engine itself would do the reading and buffering. Ok, that
might not be such a good idea... This probably never returns, eh:

   $filehandle =~ /(.*)/;

However we solve the problem I hope we can allow Perl programmers access
to the solution. This is a very common problem with regex parsers.

-sam
Re: To get things started...
At 11:45 PM 11/21/00 +, Tom Hughes wrote:
>In message <[EMAIL PROTECTED]>
> Dan Sugalski <[EMAIL PROTECTED]> wrote:
>
> > At 10:18 AM 11/21/00 -0800, Benjamin Stuhl wrote:
> >
> > >Well, it would (IMHO) make more sense to have
> > >perl6_parse_script (I do tend to follow
> > >{subsystem,verb,object} naming...)
> >
> > Or Perl$parse_script, but that's a matter of taste, I suppose. :)
>
>Given that it isn't a valid C identifier, yes... Unless you're
>using VAXC or DECC of course, which was your point I assume ;-)

Odd. The Dec C docs don't mention it as a problem, and both Dec C on VMS and GCC on a linux box take it without complaint. They might've slipped it in as valid in the final ANSI standard or something. (I can't dig up my ANSI K&R to check, unfortunately)

So it wasn't actually my point, though I'm fine with avoiding $ in identifiers, since I expect some platforms will be rather unhappy with it. (And other languages may well have restrictions that wouldn't allow it--I don't know if COBOL or Fortran are OK with dollar signs...)

Dan

--"it's like this"---
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk
Re: To get things started...
On Tue, Nov 21, 2000 at 09:39:16PM -0500, Dan Sugalski wrote:
> At 11:45 PM 11/21/00 +, Tom Hughes wrote:
> >In message <[EMAIL PROTECTED]>
> > Dan Sugalski <[EMAIL PROTECTED]> wrote:
> >
> > > At 10:18 AM 11/21/00 -0800, Benjamin Stuhl wrote:
> > >
> > > >Well, it would (IMHO) make more sense to have
> > > >perl6_parse_script (I do tend to follow
> > > >{subsystem,verb,object} naming...)
> > >
> > > Or Perl$parse_script, but that's a matter of taste, I suppose. :)
> >
> >Given that it isn't a valid C identifier, yes... Unless you're
> >using VAXC or DECC of course, which was your point I assume ;-)
>
> Odd. The Dec C docs don't mention it as a problem, and both Dec C on VMS
> and GCC on a linux box take it without complaint. They might've slipped it
> in as valid in the final ANSI standard or something. (I can't dig up my
> ANSI K&R to check, unfortunately)

Crank up the warnings to strict ANSI and even DEC C moans. At least on Digital UNIX it does.

$ cat x.c
static int foo$bar = 42;
$ cc -c -std1 x.c
cc: Warning: x.c, line 1: Extension: A '$' was encountered in an identifier.
static int foo$bar = 42;
---^
$

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
Re: To get things started...
I'm still not sure where to start from a technical standpoint, so I'll just comment and brainstorm until someone more used to this tells me whether my common cents should be in US Dollars or South African ZAR. Please forgive a bit of rambling; I'm not purposely off topic if I am.

Dan Sugalski <[EMAIL PROTECTED]> wrote:
> This list is here to design the internal and external API for the
> parser/tokenizer/lexer part of perl. Basically we need two bits:
>
> 1) The API presented to the rest of the world. This is likely one call,
> though if folks want to split it out for external and internal use, that's
> fine.
>
> 2) The internal API. These are the places where hooks can be installed, or
> bits of the parser that those hooks can call back into the parser. (Or
> parser/lexer/tokenizer utility routines the hooks can call)

These are almost two separate things entirely. (I don't get the "one call" thing. What do you mean?)

First of all, if we take what Larry said and try to conceptualize it in terms of a parser, the external API needs to be flexible enough to handle perl in different writing styles... creoles, I'd call them, since I think Larry would appreciate that term. (Amateur philologist here.)

The external parser needs to be almost user-configurable to accomplish this. Far from being simple, this is actually quite complex, since the external API needs to be able to take directions from many creoles and filter them into something that the internal parser can understand. I foresee as many mappings to internals in the external parser as the internal parser has to bytecode in the new perlguts. The external API needs to know what to map to where, and how. This is where the regexen basically come in, I think. (Read comments on index() vs regexen below).
The API that I'm seeing, and I'm not particularly inventive in this area, is a perl hash-type structure mapping regexen to perlguts, where the particular mappings are determined by pragmata:

    use pythonic;
    use javanese;
    use tclish (:teehee);
    use hungarian;
    use forth; # drink fifth

I also don't believe that this outer layer needs to be particularly intelligent when it comes to knowing perl's internals, but I do believe that it has to have a mind of its own if we're to provide the promised capabilities of alternate input styles.

    %PL_API_EX = (
        'perl6' => {
            'PRINTCHAR' => [OPTYPE_RX, qr/\bprint\b\s+(\w+\s+)(??{PL_STRING_LIST})/],
            'READSTDIN' => [OPTYPE_IX, ""],
        },
    );

In this, I'm trying (with extreme and admittedly clumsy effort) to express that the perl6 (default) creole understands that in order to get to the PRINTCHAR internal API, it does a regular expression search (with an embedded function to find the nether end of the print command and use that as a part of the regex).

Since we're doing this in perl and since we want a small core, this appears to be a Config.pm type problem, where syntax is defined externally, either in a module or some type of compiled thingy. Or, maybe it would be appropriate to go the Linux Kernel route, and decide at compile time what is in the "kernel" and what is loaded as a "module". (Hey, that sounds good for some PDD somewhere else).

Now, the internal is actually the less brainy. It basically just needs to provide a commonality that the external API will connect to when using any creole. Mapping to bytecodes is beyond my skill when discussing a theoretical language, however.

I do think that it is important to make the distinction between the external and internal modules. Larry made it clear that he wanted to separate these, for flexibility on both ends. (Also good for PDDing, I think.)

However, one thing is seriously lacking in this theory... if the parser is perl, how does the perl parse? (Sort of a woodchuck chucking wood type of thing.)
Somehow, the external parser API thingy has to know enough perl (through the chosen language) to be able to handle the parsing. To parse this thing, it would seem that we need a third layer... a C/C++/C-Larry parser (yylex, etc.). Once we have that, we can accomplish the goals.

[GOALS]

EXTERNAL API:
1. Provide a multi-creole interface as a middleman between the programmer and his language.
2. Provide a common interface (mapping) between the creole and the internal API.
3. Write it in Perl.

INTERNAL API:
1. Expose the internal API to be used by the external API for use by the creoles.
2. Provide a common interface (mapping) between the internal API and the underlying language.
3. Write it in ...
4. Provide a mapping between the internal bytecodes and either internal Perl or translation API (the C# and Java thingies)

[PROBLEMS]

1. Figure out how perl is going to parse perl without a perl to parse the perl with (we need a base parser of some type). The perl "kernel" may need to be defined as "just enough C to make perl parse". Larry did say that he'd like to move the c library out of the kernel... We'd need the basic data structures and regexen, and a basic bootstr
Re: To get things started...
On Tue, Nov 21, 2000 at 07:36:11AM -0500, David Grove wrote:
> > 1) The API presented to the rest of the world. This is likely one call,
>
> These are almost two separate things entirely. (I don't get the "one call"
> thing. What do you mean?)

A parser does, essentially, one single thing: it takes text and turns it into an op tree. That's the only call you need to make from an external perspective.

> the external API needs to be flexible to handle perl in different writing
> styles

This doesn't need to be the case; the external API may be language-agnostic, with the language rules set by internal calls.

> > * The parser needs to be reentrant
> No clue what this means. I need this defined in context.

While parsing text, you should be able to dive into a separate bit of text, parse that, ("re-enter" the parser's routines) come out and carry on *exactly* where you left off, without your state being lost.

> perl6  perl5  python  tclish
>     \      \    /      /
>      \      \  /      /
>   ---------------------------
>   READSTDIN and other commons
>        full tree here
>   ---------------------------
>               |
>               | <- required
>               |
>   ---------------------------
>             OPCODES
>   ---------------------------
>      /      /    \      \
>     /      /      \      \
>   run    store    exe     a
>   bc     bc      binary  java thingy

I think you've just invented the compiler! :)

--
It's difficult to see the picture when you are inside the frame.
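To make the reentrancy point concrete, here is a toy sketch in C, assuming nothing about the real perl 6 parser. All parse state lives in a struct on the current call's stack, so a hook can re-enter the parser on a separate piece of text and the outer parse carries on where it left off. The names `parse_state`, `parse_words` and `include_hook` are invented for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Per-call parse state: no globals, so nested calls cannot clobber each other. */
typedef struct {
    const char *pos;  /* current read position in this text */
    long words;       /* words seen so far in this text */
} parse_state;

/* Hook invoked after each word; it is allowed to call the parser itself. */
typedef void (*word_hook)(const char *word, size_t len, void *data);

/* Toy "parser": splits text on whitespace and returns the word count. */
static long parse_words(const char *text, word_hook hook, void *data) {
    parse_state st = { text, 0 };
    assert(text != NULL);
    while (*st.pos != '\0') {
        while (isspace((unsigned char)*st.pos)) st.pos++;
        const char *start = st.pos;
        while (*st.pos != '\0' && !isspace((unsigned char)*st.pos)) st.pos++;
        if (st.pos > start) {
            st.words++;
            if (hook != NULL) hook(start, (size_t)(st.pos - start), data);
        }
    }
    return st.words;
}

/* Hypothetical hook: on the word "include", re-enter the parser on other text. */
static void include_hook(const char *word, size_t len, void *data) {
    long *nested_total = data;
    if (len == 7 && strncmp(word, "include", 7) == 0)
        *nested_total += parse_words("three more words", NULL, NULL);
}
```

Calling `parse_words("a include b include c", include_hook, &nested)` parses the nested text twice from inside the outer parse, yet the outer word count is unaffected because the two parses never share state.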
Re: To get things started...
At 10:37 AM 11/21/00 +, David Grove wrote:
>Thanks for the clarifications, Simon.
>
>Simon Cozens <[EMAIL PROTECTED]> wrote:
>If we were simply feeding it perl with a single syntax, we could get away
>with a "one call" scheme. But since we're dealing with almost certainly
>mutually exclusive syntax and semantics, it probably needs more
>information.

But we are. The call is probably going to be something like:

    status = parse_perl(perl_interpreter *my_interp,
                        char *script,
                        struct HIR *end_result,
                        long flags);

the fact that the script has a "use pythonish;" in it is entirely irrelevant--the program calls into the parser, which returns a status and possibly a parsed representation of the program. The parser gets to deal with all the grotty details.

>As a point of clarification, I am seeing the external parser as that part
>of perl that sees the user's script directly.

Mostly directly. There'll probably still be a level of indirection there, since we need to take into account embedding programs that may do odd things.

> > > the external API needs to be flexible to handle perl in different writing
> > > styles
> >
> > This doesn't need to be the case; the external API may be language-agnostic,
> > with the language rules set by internal calls.
>
>Then I'm misunderstanding the difference between external and internal.

Yup, I think so, but that's OK. The internal API also doesn't need to care about a lot of language-level stuff either. We need to take into account the fact that the rules on what's a scalar (or a block start, or comment, or whatever) may be dynamic, but the parse_token call is the parse_token call, regardless of the rules in effect.

>If external touches the user's script, it can't separate itself from whatever
>particular syntax is currently in use.

Sure it can. It *must*, otherwise we'd need to rewrite parts of the parser every time we added another language variant. I don't want to have to rebuild perl just to use python mode. Yech.
>Maybe I'm going beyond the purpose of "API". Let me know if this is the
>case.

Yup. You've got the language rules mixed in with the API. Separate beast, and one we're not dealing with here.

> > > perl6  perl5  python  tclish
> > >     \      \    /      /
> > >      \      \  /      /
> > >   ---------------------------
> > >   READSTDIN and other commons
> > >        full tree here
> > >   ---------------------------
> > >               |
> > >               | <- required
> > >               |
> > >   ---------------------------
> > >             OPCODES
> > >   ---------------------------
> > >      /      /    \      \
> > >     /      /      \      \
> > >   run    store    exe     a
> > >   bc     bc      binary  java thingy
> >
> > I think you've just invented the compiler! :)
>
>I don't think so. In a compiler I don't believe that the intermediate step
>is there, and I've never seen any compiler accept multiple input semantics
>and multiple output (meaning binary, bytecode, java, c#)

Pretty much everyone's compiler does this at this point. Gcc's already been mentioned, but Dec's compiler suites for VAX and Alpha do the same thing, and from the literature it looks like other folks do it too. There's a custom front end that produces an intermediate representation, and a common IR->object optimizing back-end.

Dan

--"it's like this"---
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk
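The front end / common IR / back end split described above can be illustrated with a deliberately tiny C sketch. It is not how gcc or the DEC compilers are built, just the shape of the idea: two hypothetical front ends (`parse_infix`, `parse_prefix`) accept different surface syntaxes and fill in the same `ir_node`, and two hypothetical back ends (`run_ir`, `emit_ir`) consume that node without knowing which syntax produced it.

```c
#include <assert.h>
#include <stdio.h>

/* Common intermediate representation: a single binary operation. */
typedef struct {
    char op;        /* '+', '-' or '*' */
    long lhs, rhs;  /* operands */
} ir_node;

/* Front end 1: infix syntax such as "2 * 21". Returns 1 on success. */
static int parse_infix(const char *src, ir_node *out) {
    assert(src != NULL && out != NULL);
    return sscanf(src, "%ld %c %ld", &out->lhs, &out->op, &out->rhs) == 3;
}

/* Front end 2: prefix syntax such as "* 2 21". Returns 1 on success. */
static int parse_prefix(const char *src, ir_node *out) {
    assert(src != NULL && out != NULL);
    return sscanf(src, " %c %ld %ld", &out->op, &out->lhs, &out->rhs) == 3;
}

/* Back end 1: evaluate the IR directly. Unknown operators yield 0. */
static long run_ir(const ir_node *n) {
    switch (n->op) {
    case '+': return n->lhs + n->rhs;
    case '-': return n->lhs - n->rhs;
    case '*': return n->lhs * n->rhs;
    default:  return 0;
    }
}

/* Back end 2: emit a made-up textual stack code; returns snprintf's result. */
static int emit_ir(const ir_node *n, char *buf, size_t size) {
    return snprintf(buf, size, "PUSH %ld; PUSH %ld; OP %c",
                    n->lhs, n->rhs, n->op);
}
```

Adding a third input syntax means writing one more front end; neither back end changes, which is the reason a new language variant should not require rebuilding the rest of the system.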
Re: To get things started...
At 07:36 AM 11/21/00 -0500, David Grove wrote:
>However, one thing is seriously lacking in this theory... if the parser is
>perl, how does the perl parse? (Sort of a woodchuck chucking wood type of
>thing.) Somehow, the external parser API thingy has to know enough perl
>(through the chosen language) to be able to handle the parsing.

Nope. We do it in two phases. The end result will not actually parse perl code to build the parser (we'll provide bytecode for that) but to start we can run the parser through perl 5 to get a syntax tree until the perl 6 engine's capable of doing it itself.

>[GOALS]
>EXTERNAL API:
>1. Provide a multi-creole interface as a middleman between the programmer
>and his language.
>2. Provide a common interface (mapping) between the creole and the
>internal API.
>3. Write it in Perl.

Yup.

>INTERNAL API:
>1. Expose the internal API to be used by the external API for use by the
>creoles.
>2. Provide a common interface (mapping) between the internal API and the
>underlying language.
>3. Write it in ...

Yup.

>4. Provide a mapping between the internal bytecodes and either internal
>Perl or translation API (the C# and Java thingies)

Nope. The syntax-tree to bytecode converter's a separate piece.

> > The general rules of the game are:
> >
> > * The parser will be written mostly in perl, so you have regexes and such
> > to work with
>
>To quote my perl elders, whatever can be done without regexen should be
>done with index() (within limits, since some regexen can be quite
>optimized).

No, not really. Regexes are generally easier to comprehend than their index counterparts, and often faster. (There's a lot of code that needs to go into backtracking...) While index might be better sometimes, we can't force folks to use it. Almost all of perl is up for grabs.

>The parser API needs to know both regexen and index() in order to work.

The parser will have a fully-functional interpreter to work with. All of perl will likely be there for it.
(Modules and threads might not, but that's still up in the air)

> > * The parser will have an active interpreter structure handy
>
>Is this the perl that parses the perl?

Yup. In fact we might have two--the interpreter structure for the interpreter running the parser, and the structure for the end-result parsed program. Or we might just use one and squirrel all the interpreter bits in a private (and deletable) namespace somewhere.

>Yeah, I'm going over this in order.
>Maybe I should have read the whole thing first.

Nah, y'think? :-P

> > * The parser needs to be reentrant
>
>No clue what this means. I need this defined in context.

The parser needs to be able to call back into itself without screwing things up. Very little global state, in other words.

> > * The ultimate output of the parser will be a syntax tree
>
>I think I said that.

More or less. Perl will probably have two different intermediate representations, the parsed syntax tree and bytecode. The parser only spits out the syntax tree.

Dan

--"it's like this"---
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk
Re: To get things started...
--- Dan Sugalski <[EMAIL PROTECTED]> wrote:
> At 10:37 AM 11/21/00 +, David Grove wrote:
> >Thanks for the clarifications, Simon.
> >
> >Simon Cozens <[EMAIL PROTECTED]> wrote:
> >If we were simply feeding it perl with a single syntax, we could get away
> >with a "one call" scheme. But since we're dealing with almost certainly
> >mutually exclusive syntax and semantics, it probably needs more
> >information.
>
> But we are. The call is probably going to be something like:
>
>    status = parse_perl(perl_interpreter *my_interp,
>                        char *script,
>                        struct HIR *end_result,
>                        long flags);

Well, it would (IMHO) make more sense to have perl6_parse_script (I do tend to follow {subsystem,verb,object} naming...) take a PerlIO*, so that it is completely transparent parsing from a file or a string. This gets almost into embedding issues, though (how much of a libc does perl6 really need? perl5 now carries large chunks of one around with it).

> the fact that the script has a "use pythonish;" in it is entirely
> irrelevant--the program calls into the parser, which returns a status and
> possibly a parsed representation of the program. The parser gets to deal
> with all the grotty details.
>
[snip]

-- BKS
Re: To get things started...
At 10:18 AM 11/21/00 -0800, Benjamin Stuhl wrote:
>--- Dan Sugalski <[EMAIL PROTECTED]> wrote:
> > At 10:37 AM 11/21/00 +, David Grove wrote:
> > >Thanks for the clarifications, Simon.
> > >
> > >Simon Cozens <[EMAIL PROTECTED]> wrote:
> > >If we were simply feeding it perl with a single syntax, we could get away
> > >with a "one call" scheme. But since we're dealing with almost certainly
> > >mutually exclusive syntax and semantics, it probably needs more
> > >information.
> >
> > But we are. The call is probably going to be something like:
> >
> >    status = parse_perl(perl_interpreter *my_interp,
> >                        char *script,
> >                        struct HIR *end_result,
> >                        long flags);
>
>Well, it would (IMHO) make more sense to have
>perl6_parse_script (I do tend to follow
>{subsystem,verb,object} naming...)

Or Perl$parse_script, but that's a matter of taste, I suppose. :)

> take a PerlIO*, so that
>it is completely transparent parsing from a file or a
>string. This gets almost into embedding issues, though (how
>much of a libc does perl6 really need? perl5 now carries
>large chunks of one around with it).

I'm not sure we want a PerlIO* passed in, for embedding reasons. I can see embedding programs wanting to pass in a pointer to a string with the whole script, a filehandle of some sort, or a pointer to a function that produces the script in really odd cases. (Possibly with a second pointer in that case to misc data) Maybe something like:

    perl_parse(interp *interp, void *script, void *extra,
               syntree *parsed_perl, int flags);

where the flags arg indicates what sort of thing the script pointer is. Or perhaps:

    perl_parse(interp *interp, void *script, syntree *parsed_perl,
               int flags, void *extra);

with the extra pointer and the flags argument vararg'd into optionality. (Defaulting to NULL and 0, respectively)

Dan

--"it's like this"---
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk
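A rough sketch of how a flags argument could say what the script pointer is, in the spirit of the perl_parse() proposals above. Everything here is assumed for illustration: the `SRC_*` values, `script_reader`, `load_script` and `chunk_reader` are invented names, and the sketch only gathers the script text into a buffer rather than parsing it. Since ISO C does not guarantee that a function pointer survives a round trip through `void *`, the callback case passes a pointer to a function-pointer variable instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag values naming what the script pointer refers to. */
enum { SRC_STRING = 0, SRC_FILE = 1, SRC_CALLBACK = 2 };

/* Callback form: store up to `size` bytes in buf, return the count, 0 at end. */
typedef size_t (*script_reader)(char *buf, size_t size, void *extra);

/*
 * Gather the script text into `out` (NUL-terminated). Returns the number of
 * bytes stored, or -1 for an unknown flag or a string that does not fit.
 * File and callback sources are read up to outsize - 1 bytes.
 */
static long load_script(void *script, void *extra, int flags,
                        char *out, size_t outsize) {
    size_t n = 0, got;
    assert(script != NULL && out != NULL && outsize > 0);
    switch (flags) {
    case SRC_STRING:                      /* script is a char * */
        n = strlen((const char *)script);
        if (n >= outsize) return -1;
        memcpy(out, script, n);
        break;
    case SRC_FILE:                        /* script is a FILE * */
        n = fread(out, 1, outsize - 1, (FILE *)script);
        break;
    case SRC_CALLBACK: {                  /* script points at a script_reader */
        script_reader read_more = *(script_reader *)script;
        while (n < outsize - 1 &&
               (got = read_more(out + n, outsize - 1 - n, extra)) > 0)
            n += got;
        break;
    }
    default:
        return -1;
    }
    out[n] = '\0';
    return (long)n;
}

/* Example callback: hands out the string that *extra points at, 4 bytes at a time. */
static size_t chunk_reader(char *buf, size_t size, void *extra) {
    const char **cursor = extra;
    size_t n = strlen(*cursor);
    if (n > 4) n = 4;
    if (n > size) n = size;
    memcpy(buf, *cursor, n);
    *cursor += n;
    return n;
}
```

The `extra` pointer plays the "misc data" role mentioned above: here it carries the callback's cursor, and it is simply ignored for the string and file cases.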
Re: To get things started...
Dan Sugalski <[EMAIL PROTECTED]> wrote:
> At 10:37 AM 11/21/00 +, David Grove wrote:
> >Thanks for the clarifications, Simon.
> >
> >Simon Cozens <[EMAIL PROTECTED]> wrote:
> >If we were simply feeding it perl with a single syntax, we could get away
> >with a "one call" scheme. But since we're dealing with almost certainly
> >mutually exclusive syntax and semantics, it probably needs more
> >information.
>
> But we are. The call is probably going to be something like:
>
>    status = parse_perl(perl_interpreter *my_interp,
>                        char *script,
>                        struct HIR *end_result,
>                        long flags);
>
> the fact that the script has a "use pythonish;" in it is entirely
> irrelevant--the program calls into the parser, which returns a status and
> possibly a parsed representation of the program. The parser gets to deal
> with all the grotty details.

What form is this intermediary parsed representation in? API, right? Then I need to clarify that when I say bytecode, I've meant whatever this intermediary parsed representation is, be that pure perl, API, or otherwise.

p