Re: SvPV*
Dave Storrs <[EMAIL PROTECTED]> writes:
>On Tue, 21 Nov 2000, Jarkko Hietaniemi wrote:
>
>> Yet another bummer of the current SVs is that they poorly fit into
>> 'foreign memory' situations where the buffer is managed by something
>> else than Perl. "No, thank you, Perl, keep your greedy fingers off
>> this chunk. No, you may not play with it."
>
>	Out of curiosity, when might such a situation arise? When you
>are embedding C in Perl, perhaps?

Or calling an external library which returns a pointer to data.
Right now we _have_ to copy it as there is no way to tell perl
to (say) XFree() it rather than Safefree() it. Which is a pain when data
is big.

--
Nick Ing-Simmons <[EMAIL PROTECTED]>
Via, but not speaking for: Texas Instruments Ltd.
Re: SvPV*
Nick Ing-Simmons <[EMAIL PROTECTED]> wrote:
> Dave Storrs <[EMAIL PROTECTED]> writes:
> >On Tue, 21 Nov 2000, Jarkko Hietaniemi wrote:
> >
> >> Yet another bummer of the current SVs is that they poorly fit into
> >> 'foreign memory' situations where the buffer is managed by something
> >> else than Perl. "No, thank you, Perl, keep your greedy fingers off
> >> this chunk. No, you may not play with it."
> >
> >	Out of curiosity, when might such a situation arise? When you
> >are embedding C in Perl, perhaps?
>
> Or calling an external library which returns a pointer to data.
> Right now we _have_ to copy it as there is no way to tell perl
> to (say) XFree() it rather than Safefree() it. Which is a pain when data
> is big.

As long as destroy() is one of the vtable methods, then it should be
fairly easy for someone to write an SV wrapper type that calls a specific
free() - either fixed per type, or per SV.

Roll on perl6 :-)

Dave M.
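A minimal sketch of the destroy()-in-the-vtable idea, assuming a per-SV release routine. Every name here (foreign_sv, sv_vtable, foreign_sv_wrap, sv_free, library_free) is hypothetical and not part of any real Perl API; library_free merely stands in for a library's own deallocator such as XFree().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types: a value whose buffer belongs to someone else. */
typedef struct foreign_sv foreign_sv;

typedef struct {
    void (*destroy)(foreign_sv *sv);    /* called when the value dies */
} sv_vtable;

struct foreign_sv {
    const sv_vtable *vtable;
    char  *buf;                         /* memory not allocated by us */
    size_t len;
    void (*release)(void *);            /* per-SV routine that frees buf */
};

/* destroy(): hand the buffer back to its owner, then free the wrapper. */
void foreign_destroy(foreign_sv *sv)
{
    if (sv->release != NULL)
        sv->release(sv->buf);
    free(sv);
}

const sv_vtable foreign_vtable = { foreign_destroy };

/* Wrap an existing buffer without copying it. */
foreign_sv *foreign_sv_wrap(char *buf, size_t len, void (*release)(void *))
{
    foreign_sv *sv = malloc(sizeof *sv);
    if (sv == NULL)
        return NULL;
    sv->vtable  = &foreign_vtable;
    sv->buf     = buf;
    sv->len     = len;
    sv->release = release;
    return sv;
}

/* Generic code only ever goes through the vtable. */
void sv_free(foreign_sv *sv)
{
    assert(sv != NULL);
    sv->vtable->destroy(sv);
}

/* Stand-in for an external library's deallocator; counts its calls. */
int library_free_calls = 0;

void library_free(void *p)
{
    library_free_calls++;
    free(p);
}
```

The release pointer is stored per value here; fixing it per type instead would mean one vtable per foreign allocator and no extra field in the value.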
Re: Backtracking through the source
In message <[EMAIL PROTECTED]>
          Simon Cozens <[EMAIL PROTECTED]> wrote:

> I doubt it; I get the feeling that what Dan is talking about is infinite
> look-*behind*. Nine times out of ten, you won't need to redo your parsing,
> so having an infinite lookahead will just slow everything down.

I didn't say that having infinite lookahead was better than allowing
backtracking. I simply said that the two were equivalent and that any
problem that can be solved by one can be solved by the other.

> sub bar { ... }
> print foo bar();
>
> Now, having parsed this far, we know that foo is a filehandle that we're
> printing to, so we build up our op tree to print the results of calling bar()
> to a filehandle called foo; ooh, but what do we see now:
>
> sub foo { ... }
>
> Eek, foo was actually a subroutine, and we mean print(foo(bar())); need to
> redo our parse tree. That's when the lookbehind comes into play.

That's quite a nasty example for a number of reasons. Firstly you
might have to back up and reparse a very large amount of code as the
subroutine definition could be a very long way away from the print
statement.

Secondly in order to know that you needed to back up you'd have to
remember that you had had to guess that foo was a filehandle but
that it might also be a subroutine, and it raises a whole series of
questions about what other similar things you might need to remember.

Tom

--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
...Would you buy a Pontiac from this, er, man?
Re: The external interface for the parser piece
In message <[EMAIL PROTECTED]>
          Dan Sugalski <[EMAIL PROTECTED]> wrote:

> The third parameter is the flags parameter, and it's optional. If omitted
> or set to PERL_CHAR_SOURCE, the second parameter is treated as a standard
> null-terminated string. If set to PERL_COUNTED_SOURCE, the second parameter
> is treated as if it points to a stream of bytes, where the first four are
> the length of the source to be read followed by the source. If set to
> PERL_FILE_SOURCE it's assumed to be a FILE *, while if set to
> PERL_GENERATED_SOURCE it's assumed to be a pointer to a function that
> returns a char pointer. If it's OR'd with PERL_UTF8_SOURCE then the stream
> is assumed to be in UTF-8 format instead of platform native.

This all seems a bit horrible to me. That kind of overloading of
multiple meanings onto an argument is often a sign of a bad design
that could be improved.

Applying the maxim that any software design problem can be solved
with sufficient levels of abstraction I'd suggest that passing some
sort of abstract stream pointer would be better. Then there could
be different sorts of streams that provided the source from a string
or a file or whatever other wonderful data source somebody comes up
with.

The common case of parsing a string could of course be simplified
with a small wrapper function that created a string based stream
and then called the main parser entry point.

Tom

--
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
...A private sin is not so prejudicial in the world as a public indecency.
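The abstract-stream suggestion might look roughly like this. It is only a sketch: src_stream, string_stream, file_stream, parse_stream and parse_string are all made-up names, and parse_stream is a placeholder that merely counts the bytes it pulls, where a real parser entry point would tokenise them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical abstract source stream: the parser only sees read(). */
typedef struct src_stream src_stream;
struct src_stream {
    /* Copy up to max bytes into buf; return the count, 0 at end of input. */
    size_t (*read)(src_stream *self, char *buf, size_t max);
};

/* Stream backed by an in-memory string. base comes first, so a pointer
   to base can be converted back to a pointer to the whole struct. */
typedef struct {
    src_stream  base;
    const char *text;
    size_t      pos, len;
} string_stream;

size_t string_read(src_stream *self, char *buf, size_t max)
{
    string_stream *s = (string_stream *)self;
    size_t n = s->len - s->pos;
    if (n > max)
        n = max;
    memcpy(buf, s->text + s->pos, n);
    s->pos += n;
    return n;
}

/* Stream backed by a stdio FILE. */
typedef struct {
    src_stream base;
    FILE      *fp;
} file_stream;

size_t file_read(src_stream *self, char *buf, size_t max)
{
    return fread(buf, 1, max, ((file_stream *)self)->fp);
}

/* Placeholder for the main parser entry point: drains the stream and
   returns the number of bytes seen. */
size_t parse_stream(src_stream *in)
{
    char   chunk[16];
    size_t n, total = 0;
    assert(in != NULL);
    while ((n = in->read(in, chunk, sizeof chunk)) > 0)
        total += n;
    return total;
}

/* Small wrapper for the common case of parsing a string. */
size_t parse_string(const char *text)
{
    string_stream s = { { string_read }, text, 0, strlen(text) };
    return parse_stream(&s.base);
}
```

A new kind of source (compressed, networked, generated) then only needs its own read function; the parser entry point and its flags do not change.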
Re: The external interface for the parser piece
On Tue, Nov 28, 2000 at 07:03:49PM +, Tom Hughes wrote:
> In message <[EMAIL PROTECTED]>
>           Dan Sugalski <[EMAIL PROTECTED]> wrote:
>
> > The third parameter is the flags parameter, and it's optional. If omitted
> > or set to PERL_CHAR_SOURCE, the second parameter is treated as a standard
> > null-terminated string. If set to PERL_COUNTED_SOURCE, the second parameter
> > is treated as if it points to a stream of bytes, where the first four are
> > the length of the source to be read followed by the source. If set to
> > PERL_FILE_SOURCE it's assumed to be a FILE *, while if set to
> > PERL_GENERATED_SOURCE it's assumed to be a pointer to a function that
> > returns a char pointer. If it's OR'd with PERL_UTF8_SOURCE then the stream
> > is assumed to be in UTF-8 format instead of platform native.
>
> This all seems a bit horrible to me. That kind of overloading of
> multiple meanings onto an argument is often a sign of a bad design
> that could be improved.

Agreed.

> Applying the maxim that any software design problem can be solved
> with sufficient levels of abstraction I'd suggest that passing some

A related warning sign is trying to cram different semantic levels or
types into same data. (C's "string model" being perhaps the most
obvious example, getchar() having to be an int is another, "0 but true"
a third...I want a "1 but false" :-)

> sort of abstract stream pointer would be better. Then there could
> be different sorts of streams that provided the source from a string
> or a file or whatever other wonderful data source somebody comes up
> with.

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
Re: Backtracking through the source
On Tue, Nov 28, 2000 at 06:58:57PM +, Tom Hughes wrote:
> I didn't say that having infinite lookahead was better than allowing
> backtracking. I simply said that the two were equivalent and that any
> problem that can be solved by one can be solved by the other.

Fair enough.

> That's quite a nasty example for a number of reasons. Firstly you
> might have to back up and reparse a very large amount of code as the
> subroutine definition could be a very long way away from the print
> statement.

You wouldn't have to reparse it all. You'd have to insert the new
information into the parse and see how that changes things. It'd probably
only change a very localised area, a single statement per occurrence at
most.

> Secondly in order to know that you needed to back up you'd have to
> remember that you had had to guess that foo was a filehandle but
> that it might also be a subroutine, and it raises a whole series of
> questions about what other similar things you might need to remember.

Parsing Perl is not easy. :) At some points, you have to say, well, heck,
I don't *know* what this token is. At the moment, perl guesses, and it
guesses reasonably well. But guessing something wrongly which you could
have got right if you'd read the next line strikes me as a little
anti-DWIM.

In a sense, though, you're right; this is a general problem. I'm currently
trying to work out a design for a tokeniser, and it seems to me that
there's going to be a lot of communicating of "hints" between the
tokeniser, the lexer and the parser. The other alternative is to
completely conflate the three, which would work but I think people would
lose their minds.

Take, for instance:

    ${function($value)}[$val]

Now, how on earth do I split this into tokens? Do I say:

    /${/ - and expect some stuff which will resolve to a variable name or
           array reference, followed by a }

If we go that way, we're passing lots of hints to both the lexer and the
parser.

    /${[^}]+}/ and then /\[[^]]+\]/

If we do that, we have to keep state between the two tokens so that we
don't make [$val] into a reference constructor and stuff up the parser.

    /^${([^}]+)}\[([^\])]/ - Dereference $1 as an array, take value $2.

If we do *that*, then we're already being tokeniser, lexer and parser
rolled into one.

Parsing Perl is hard. Trust me. :)

--
MISTAKES:
It Could Be That The Purpose Of Your Life Is Only To Serve As A Warning
To Others
http://www.despair.com
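To make the "keep state between the two tokens" option concrete, here is a toy sketch with invented names throughout (lex_state, bracket_kind, skip_braces, classify_bracket, bracket_after_deref). It assumes a single remembered hint, namely whether a complete term has just been read, which is far less than a real Perl tokeniser would need.

```c
#include <assert.h>
#include <stddef.h>

/* The one hint carried from one token to the next in this toy. */
typedef enum { AFTER_TERM, AFTER_OPERATOR } lex_state;
typedef enum { SUBSCRIPT, ARRAY_CONSTRUCTOR } bracket_kind;

/* Return the index just past the '}' matching the '{' at s[open],
   or 0 if the braces never balance. Nested braces are counted. */
size_t skip_braces(const char *s, size_t open)
{
    int    depth = 0;
    size_t i;
    for (i = open; s[i] != '\0'; i++) {
        if (s[i] == '{')
            depth++;
        else if (s[i] == '}' && --depth == 0)
            return i + 1;
    }
    return 0;
}

/* A '[' right after a term indexes it; anywhere else it starts an
   anonymous array constructor. */
bracket_kind classify_bracket(lex_state state)
{
    return state == AFTER_TERM ? SUBSCRIPT : ARRAY_CONSTRUCTOR;
}

/* For input starting with ${...}, classify the '[' that follows the
   block. Returns 1 and sets *out on success, 0 if the shape is wrong. */
int bracket_after_deref(const char *s, bracket_kind *out)
{
    size_t end;
    assert(out != NULL);
    if (s[0] != '$' || s[1] != '{')
        return 0;
    end = skip_braces(s, 1);
    if (end == 0 || s[end] != '[')
        return 0;
    *out = classify_bracket(AFTER_TERM);   /* ${...} was a complete term */
    return 1;
}
```

Counting brace depth rather than matching up to the first } is a deliberate swap here: a pattern that stops at the first closing brace would cut a block containing nested braces short.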
Re: Backtracking through the source
At 06:58 PM 11/28/00 +, Tom Hughes wrote:
>In message <[EMAIL PROTECTED]>
>           Simon Cozens <[EMAIL PROTECTED]> wrote:
>
> > I doubt it; I get the feeling that what Dan is talking about is infinite
> > look-*behind*. Nine times out of ten, you won't need to redo your parsing,
> > so having an infinite lookahead will just slow everything down.
>
>I didn't say that having infinite lookahead was better than allowing
>backtracking. I simply said that the two were equivalent and that any
>problem that can be solved by one can be solved by the other.

The big reason I was thinking about this is because I'd like to be able to
chop pieces off the front of the 'to be parsed' stream, so regexes can
start with ^ instead of something else, but that's a minor issue, and one
that's likely to be caught early-on in testing--I can't see us sending out
a production release of perl with a whoops like that in the parser source.

> > sub bar { ... }
> > print foo bar();
> >
> > Now, having parsed this far, we know that foo is a filehandle that we're
> > printing to, so we build up our op tree to print the results of calling bar()
> > to a filehandle called foo; ooh, but what do we see now:
> >
> > sub foo { ... }
> >
> > Eek, foo was actually a subroutine, and we mean print(foo(bar())); need to
> > redo our parse tree. That's when the lookbehind comes into play.
>
>That's quite a nasty example for a number of reasons. Firstly you
>might have to back up and reparse a very large amount of code as the
>subroutine definition could be a very long way away from the print
>statement.

Luckily that particular case can probably be dealt with without reparsing
source--we'd just have to tromp back through the syntax tree and change
the attribute of a node somewhere. (Or leave it until runtime, I suppose)
It's also something that will be dealt with by the bytecode compiler
rather than the parser, and it'll have a fully-enough parsed program to
deal with things appropriately.

(I suppose it would mean that if someone did this in a BEGIN block that
the code in the BEGIN would treat foo as a filehandle but the rest of the
program would treat it as a sub call, which could be an issue)

>Secondly in order to know that you needed to back up you'd have to
>remember that you had had to guess that foo was a filehandle but
>that it might also be a subroutine, and it raises a whole series of
>questions about what other similar things you might need to remember.

Well, we do have the syntax tree, and can make whatever notes we want in
the stash of the interpreter we're dealing with. Maybe an "indeterminate"
on the foo slot or something.

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
Re: The external interface for the parser piece
Dan Sugalski wrote:
>
>int perl6_parse(PerlInterp *interp,
>                void *source,
>                int flags,
>                void *extra_pointer);

Given that other things may want to be streamable in similar fashion (eg
the regular expression engine), why not have a PerlDataSource union or
somesuch that encapsulates all of the possibilities of the final three
arguments? Or put all possibilities into a PerlIO*? That gives direct
support for compressed source, source streamed over a network socket,
etc., with a more common framework than PERL_GENERATED_SOURCE.

Things like PERL_CHAR_SOURCE meaning nul-terminated char* sound
unnecessarily specific.

Also, you gave two options: nul-terminated and length-first. What about a
"chunked" encoding, where you get multiple length-first chunks of the
input (as in HTTP/1.1's Transfer-Encoding: chunked, for one example of
many)? Or are nuls explicitly forbidden in source code?

And, in a related question, the above interface implies that you call
perl6_parse once. Will this be good enough, or do you want to have a
PerlParseState* in/out parameter that allows restarting a parse once you
get more of the input available? (With this, you don't need an explicit
chunked encoding, since the caller can deal with that without being
required to buffer the whole thing in memory before calling perl6_parse.)
Or would that go into the PerlInterp too?

And finally, how do I get the output out of the PerlInterp? Is it stored
under some variable name, or does the PerlInterp start out empty and gain
the parsed syntax tree as its only syntax tree, or ? (The latter sounds
messy if the PerlInterp is also running code, code that wants to call
some standard utility functions implemented in Perl.)

Maybe I'm not making sense.
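The restartable-parse question can be illustrated with a toy in/out state object. This is a sketch under heavy assumptions: parse_state, parse_feed and the status names are invented, and "complete" here just means every '{' seen so far has been closed, which stands in for real syntax analysis.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PARSE_COMPLETE, PARSE_NEED_MORE, PARSE_ERROR } parse_status;

/* Hypothetical in/out state that survives between calls. */
typedef struct {
    int    depth;       /* '{' seen so far that are still unclosed */
    size_t consumed;    /* total bytes fed in across all calls */
} parse_state;

void parse_state_init(parse_state *st)
{
    assert(st != NULL);
    st->depth = 0;
    st->consumed = 0;
}

/* Feed one chunk. The caller keeps calling with more input for as long
   as PARSE_NEED_MORE comes back; nothing is buffered in between. */
parse_status parse_feed(parse_state *st, const char *chunk, size_t len)
{
    size_t i;
    assert(st != NULL);
    for (i = 0; i < len; i++) {
        if (chunk[i] == '{')
            st->depth++;
        else if (chunk[i] == '}' && --st->depth < 0)
            return PARSE_ERROR;         /* closing brace with no opener */
    }
    st->consumed += len;
    return st->depth == 0 ? PARSE_COMPLETE : PARSE_NEED_MORE;
}
```

With an interface of this shape a chunked transport needs no special flag: the caller unpacks each chunk and feeds it in as it arrives.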
Re: The external interface for the parser piece
On Mon, 27 Nov 2000, Dan Sugalski wrote:

> ---
>
>int perl6_parse(PerlInterp *interp,
>                void *source,
>                int flags,
>                void *extra_pointer);
>
> The third parameter is the flags parameter, and it's optional. If omitted
> or set to PERL_CHAR_SOURCE, the second parameter is treated as a standard
> null-terminated string. If set to PERL_COUNTED_SOURCE, the second parameter
> is treated as if it points to a stream of bytes, where the first four are
> the length of the source to be read followed by the source. If set to

Since you have a fourth argument couldn't that be used for the length
of the byte stream rather than embedding that length into the byte stream
itself? Makes more sense to me to separate the bytes from the length.

--
Tim Jenness
JCMT software engineer/Support scientist
http://www.jach.hawaii.edu/~timj
Re: The external interface for the parser piece
At 07:03 PM 11/28/00 +, Tom Hughes wrote:
>In message <[EMAIL PROTECTED]>
>           Dan Sugalski <[EMAIL PROTECTED]> wrote:
>
> > The third parameter is the flags parameter, and it's optional. If omitted
> > or set to PERL_CHAR_SOURCE, the second parameter is treated as a standard
> > null-terminated string. If set to PERL_COUNTED_SOURCE, the second parameter
> > is treated as if it points to a stream of bytes, where the first four are
> > the length of the source to be read followed by the source. If set to
> > PERL_FILE_SOURCE it's assumed to be a FILE *, while if set to
> > PERL_GENERATED_SOURCE it's assumed to be a pointer to a function that
> > returns a char pointer. If it's OR'd with PERL_UTF8_SOURCE then the stream
> > is assumed to be in UTF-8 format instead of platform native.
>
>This all seems a bit horrible to me. That kind of overloading of
>multiple meanings onto an argument is often a sign of a bad design
>that could be improved.

Sure, that's distinctly possible. I'm shooting for extreme simplicity in
the standard case here, but that doesn't mean I'm hitting it. (Or anything
else for that matter)

>Applying the maxim that any software design problem can be solved
>with sufficient levels of abstraction I'd suggest that passing some
>sort of abstract stream pointer would be better. Then there could
>be different sorts of streams that provided the source from a string
>or a file or whatever other wonderful data source somebody comes up
>with.

Right, and I called my abstract stream "void *source". :)

>The common case of parsing a string could of course be simplified
>with a small wrapper function that created a string based stream
>and then called the main parser entry point.

That means another function in the API. I suppose perl_parse_string() and
perl_parse_file() are valid options. I'd rather keep the API that
embedders will be using as small as possible, but two functions with
simple names and parameters may be better than one function with mildly
odd parameters in the non-trivial case.

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
Re: The external interface for the parser piece
At 01:25 PM 11/28/00 -0600, Jarkko Hietaniemi wrote:
>On Tue, Nov 28, 2000 at 07:03:49PM +, Tom Hughes wrote:
> > Applying the maxim that any software design problem can be solved
> > with sufficient levels of abstraction I'd suggest that passing some
>
>A related warning sign is trying to cram different semantic levels or
>types into same data. (C's "string model" being perhaps the most
>obvious example, getchar() having to be an int is another, "0 but true"
>a third...I want a "1 but false" :-)

Which ways is that one being violated? (I can think of a couple
personally... :)

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
Re: The external interface for the parser piece
At 09:10 AM 11/28/00 -1000, Tim Jenness wrote:
>On Mon, 27 Nov 2000, Dan Sugalski wrote:
>
> > ---
> >
> >int perl6_parse(PerlInterp *interp,
> >                void *source,
> >                int flags,
> >                void *extra_pointer);
> >
> > The third parameter is the flags parameter, and it's optional. If omitted
> > or set to PERL_CHAR_SOURCE, the second parameter is treated as a standard
> > null-terminated string. If set to PERL_COUNTED_SOURCE, the second parameter
> > is treated as if it points to a stream of bytes, where the first four are
> > the length of the source to be read followed by the source. If set to
>
>Since you have a fourth argument couldn't that be used for the length
>of the byte stream rather than embedding that length into the byte stream
>itself? Makes more sense to me to separate the bytes from the length.

I'd rather the stream be self-contained, rather than needing an extra
argument for the length. Counted strings aren't uncommon outside of C, and
there's no reason a Fortran or COBOL (or Java, or...) program can't embed
perl.

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
Re: The external interface for the parser piece
On Tue, Nov 28, 2000 at 03:35:37PM -0500, Dan Sugalski wrote:
> > > is treated as if it points to a stream of bytes, where the first four are

I spy magic number.

> > > the length of the source to be read followed by the source. If set to
> >
> >Since you have a fourth argument couldn't that be used for the length
> >of the byte stream rather than embedding that length into the byte stream
> >itself? Makes more sense to me to separate the bytes from the length.
>
> I'd rather the stream be self-contained, rather than needing an extra
> argument for the length. Counted strings aren't uncommon outside of C, and
> there's no reason a Fortran or COBOL (or Java, or...) program can't embed perl.

Why four? Surely that's imposing an arbitrary binary structure. If it's a
parameter then it's (probably) a machine register and certainly a "natural"
quantity for whatever's running the code (and automatically the correct
endian-ness just in case perl is running in some (oddball partial)
binary emulation environment). Erm. Or something like that.

I forget the source of the quote, but it was to the effect of
C is the only language where not just the binaries but also the source is
not portable.

Say you'd said 2 not 4.

struct counted_file {
  short count;
  struct {
    char bytes[1];
  } file;
};

erm. can't have bytes[0]; because that's not portable.
Can't really be short because who said that that was 2 bytes?
For that matter I know of one compiler which doesn't have any type
with sizeof 2, and sizeof (struct counted_file) is 8 here on this arm machine
:-) Weirdo but ANSI compliant alignment constraints.
[yes, I forced that one using the second struct inside the first]

Nicholas Clark
Re: To get things started...
Bart Lateur <[EMAIL PROTECTED]> writes:
>
>But what if you choose wrong, forgot a really important one, and this
>instruction gets a multibyte representation? We're stuck with it
>forever...?
>
>I have had some thoughts on "dynamic opcodes", where the meaning of
>opcode bytes needn't be fixed, but can be dynamically assigned,
>depending on how often they occur (for example). A bit like how a
>Huffman compressor may choose shorter representations for the most
>occurring byte patterns.

This is just like HW processor opcodes. x86 has lasted so well because
the initial guess at the short/common opcodes was not too bad.
But the escape bytes are getting out of hand now...

--
Nick Ing-Simmons <[EMAIL PROTECTED]>
Via, but not speaking for: Texas Instruments Ltd.
Re: The external interface for the parser piece
On Tue, Nov 28, 2000 at 03:34:22PM -0500, Dan Sugalski wrote:
> At 01:25 PM 11/28/00 -0600, Jarkko Hietaniemi wrote:
> >On Tue, Nov 28, 2000 at 07:03:49PM +, Tom Hughes wrote:
> > > Applying the maxim that any software design problem can be solved
> > > with sufficient levels of abstraction I'd suggest that passing some
> >
> >A related warning sign is trying to cram different semantic levels or
> >types into same data. (C's "string model" being perhaps the most
> >obvious example, getchar() having to be an int is another, "0 but true"
> >a third...I want a "1 but false" :-)
>
> Which ways is that one being violated? (I can think of a couple
> personally... :)

Embedding the (fixed-length) length into the data. As Nicholas points
out, that is naughty. Remember:

	sizeof(char) >= sizeof(short) >= sizeof(int) >= sizeof(long)
	sizeof(char) == 1

(IIRC) are the only guarantees you get. No structure alignment/padding
guarantees. Let's pick a platform that would have difficulties:
Cray C-series (nowadays called SV-series, I think). There's *no*
integer data type four bytes wide (or two bytes, for that matter).
It's either 1 (char), or 8.

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
Re: The external interface for the parser piece
Err, this seems a little too Swiss Army Knife.

This reads like a utility function. (i.e. A function that handles the
most common scenario.)

Shouldn't a lower-level API be visible? One that seems
to pop out at me is some way of actually parsing a piece of code and
ending up with a handle on a syntax tree. And ways of adding and removing
these pieces.

These are abstract functions that would be needed on the interior of the
parser, but a bottom up approach may be more appropriate here.

I also like the suggestion that rather than supply flags, we should
follow the lead and supply a Perl* something that would return an
appropriate bunch of text to the parser.

> "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes:

DS> While I'm not sure of the structure of the internals of the parsing piece
DS> of perl at the moment (and, unfortunately, language parsers aren't one of
DS> my strong points), I am reasonably certain of the interface we'll present
DS> to the rest of the world and the other pieces of perl. So... comments?

DS> ---

DS>int perl6_parse(PerlInterp *interp,
DS>                void *source,
DS>                int flags,
DS>                void *extra_pointer);

DS> The first parameter is a pointer to a perl interpreter--this'll be used if
DS> any code needs to be executed, as well as being a repository for any
DS> variables that compiled code may set. (Standard stash stuff) The syntax
DS> tree the parser generates will also be embedded here. (One fewer parameter
DS> to deal with, and one fewer thing for an embedding program to track)

DS> The second parameter is a pointer to the source to be compiled. This is
DS> generally a char pointer, but it may also be a FILE * or a pointer to a
DS> function that returns a char pointer.

DS> The third parameter is the flags parameter, and it's optional. If omitted
DS> or set to PERL_CHAR_SOURCE, the second parameter is treated as a standard
DS> null-terminated string. If set to PERL_COUNTED_SOURCE, the second parameter
DS> is treated as if it points to a stream of bytes, where the first four are
DS> the length of the source to be read followed by the source. If set to
DS> PERL_FILE_SOURCE it's assumed to be a FILE *, while if set to
DS> PERL_GENERATED_SOURCE it's assumed to be a pointer to a function that
DS> returns a char pointer. If it's OR'd with PERL_UTF8_SOURCE then the stream
DS> is assumed to be in UTF-8 format instead of platform native.

DS> The fourth parameter is only used if the flags are set to
DS> PERL_GENERATED_SOURCE, in which case it is passed back to the function
DS> whose pointer we got as parameter two.

DS> Dan

DS> --"it's like this"---
DS> Dan Sugalski                          even samurai
DS> [EMAIL PROTECTED]                         have teddy bears and even
DS>                                       teddy bears get drunk

--
Chaim Frenkel                            Nonlinear Knowledge, Inc.
[EMAIL PROTECTED]                        +1-718-236-0183
Re: The external interface for the parser piece
On Tue, Nov 28, 2000 at 03:15:35PM -0600, Jarkko Hietaniemi wrote:
> On Tue, Nov 28, 2000 at 03:34:22PM -0500, Dan Sugalski wrote:
> > At 01:25 PM 11/28/00 -0600, Jarkko Hietaniemi wrote:
> > >On Tue, Nov 28, 2000 at 07:03:49PM +, Tom Hughes wrote:
> > > > Applying the maxim that any software design problem can be solved
> > > > with sufficient levels of abstraction I'd suggest that passing some
> > >
> > >A related warning sign is trying to cram different semantic levels or
> > >types into same data. (C's "string model" being perhaps the most
> > >obvious example, getchar() having to be an int is another, "0 but true"
> > >a third...I want a "1 but false" :-)
> >
> > Which ways is that one being violated? (I can think of a couple
> > personally... :)
>
> Embedding the (fixed-length) length into the data. As Nicholas points
> out, that is naughty. Remember:
>
> 	sizeof(char) >= sizeof(short) >= sizeof(int) >= sizeof(long)

I think it's time for me to go home for today. Please reverse the >
signs as you read :-)

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
Re: The external interface for the parser piece
At 03:15 PM 11/28/00 -0600, Jarkko Hietaniemi wrote:
>On Tue, Nov 28, 2000 at 03:34:22PM -0500, Dan Sugalski wrote:
> > At 01:25 PM 11/28/00 -0600, Jarkko Hietaniemi wrote:
> > >On Tue, Nov 28, 2000 at 07:03:49PM +, Tom Hughes wrote:
> > > > Applying the maxim that any software design problem can be solved
> > > > with sufficient levels of abstraction I'd suggest that passing some
> > >
> > >A related warning sign is trying to cram different semantic levels or
> > >types into same data. (C's "string model" being perhaps the most
> > >obvious example, getchar() having to be an int is another, "0 but true"
> > >a third...I want a "1 but false" :-)
> >
> > Which ways is that one being violated? (I can think of a couple
> > personally... :)
>
>Embedding the (fixed-length) length into the data. As Nicholas points
>out, that is naughty. Remember:
>
>	sizeof(char) >= sizeof(short) >= sizeof(int) >= sizeof(long)
>	sizeof(char) == 1
>
>(IIRC) are the only guarantees you get. No structure alignment/padding
>guarantees. Let's pick a platform that would have difficulties:
>Cray C-series (nowadays called SV-series, I think). There's *no*
>integer data type four bytes wide (or two bytes, for that matter).
>It's either 1 (char), or 8.

There's always:

   length = (getc() * 256) + getc()) * 256) + getc()) * 256) + getc()

give or take a few parens...

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
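With the parens balanced, the byte-at-a-time approach can be sketched as below. It assumes the four length octets are stored most significant first and that they sit in a memory buffer rather than behind getc(); read_length_prefix and counted_source are illustrative names only. unsigned long is used because standard C guarantees it at least 32 bits, so no exactly-four-byte integer type is required.

```c
#include <assert.h>
#include <stddef.h>

/* Decode a 4-octet, most-significant-first length prefix one octet at
   a time. Masking with 0xFF keeps this correct even where a char is
   wider than 8 bits. */
unsigned long read_length_prefix(const unsigned char *p)
{
    assert(p != NULL);
    return ((unsigned long)(p[0] & 0xFF) << 24)
         | ((unsigned long)(p[1] & 0xFF) << 16)
         | ((unsigned long)(p[2] & 0xFF) << 8)
         |  (unsigned long)(p[3] & 0xFF);
}

/* Split a counted buffer into its length and a pointer to the payload. */
const unsigned char *counted_source(const unsigned char *buf,
                                    unsigned long *len)
{
    assert(len != NULL);
    *len = read_length_prefix(buf);
    return buf + 4;
}
```

Because the value is assembled arithmetically, neither the host's byte order nor its struct padding rules affect the result.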
Re: The external interface for the parser piece
At 09:05 PM 11/28/00 +, Nicholas Clark wrote:
>On Tue, Nov 28, 2000 at 03:35:37PM -0500, Dan Sugalski wrote:
> > > > is treated as if it points to a stream of bytes, where the first four are
>
>I spy magic number.

Nah. 32-bit length. If someone needs to pass us more than 4G of source
code, I do *not* want to know about it. :)

> > > > the length of the source to be read followed by the source. If set to
> > >
> > >Since you have a fourth argument couldn't that be used for the length
> > >of the byte stream rather than embedding that length into the byte stream
> > >itself? Makes more sense to me to separate the bytes from the length.
> >
> > I'd rather the stream be self-contained, rather than needing an extra
> > argument for the length. Counted strings aren't uncommon outside of C, and
> > there's no reason a Fortran or COBOL (or Java, or...) program can't embed perl.
>
>Why four? Surely that's imposing an arbitrary binary structure. If it's a
>parameter then it's (probably) a machine register and certainly a "natural"
>quantity for whatever's running the code (and automatically the correct
>endian-ness just in case perl is running in some (oddball partial)
>binary emulation environment). Erm. Or something like that.

It's not necessarily in a register. In at least some of the languages I
named (and you can add BASIC and Pascal to the list as well), a string
consists of a length and data pointer pair, usually together. What's handy
is a pointer to the data structure, not the length and a pointer to the
buffer.

Of course, for some of those languages the lengths are 16-bit quantities.
Damn.

>I forget the source of the quote, but it was to the effect of
>C is the only language where not just the binaries but also the source is
>not portable.
>
>Say you'd said 2 not 4.
>
>struct counted_file {
>  short count;
>  struct {
>    char bytes[1];
>  } file;
>};
>
>erm. can't have bytes[0]; because that's not portable.

That'd probably be:

struct counted_string {
  int length;
  char data[];
};

which is legal ANSI C. Not that it helps with the size of an int issue,
though.

>Can't really be short because who said that that was 2 bytes?
>For that matter I know of one compiler which doesn't have any type
>with sizeof 2, and sizeof (struct counted_file) is 8 here on this arm machine
>:-) Weirdo but ANSI compliant alignment constraints.
>[yes, I forced that one using the second struct inside the first]

Y'know, I really loathe C. Really, really, loathe it.

Anyway, regardless of the platform, there is *some* way to force this to
work--if there weren't, then implementing things like a TCP stack would be
pretty much impossible. Counted strings should probably just have either a
platform-native int in front, or a 32-bit int in network format, both of
which should be doable on any platform that perl deals with.

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
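For the length-plus-data layout, a short sketch of allocating such a struct follows. Two assumptions to flag: the trailing `char data[]` is a flexible array member, which needs a C99 compiler (C89 has no such thing), and the length field is a size_t here rather than the int in the message. counted_string_new is a made-up helper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct counted_string {
    size_t length;      /* assumption: size_t instead of int */
    char   data[];      /* flexible array member (C99) */
};

/* Allocate header and payload in one block and copy n bytes in.
   Embedded NUL bytes are fine because the length is explicit. */
struct counted_string *counted_string_new(const char *bytes, size_t n)
{
    struct counted_string *s;
    assert(bytes != NULL);
    s = malloc(sizeof *s + n);
    if (s == NULL)
        return NULL;
    s->length = n;
    memcpy(s->data, bytes, n);
    return s;
}
```

This is the in-memory shape only. Handing it to another process or another language would still need an agreed width and byte order for the length.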
Re: To get things started...
Nicholas Clark <[EMAIL PROTECTED]> writes:
>On Mon, Nov 27, 2000 at 05:17:47PM +, Nicholas Clark wrote:
>> On Mon, Nov 27, 2000 at 11:09:03AM -0500, Chaim Frenkel wrote:
>> > > "ST" == Sam Tregar <[EMAIL PROTECTED]> writes:
>>
>> > Look through the RFCs this was one of Damian Conway's.
>> >
>> > =~ /RFC/
>>
>> http://dev.perl.org/rfc/93.html
>>
>> I know I read it, I just don't remember reading it.
>>
>> IMPLEMENTATION
>>
>> Dammit, Jim, I'm a doctor, not an magician!
>>
>> Probably needs to be integrated with IO disciplines too.
>>
>> He's right, but Nick's intending to implement an unread() (rather than just
>> ungetc()) so there should be enough rope for people to implement whatever
>> knots take their fancy (including the Jack Ketch knot)
>>
>> Hugo makes some comments about implementation of this in:
>> http:[EMAIL PROTECTED]/msg00459.html
>
>Bah. meant to add that it might be logical for
>
> =~ /RFC/
>
>to seek to the beginning of the file before it starts
>
> =~ /\GRFC/gc
>
>carries on from the previous position and doesn't seek back to the beginning
>(or otherwise throw all the buffered data away)
>
>Which effectively makes pos analogous to seek/tell.

I was musing on how to make "layers" visible to perl code.
And using pos() to point at the current position in the buffer
(note the _buffer_ not the _file_) was one idea I came up with.

>So do we get rid of pos and seek() our scalars? :-)

Keep pos() and lose seek ;-)

>It also allows the possibility of pos on file handles being fsetpos/fgetpos
>Maybe that should have been an rfc 3 months ago, and really doesn't even
>matter if perlio obsoletes stdio and internalises stdio's distinction between
>text and binary streams.
>
>BTW I am serious about needing a /gc not to chuck the buffered data.
>
>It makes something like
>
> @found = =~ /RFC +(\d+)/;
>
>not spend time stacking a lot of data back that's only about to be discarded.
>
>But this isn't internals really, is it? I'm miles off topic.
>
>Nicholas Clark

--
Nick Ing-Simmons <[EMAIL PROTECTED]>
Via, but not speaking for: Texas Instruments Ltd.
Re: The external interface for the parser piece
At 04:23 PM 11/28/00 -0500, Chaim Frenkel wrote: >Err, this seems a little too Swiss Army Knife. > >This reads like a utility function. (i.e. A function that handles the >most common scenario.) What it's supposed to be is the highest-level interface to the parser, and so it's supposed to handle all the common cases without requiring whoever's using it to read more than half a page of documentation, total. >Shouldn't a set of lower level visible API be visible? One that seems >to pop out at me is some way of actually parsing a piece of code and >ending up with a handle on a syntax tree. And ways of adding and removing >these pieces. Sure. That would be the internal API bit. Nobody's put anything solid forward yet for that bit. Anyone? Anyone? Bueller? >These are abstract functions that would be needed on the interior of the >parser, but a bottom up approach may be more appropriate here. Sure. Suggestions? >I also like the suggestion that rather than supply flags, we should >follow the lead and supply a Perl* something that would return an >appropriate bunch of text to the parser. I'd really rather not, since that would place the burden of knowing too much about the guts of perl on whoever's using it. I don't want to do that. Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Backtracking through the source
> Is there any reasonable case where we would need to backtrack over > successfully parsed source and redo the parsing? I'm not talking about the > case where regular expressions run over text and ultimately fail, but > rather cases where we need to chuck out part of what we have and restart? ]- I think we should have this possibility. Of course, if something can be solved w/o backtracking, it will be solved that way. The perl parser will parse not only Perl but also many other target languages, and one more feature (backtracking, score-based decisions, lookbehind, etc.) will make it better, not worse. = iVAN [EMAIL PROTECTED] =
Re: The external interface for the parser piece
At 09:48 AM 11/28/00 -0800, Steve Fink wrote: >Dan Sugalski wrote: > > > >int perl6_parse(PerlInterp *interp, > >void *source, > >int flags, > >void *extra_pointer); > >Given that other things may want to be streamable in similar fashion (eg >the regular expression engine), why not have a PerlDataSource union or >somesuch that encapsulates all of the possibilities of the final three >arguments? Or put all possibilities into a PerlIO*? That gives direct >support for compressed source, source streamed over a network socket, >etc., with a more common framework than PERL_GENERATED_SOURCE. Embedding is the big reason. This interface should be simple for embedding programs, most of which will either pass in a C filehandle or a plain char* with source in it. That's why there's no fancy structures or anything that go in. (Well, besides the perlinterp structure, but that's pretty much a magic cookie as far as programs are concerned) >Things like PERL_CHAR_SOURCE meaning nul-terminated char* sound >unnecessarily specific. Well, it is the most common type of string that perl's going to see, which is why it's in there. UTF-8's the next most likely one, hence the flag. >Also, you gave two options: nul-terminated and length-first. What about >a "chunked" encoding, where you get multiple length-first chunks of the >input (as in HTTP/1.1's Transfer-Encoding: chunked, for one example of >many)? Or are nuls explicitly forbidden in source code? Nulls aren't explicitly forbidden, but they're real inconvenient in C-style strings, hence the length option. (Plus we might be able to do Clever Things if we know the length) I'm not sure how UTF-8 jammed into C strings works either, since IIRC there can be null bytes in a UTF-8 data stream. Nulls are OK in the source on disk, though they're still annoying inside a C program. (Like, say, perl... :) >And, in a related question, the above interface appears that you call >perl6_parse once. 
Will this be good enough, or do you want to have a >PerlParseState* in/out parameter that allows restarting a parse once you >get more of the input available? (With this, you don't need an explicit >chunked encoding, since the caller can deal with that without being >required to buffer the whole thing in memory before calling >perl6_parse.) Or would that go into the PerlInterp too? What I was thinking, but didn't say, is that for the PERL_GENERATED_SOURCE case we'd just call the function provided over and over until it returns NULL, at which point we assume it's all done. So for the chunked text case, each call to the function would return a chunk, and the function would return NULL when it's run out of chunks. >And finally, how do I get the output out of the PerlInterp? Is it stored >under some variable name, or does the PerlInterp start out empty and >gains the parsed syntax tree as its only syntax tree, or ? (The latter >sounds messy if the PerlInterp is also running code, code that wants to >call some standard utility functions implemented in Perl.) Maybe I'm not >making sense. It's stored in the PerlInterp structure. Where I don't know, but that can be put off for later. Dan --"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
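A minimal sketch of the "call the function over and over until it returns NULL" idea for the PERL_GENERATED_SOURCE case. Nothing here is an agreed API: the callback type `source_fn`, the `chunk_list` state and `drain_source()` (a stand-in for the parser's input loop) are all made up for illustration, and it assumes each chunk is a nul-terminated string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: returns the next nul-terminated chunk of
 * source, or NULL once there is nothing left. 'extra' plays the role of
 * the extra_pointer argument. */
typedef const char *(*source_fn)(void *extra);

/* Example caller-side state: a fixed array of chunks to hand out. */
struct chunk_list {
    const char **chunks;
    size_t count;
    size_t next;     /* index of the next chunk to return */
};

/* A generator that walks the chunk list and then reports end of input. */
const char *next_chunk(void *extra)
{
    struct chunk_list *cl = extra;
    if (cl->next >= cl->count)
        return NULL;
    return cl->chunks[cl->next++];
}

/* Stand-in for the parser's read loop: keeps calling the generator until
 * it returns NULL and reports how many bytes of source it was given. */
size_t drain_source(source_fn fn, void *extra)
{
    size_t total = 0;
    const char *chunk;

    assert(fn != NULL);
    while ((chunk = fn(extra)) != NULL)
        total += strlen(chunk);
    return total;
}
```

With this shape the caller never has to buffer the whole program: a chunked HTTP body, say, would map onto one callback invocation per chunk.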
Basic embedding [was: Re: The external interface for the parser piece]
--- Steve Fink <[EMAIL PROTECTED]> wrote: > Dan Sugalski wrote: > > > >int perl6_parse(PerlInterp *interp, > >void *source, > >int flags, > >void *extra_pointer); > > Given that other things may want to be streamable in > similar fashion (eg > the regular expression engine), why not have a > PerlDataSource union or > somesuch that encapsulates all of the possibilities of > the final three > arguments? Or put all possibilities into a PerlIO*? That > gives direct > support for compressed source, source streamed over a > network socket, > etc., with a more common framework than > PERL_GENERATED_SOURCE. Hear, hear! This is almost an embedding issue, though (cc-ing perl6-internals-api-embed): How much of the standard perl RTL is _required_ (I.e., PerlIO, perl malloc, etc.). Offhand, I think that there is a very strong case to require at least the basic PerlIO, since without it, perl6 can't count on having a non-bugridden I/O library, and also can't take advantage of PerlIO's non-std. features (whatever they end up being). > Things like PERL_CHAR_SOURCE meaning nul-terminated char* > sound > unnecessarily specific. > > Also, you gave two options: nul-terminated and > length-first. What about > a "chunked" encoding, where you get multiple length-first > chunks of the > input (as in HTTP/1.1's Transfer-Encoding: chunked, for > one example of > many)? Or are nuls explicitly forbidden in source code? > > And, in a related question, the above interface appears > that you call > perl6_parse once. Will this be good enough, or do you > want to have a > PerlParseState* in/out parameter that allows restarting a > parse once you > get more of the input available? (With this, you don't > need an explicit > chunked encoding, since the caller can deal with that > without being > required to buffer the whole thing in memory before > calling > perl6_parse.) Or would that go into the PerlInterp too? > > And finally, how do I get the output out of the > PerlInterp? 
Is it stored > under some variable name, or does the PerlInterp start > out empty and > gains the parsed syntax tree as its only syntax tree, or > ? (The latter > sounds messy if the PerlInterp is also running code, code > that wants to > call some standard utility functions implemented in > Perl.) Maybe I'm not > making sense. This sort of leads into an idea I've been having about what defines an interpreter. I've sort of been musing on the following embedding interface: /* inits subsystems: PerlIO,memory,etc. call once at start of program */ int perl_boot(); /* subsystem shutdown - call at program shutdown */ int perl_shutdown(); /* a perl6 interpreter - defines complete interpreter */ typedef struct _perl_interp perl_interpreter; typedef struct _perl_thread perl_thread; struct _perl_interp { perl_thread *thread_list; perl_thread *root_thread; /* "top-level" thread - used to parse the primary script (or provide an arbitrary perl_thread for embedders) */ HV *shared_stash; /* subroutines are global to an interpreter */ HV *subroutine_stash; ... }; /* a thread of execution in a perl_interpreter - contains the thread's stash and stacks */ struct _perl_thread { perl_interpreter *threads_interp; OP *pc; SV *sp; HV *thread_stash; void *save_stack; RE_context *RE_data; perl_parser_state *parser; ... }; /* creates an interpreter */ perl_interpreter * perl_create_interp(int flags); /* ... ways of calling in (parse command line, call code, etc.)... */ /* destroy and free an interpreter */ void perl_delete_interp(perl_interpreter *); /* the embedder is expected to provide the following */ /* get this OS thread's current perl_thread */ perl_thread* perl_fetch_thread(); /* set this OS thread's current perl_thread (called in Thread->new &co.) 
*/ perl_thread* perl_set_thread(); /* get the perl_thread who will handle signals */ perl_thread* perl_get_sig_thread(); The idea behind the perl_interpreter/perl_thread separation is that perl6 internal calls will actually pass a perl_thread * around, since that is the basic unit of execution, and if bytecode/optree is to be shared between threads (as I devoutly hope it will be), there needs to be something to aggregate a group of perl_threads. To go back to parser API design, I think that perl6_parse_perl should take a perl_thread* to provide context for sub {} declarations, parse errors, &co. Top-level code would be treated as either the top-level script, or an eval'', depending on the flags. -- BKS
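For what it's worth, here is one way an embedder might supply the "current perl_thread for this OS thread" pair, sketched with C11 thread-local storage. This is an assumption-laden illustration, not part of the proposal: the post declares perl_set_thread() without arguments, so the "takes the new thread, returns the previous one" signature is a guess, and the struct body is a placeholder.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder body; the proposal keeps the stash, stacks, parser state
 * and so on in here. */
typedef struct _perl_thread perl_thread;
struct _perl_thread {
    int id;
};

/* One slot per OS thread (C11 _Thread_local), so each OS thread sees
 * only the perl_thread that was installed on it. */
static _Thread_local perl_thread *current_thread = NULL;

/* Get this OS thread's current perl_thread (NULL if none was set). */
perl_thread *perl_fetch_thread(void)
{
    return current_thread;
}

/* Assumed signature: install 't' as this OS thread's perl_thread and
 * return whatever was installed before. */
perl_thread *perl_set_thread(perl_thread *t)
{
    perl_thread *previous = current_thread;
    current_thread = t;
    return previous;
}
```

Returning the previous value lets a caller temporarily switch perl_threads and restore the old one afterwards, which seems handy for embedders that multiplex several perl_threads onto one OS thread.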
Re: The external interface for the parser piece
Dan Sugalski wrote: > > Sure. Suggestions? int perl6_parse(PerlInterp* interp, PerlIO* input); PerlIO* make_memory_stream(char* buf, ssize_t length); // length=-1 for nul-terminated int close_stream(PerlIO* stream); then if you read further, you'll eventually see: PerlIO* make_callback_stream(int (*f)(char* buf, int space, void* other), void* other); Or maybe the first thing you see is just: int perl6_parse_string(PerlInterp* interp, char* buf, ssize_t length); // length=-1 for nul-term if that really is 95% of the cases. I guess I just think that when discussing perl6_parse, it's less effort to mention the existence of make_memory_stream and close_stream than it is to explain the meaning of three mystery parameters. Especially if that knowledge can be reused for half a dozen other API calls. perl6_Scalar* perl6_eval_scalar(PerlInterp*,PerlIO*); perl6_List* perl6_eval_list(PerlInterp*,PerlIO*); int perl6_load_module(PerlInterp*, PerlIO*); // Checks magic number for .pm vs .pmc, BOM, gzip...
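A rough sketch of the "length=-1 for nul-terminated" convention from make_memory_stream() above. It does not use the real PerlIO: `mem_stream`, `make_mem_stream()` and `mem_stream_read()` are toy stand-ins invented here, and `long` stands in for ssize_t to keep the example portable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a memory-backed PerlIO*; not the real thing. */
typedef struct {
    const char *buf;   /* caller-owned source text */
    size_t len;        /* number of valid bytes in buf */
    size_t pos;        /* read position */
} mem_stream;

/* A negative length means "buf is nul-terminated, measure it yourself";
 * otherwise length is taken as given, so embedded nuls are allowed. */
mem_stream make_mem_stream(const char *buf, long length)
{
    mem_stream s;

    assert(buf != NULL);
    s.buf = buf;
    s.len = (length < 0) ? strlen(buf) : (size_t)length;
    s.pos = 0;
    return s;
}

/* Copy up to 'space' bytes into 'out'; returns the number copied,
 * which is 0 once the stream is exhausted. */
size_t mem_stream_read(mem_stream *s, char *out, size_t space)
{
    size_t left = s->len - s->pos;
    size_t n = (left < space) ? left : space;

    memcpy(out, s->buf + s->pos, n);
    s->pos += n;
    return n;
}
```

The explicit-length form is what makes source containing nul bytes workable; the -1 form is just the convenience path for plain C strings.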
Re: The external interface for the parser piece
In message <[EMAIL PROTECTED]> Dan Sugalski <[EMAIL PROTECTED]> wrote: > Right, and I called my abstract stream "void *source". :) It isn't really abstract though, as it only understands the types of stream that the parser author had thought of. An abstract stream would have a vtable or something so that the parser didn't have to know anything about where the data was coming from, thus decoupling the parser more from the text it is parsing. It would also be typesafe. > That means another function in the API. I suppose perl_parse_string() and > perl_parse_file() are valid options. I'd rather keep the API that embedders > will be using as small as possible, but two functions with simple names and > parameters may be better than one function with mildly odd parameters in the > non-trivial case. I would probably suggest something like this: int perl_parse(PerlInterp *interp, PerlStream *source) { ... } int perl_parse_string(PerlInterp *interp, const char *source) { PerlStream *stream = new_string_stream(source); return perl_parse(interp, stream); } int perl_parse_file(PerlInterp *interp, const char *filename) { PerlStream *stream = new_file_stream(filename, "r"); return perl_parse(interp, stream); } You are of course quite right that it adds functions to the API, but is the number of functions in the API critical? I would have thought that the above provides a good trade-off between simplicity for most people and power for those that need it, whilst still maintaining type safety and maximum extensibility for things we haven't thought of yet. We might also still want a flags word for each of those routines for things like your UTF8 flag. Tom -- Tom Hughes ([EMAIL PROTECTED]) http://www.compton.nu/ ...F u cn rd ths u cnt spl wrth a dm!
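To make the "vtable or something" concrete, a small sketch of what such a stream could look like. Only the name PerlStream comes from the message above; the layout, the single read() slot, `init_string_stream()` and the `count_lines()` consumer (a stand-in for the parser) are assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct PerlStream PerlStream;

/* Operations every stream type has to supply. */
struct PerlStreamVtbl {
    /* Copy up to 'space' bytes into 'out'; return the count, 0 at end. */
    size_t (*read)(PerlStream *s, char *out, size_t space);
};

struct PerlStream {
    const struct PerlStreamVtbl *vtbl;
    void *state;                 /* private to the stream type */
};

/* One concrete stream type: a nul-terminated string in memory. */
struct string_state {
    const char *next;            /* next unread byte */
};

static size_t string_read(PerlStream *s, char *out, size_t space)
{
    struct string_state *st = s->state;
    size_t n = 0;

    while (n < space && *st->next != '\0')
        out[n++] = *st->next++;
    return n;
}

static const struct PerlStreamVtbl string_vtbl = { string_read };

/* Wire a caller-provided stream and state up to the string vtable. */
void init_string_stream(PerlStream *s, struct string_state *st,
                        const char *src)
{
    assert(s != NULL && st != NULL && src != NULL);
    st->next = src;
    s->vtbl = &string_vtbl;
    s->state = st;
}

/* Stand-in for the parser: it counts newlines and only ever touches the
 * stream through its vtable, so it cannot tell where the text lives. */
size_t count_lines(PerlStream *s)
{
    char buf[4];                 /* deliberately tiny: forces several reads */
    size_t lines = 0, n, i;

    while ((n = s->vtbl->read(s, buf, sizeof buf)) > 0)
        for (i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;
    return lines;
}
```

A file, socket or gzip stream would be another vtable plus its own state struct; count_lines() (or the real parser) would not change.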
Re: Backtracking through the source
In message <[EMAIL PROTECTED]> Simon Cozens <[EMAIL PROTECTED]> wrote: > Parsing Perl is not easy. :) You can say that again ;-) > At some points, you have to say, well, heck, I don't *know* what this token > is. At the moment, perl guesses, and it guesses reasonably well. But > guessing something wrongly which you could have got right if you'd read the > next line strikes me as a little anti-DWIM. Quite likely you're right. I can't say I have much experience of parsers that do this but we can always blaze a new trail in our efforts to parse perl. > In a sense, though, you're right; this is a general problem. I'm currently > trying to work out a design for a tokeniser, and it seems to me that > there's going to be a lot of communicating of "hints" between the > tokeniser, the lexer and the parser. You have to be very careful about downward communication from the parser to the lexer if there's any lookahead involved as you can find that you're trying to affect the lexing of tokens which are already in the lookahead buffer of the parser. Backtracking may well be better than lookahead here as you can always jump back a bit after you change the lexer's state ;-) > Parsing Perl is hard. Trust me. :) Oh, you did say it again... Parsing Fortran is fun as well. Whoever decided to allow spaces in identifiers needs their head read... Tom -- Tom Hughes ([EMAIL PROTECTED]) http://www.compton.nu/ ...Who's on first?
Re: Backtracking through the source
Tom Hughes wrote: > > In message <[EMAIL PROTECTED]> > Simon Cozens <[EMAIL PROTECTED]> wrote: > > > In a sense, though, you're right; this is a general problem. I'm currently > > trying to work out a design for a tokeniser, and it seems to me that > > there's going to be a lot of communicating of "hints" between the > > tokeniser, the lexer and the parser. > > You have to be very careful about downward communication from the > parser to the lexer if there's any lookahead involved as you can > find that you're trying to affect the lexing of tokens which are > already in the lookahead buffer of the parser. > > Backtracking may well be better than lookahead here as you can always > jump back a bit after you change the lexer's state ;-) The difference may be more illusory than real. In either case, you have to undo something: your recursion stack for backtracking, some parser state for lookahead. And in both cases, undoing those things is a hell of a lot easier than undoing the code that was run when you recognized (or thought you recognized) some chunk of tokens as an anonymous sub or whatever. For example, say you stuck an entry for the subroutine into the package's symbol table. You'd better have kept the original, in case you were wrong -- but you might not want to keep all originals, or you'll blow your memory. Perhaps we can avoid doing anything significant during parsing (when will BEGIN{} run?), but perhaps not. Handling the parser's state can be done in a backtracking DFA-like or a direct NFA-like way. The NFA way is to keep track of all possible parse states and advance each one in parallel based on the next token. The DFA way is recursive descent, backing out of blind alleys and trying again, keeping a single working hypothesis alive at a time. The DFA approach is probably easier to undo user code in, because in the NFA case you have to consider each token under the assumptions of all possible parses up to that point. 
The NFA case has the advantage that you never have to back up, so you can permanently forget about a token as soon as it whizzes by. Perl5 is parseable with a single token of lookahead and lots of parser/lexer communication. Sort of. It would be nice to prevent it from getting any worse. We could pretend to support full DWIMmery by telling the user when it fails: 10 print foo bar(); 13 sub foo { ... } DWIMmery badness 1: Sorry, but I screwed up by assuming 'foo' was a direct object in line 10, and only found out on line 13. Would you mind predeclaring 'foo' somewhere before line 10? ...but that would be weird. print foo bar(); eval "sub foo { $code }"; print foo bar();
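The "keep the original, in case you were wrong" point could look something like the sketch below: a small, bounded undo log for symbol-table guesses. Everything in it (`sym_kind`, `undo_log`, `guess_kind()`, `rollback()`, `commit()`, the limit of 8) is hypothetical and has nothing to do with how perl5 actually stores symbols; the bound is there only to illustrate the "don't keep all originals" worry.

```c
#include <assert.h>
#include <stddef.h>

enum sym_kind { SYM_UNKNOWN, SYM_FILEHANDLE, SYM_SUB };

struct symbol {
    const char *name;
    enum sym_kind kind;
};

#define UNDO_MAX 8   /* arbitrary cap so saved originals cannot pile up */

struct undo_log {
    struct {
        struct symbol *sym;
        enum sym_kind old_kind;
    } saved[UNDO_MAX];
    size_t n;        /* number of guesses currently recorded */
};

/* Remember the symbol's current kind, then apply the guess.
 * Returns 0 on success, or -1 (guess not applied) if the log is full. */
int guess_kind(struct undo_log *log, struct symbol *sym, enum sym_kind guess)
{
    assert(log != NULL && sym != NULL);
    if (log->n == UNDO_MAX)
        return -1;
    log->saved[log->n].sym = sym;
    log->saved[log->n].old_kind = sym->kind;
    log->n++;
    sym->kind = guess;
    return 0;
}

/* The guesses were wrong: restore every saved original, newest first. */
void rollback(struct undo_log *log)
{
    while (log->n > 0) {
        log->n--;
        log->saved[log->n].sym->kind = log->saved[log->n].old_kind;
    }
}

/* The guesses were right: forget the saved originals. */
void commit(struct undo_log *log)
{
    log->n = 0;
}
```

For the `print foo bar();` case, the parser would guess SYM_FILEHANDLE for foo, and roll back if it later meets `sub foo`. What this does not solve is undoing user code that already ran, which is the harder half of the problem.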
Re: The external interface for the parser piece
On Tue, 28 Nov 2000, Dan Sugalski wrote: > >I also like the suggestion that rather than supply flags, we should > >follow the lead and supply a Perl* something that would return an > >appropriate bunch of text to the parser. > > I'd really rather not, since that would place the burden of knowing too > much about the guts of perl on whoever's using it. I don't want to do that. You're going to need knowledge in either case - whether you're directly setting flags or have a PerlFlags object (with its own limited interface, I suppose) that you pass in. The advantage of the object is that you aren't limited to just flags down the road, which may cut down on the number of overall API calls that exist. Of course, if you've got a couple dozen actual flags, you may want to combine the two: PerlFlags *flags_and_such; PAPI_set_flags(flags_and_such, PL_DONT_CRASH | PL_RUN_FASTER | PL_DWIM); PAPI_set_malloc_arena(flags_and_such, *malloc_func, *arena); return_code = perl6_parse(interp, source, flags_and_such, NULLP); -- Bryan C. Warnock bwarnock@(gtemail.net|capita.com)
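Fleshing out the combined flags-plus-object idea a little, as a sketch only. The PAPI_* names and the joke flags are the ones from the example above, but the struct layout, the OR-into-existing-flags behaviour and `PAPI_init_flags()` are assumptions added for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Placeholder flag bits, named after the (joke) flags in the example. */
enum {
    PL_DONT_CRASH = 1 << 0,
    PL_RUN_FASTER = 1 << 1,
    PL_DWIM       = 1 << 2
};

/* Options object: plain flags plus things that do not fit in a bitmask. */
typedef struct {
    unsigned flags;
    void *(*malloc_func)(size_t);   /* allocator to use, or NULL for default */
    void *arena;                    /* opaque arena handed to the allocator */
} PerlFlags;

/* Hypothetical: put the object into a known "all defaults" state. */
void PAPI_init_flags(PerlFlags *f)
{
    assert(f != NULL);
    f->flags = 0;
    f->malloc_func = NULL;
    f->arena = NULL;
}

/* Turn on the given flag bits, leaving any already-set bits alone. */
void PAPI_set_flags(PerlFlags *f, unsigned flags)
{
    assert(f != NULL);
    f->flags |= flags;
}

/* Record a custom allocator and its arena. */
void PAPI_set_malloc_arena(PerlFlags *f, void *(*fn)(size_t), void *arena)
{
    assert(f != NULL);
    f->malloc_func = fn;
    f->arena = arena;
}
```

The point of going through setters is that new kinds of option can be added later without touching the signature of perl6_parse() or adding another top-level call per option.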