Re: SvPV*

2000-11-28 Thread Nick Ing-Simmons

Dave Storrs <[EMAIL PROTECTED]> writes:
>On Tue, 21 Nov 2000, Jarkko Hietaniemi wrote:
>
>> Yet another bummer of the current SVs is that they poorly fit into
>> 'foreign memory' situations where the buffer is managed by something
>> else than Perl.  "No, thank you, Perl, keep your greedy fingers off
>> this chunk.  No, you may not play with it."
>
>
>   Out of curiosity, when might such a situation arise?  When you
>are embedding C in Perl, perhaps?

Or calling an external library which returns a pointer to data.
Right now we _have_ to copy it as there is no way to tell perl 
to (say) XFree() it rather than Safefree() it. Which is a pain when data
is big.

-- 
Nick Ing-Simmons <[EMAIL PROTECTED]>
Via, but not speaking for: Texas Instruments Ltd.




Re: SvPV*

2000-11-28 Thread David Mitchell

Nick Ing-Simmons <[EMAIL PROTECTED]> wrote:
> Dave Storrs <[EMAIL PROTECTED]> writes:
> >On Tue, 21 Nov 2000, Jarkko Hietaniemi wrote:
> >
> >> Yet another bummer of the current SVs is that they poorly fit into
> >> 'foreign memory' situations where the buffer is managed by something
> >> else than Perl.  "No, thank you, Perl, keep your greedy fingers off
> >> this chunk.  No, you may not play with it."
> >
> >
> > Out of curiosity, when might such a situation arise?  When you
> >are embedding C in Perl, perhaps?
> 
> Or calling an external library which returns a pointer to data.
> Right now we _have_ to copy it as there is no way to tell perl 
> to (say) XFree() it rather than Safefree() it. Which is a pain when data
> is big.

As long as destroy() is one of the vtable methods, then it should be
fairly easy for someone to write an SV wrapper type that calls a specific
free() - either fixed per type, or per SV.
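A minimal C sketch of that idea, assuming a made-up foreign_sv wrapper rather than Perl's real SV layout: the vtable carries a destroy() slot, and each wrapper also stores its own release routine, so a buffer obtained from an external library can be handed back to that library instead of being copied or passed to Safefree(). Every name below (foreign_sv, sv_vtable, demo_release and so on) is hypothetical; demo_release only stands in for something like XFree().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical types; this is NOT Perl's real SV layout. */
typedef struct foreign_sv foreign_sv;

typedef struct {
    void (*destroy)(foreign_sv *sv);    /* called when the value dies */
} sv_vtable;

struct foreign_sv {
    const sv_vtable *vtable;            /* behaviour fixed per type */
    void *buf;                          /* buffer owned by someone else */
    size_t len;
    void (*release)(void *buf);         /* free routine chosen per SV */
};

/* Stand-in for a library routine such as XFree(); counts its calls. */
int demo_release_calls = 0;
void demo_release(void *buf)
{
    demo_release_calls++;
    free(buf);
}

/* destroy() for the wrapper type: give the buffer back to its owner. */
void foreign_destroy(foreign_sv *sv)
{
    if (sv->release != NULL && sv->buf != NULL)
        sv->release(sv->buf);
    sv->buf = NULL;
    sv->len = 0;
}

const sv_vtable foreign_vtable = { foreign_destroy };

/* Wrap an externally allocated buffer without copying it. */
foreign_sv *foreign_sv_new(void *buf, size_t len, void (*release)(void *))
{
    foreign_sv *sv = malloc(sizeof *sv);
    if (sv == NULL)
        return NULL;
    sv->vtable = &foreign_vtable;
    sv->buf = buf;
    sv->len = len;
    sv->release = release;
    return sv;
}

/* Generic teardown: dispatch through the vtable, then free the wrapper. */
void sv_free(foreign_sv *sv)
{
    assert(sv != NULL && sv->vtable != NULL);
    sv->vtable->destroy(sv);
    free(sv);
}
```

Storing the release routine in each wrapper covers the "per SV" case; a type whose buffers always come from one library could hard-code the call inside its destroy() instead.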

Roll on perl6 :-)

Dave M.




Re: Backtracking through the source

2000-11-28 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
  Simon Cozens <[EMAIL PROTECTED]> wrote:

> I doubt it; I get the feeling that what Dan is talking about is infinite
> look-*behind*. Nine times out of ten, you won't need to redo your parsing,
> so having an infinite lookahead will just slow everything down.

I didn't say that having infinite lookahead was better than allowing
backtracking. I simply said that the two were equivalent and that any
problem that can be solved by one can be solved by the other.

> sub bar { ... }
> print foo bar();
>
> Now, having parsed this far, we know that foo is a filehandle that we're
> printing to, so we build up our op tree to print the results of calling bar()
> to a filehandle called foo; ooh, but what do we see now:
>
> sub foo { ... }
>
> Eek, foo was actually a subroutine, and we mean print(foo(bar())); need to
> redo our parse tree. That's when the lookbehind comes into play.

That's quite a nasty example for a number of reasons. Firstly you
might have to back up and reparse a very large amount of code as the
subroutine definition could be a very long way away from the print
statement.

Secondly in order to know that you needed to back up you'd have to
remember that you had had to guess that foo was a filehandle but
that it might also be a subroutine and it raises a whole series of
questions about what other similar things you might need to remember.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
...Would you buy a Pontiac from this, er, man?




Re: The external interface for the parser piece

2000-11-28 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
  Dan Sugalski <[EMAIL PROTECTED]> wrote:

> The third parameter is the flags parameter, and it's optional. If omitted
> or set to PERL_CHAR_SOURCE, the second parameter is treated as a standard
> null-terminated string. If set to PERL_COUNTED_SOURCE, the second parameter
> is treated as if it points to a stream of bytes, where the first four are
> the length of the source to be read followed by the source. If set to
> PERL_FILE_SOURCE it's assumed to be a FILE *, while if set to
> PERL_GENERATED_SOURCE it's assumed to be a pointer to a function that
> returns a char pointer. If it's OR'd with PERL_UTF8_SOURCE then the stream
> is assumed to be in UTF-8 format instead of platform native.

This all seems a bit horrible to me. That kind of overloading of
multiple meanings onto an argument is often a sign of a bad design
that could be improved.

Applying the maxim that any software design problem can be solved
with sufficient levels of abstraction I'd suggest that passing some
sort of abstract stream pointer would be better. Then there could
be different sorts of streams that provided the source from a string
or a file or whatever other wonderful data source somebody comes up
with.

The common case of parsing a string could of course be simplified
with a small wrapper function that created a string based stream
and then called the main parser entry point.
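One way the abstract stream plus wrapper could look in C, as a rough sketch only. The names src_stream, string_read, parse_stream and parse_string are invented for illustration and are not part of any proposed Perl API; parse_stream is a stand-in that merely counts the bytes it pulls from the stream.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical abstract source stream. */
typedef struct src_stream {
    /* Copy up to max bytes into buf; return bytes copied, 0 at end. */
    size_t (*read)(struct src_stream *self, char *buf, size_t max);
    void *state;                 /* data private to the implementation */
} src_stream;

/* Private state for a stream backed by an in-memory string. */
typedef struct {
    const char *data;
    size_t len;
    size_t pos;
} string_state;

size_t string_read(src_stream *self, char *buf, size_t max)
{
    string_state *st = self->state;
    size_t left = st->len - st->pos;
    size_t n = left < max ? left : max;
    memcpy(buf, st->data + st->pos, n);
    st->pos += n;
    return n;
}

/* Stand-in for the main parser entry point: it only drains the
 * stream and reports how many bytes it saw. */
size_t parse_stream(src_stream *in)
{
    char buf[8];
    size_t total = 0, n;
    while ((n = in->read(in, buf, sizeof buf)) > 0)
        total += n;
    return total;
}

/* Small wrapper for the common case of parsing a C string. */
size_t parse_string(const char *source)
{
    string_state st = { source, strlen(source), 0 };
    src_stream in = { string_read, &st };
    return parse_stream(&in);
}
```

A file-backed or generator-backed stream would supply a different read function and state, and the parser entry point would not need to change.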

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
...A private sin is not so prejudicial in the world as a public indecency.




Re: The external interface for the parser piece

2000-11-28 Thread Jarkko Hietaniemi

On Tue, Nov 28, 2000 at 07:03:49PM +, Tom Hughes wrote:
> In message <[EMAIL PROTECTED]>
>   Dan Sugalski <[EMAIL PROTECTED]> wrote:
> 
> > The third parameter is the flags parameter, and it's optional. If omitted
> > or set to PERL_CHAR_SOURCE, the second parameter is treated as a standard
> > null-terminated string. If set to PERL_COUNTED_SOURCE, the second parameter
> > is treated as if it points to a stream of bytes, where the first four are
> > the length of the source to be read followed by the source. If set to
> > PERL_FILE_SOURCE it's assumed to be a FILE *, while if set to
> > PERL_GENERATED_SOURCE it's assumed to be a pointer to a function that
> > returns a char pointer. If it's OR'd with PERL_UTF8_SOURCE then the stream
> > is assumed to be in UTF-8 format instead of platform native.
> 
> This all seems a bit horrible to me. That kind of overloading of
> multiple meanings onto an argument is often a sign of a bad design
> that could be improved.

Agreed.

> Applying the maxim that any software design problem can be solved
> with sufficient levels of abstraction I'd suggest that passing some

A related warning sign is trying to cram different semantic levels or
types into same data.  (C's "string model" being perhaps the most
obvious example, getchar() having to be an int is another, "0 but true"
a third...I want a "1 but false" :-)

> sort of abstract stream pointer would be better. Then there could
> be different sorts of streams that provided the source from a string
> or a file or whatever other wonderful data source somebody comes up
> with.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: Backtracking through the source

2000-11-28 Thread Simon Cozens

On Tue, Nov 28, 2000 at 06:58:57PM +, Tom Hughes wrote:
> I didn't say that having infinite lookahead was better than allowing
> backtracking. I simply said that the two were equivalent and that any
> problem that can be solved by one can be solved by the other.

Fair enough.

> That's quite a nasty example for a number of reasons. Firstly you
> might have to back up and reparse a very large amount of code as the
> subroutine definition could be a very long way away from the print
> statement.

You wouldn't have to reparse it all. You'd have to insert the new information
into the parse and see how that changes things. It'd probably only change a
very localised area, a single statement per occurrence at most.
 
> Secondly in order to know that you needed to back up you'd have to
> remember that you had had to guess that foo was a filehandle but
> that it might also be a subroutine and it raises a whole series of
> questions about what other similar things you might need to remember.
 
Parsing Perl is not easy. :) At some points, you have to say, well, heck, I
don't *know* what this token is. At the moment, perl guesses, and it guesses
reasonably well. But guessing something wrongly which you could have got right
if you'd read the next line strikes me as a little anti-DWIM. 

In a sense, though, you're right; this is a general problem. I'm currently
trying to work out a design for a tokeniser, and it seems to me that there's
going to be a lot of communicating of "hints" between the tokeniser, the lexer
and the parser. 

The other alternative is to completely conflate the three, which would work
but I think people would lose their minds.

Take, for instance:

${function($value)}[$val]

Now, how on earth do I split this into tokens? Do I say:

/${/ - and expect some stuff which will resolve to a variable name or
   array reference, followed by a }

If we go that way, we're passing lots of hints to both the lexer and the
parser.

/${[^}]+}/ and then /\[[^]]+\]/

If we do that, we have to keep state between the two tokens so that we don't
make [$val] into a reference constructor and stuff up the parser.

/^${([^}]+)}\[([^\]]+)\]/ - Dereference $1 as an array, take value $2.

If we do *that*, then we're already being tokeniser, lexer and parser rolled
into one.

Parsing Perl is hard. Trust me. :)

-- 
MISTAKES:
It Could Be That The Purpose Of Your Life Is Only To Serve As
A Warning To Others

http://www.despair.com



Re: Backtracking through the source

2000-11-28 Thread Dan Sugalski

At 06:58 PM 11/28/00 +, Tom Hughes wrote:
>In message <[EMAIL PROTECTED]>
>   Simon Cozens <[EMAIL PROTECTED]> wrote:
>
> > I doubt it; I get the feeling that what Dan is talking about is infinite
> > look-*behind*. Nine times out of ten, you won't need to redo your parsing,
> > so having an infinite lookahead will just slow everything down.
>
>I didn't say that having infinite lookahead was better than allowing
>backtracking. I simply said that the two were equivalent and that any
>problem that can be solved by one can be solved by the other.

The big reason I was thinking about this is because I'd like to be able to 
chop pieces off the front of the 'to be parsed' stream, so regexes can 
start with ^  instead of something else, but that's a minor issue, and one 
that's likely to be caught early-on in testing--I can't see us sending out 
a production release of perl with a whoops like that in the parser source.

> > sub bar { ... }
> > print foo bar();
> >
> > Now, having parsed this far, we know that foo is a filehandle that we're
> > printing to, so we build up our op tree to print the results of calling bar()
> > to a filehandle called foo; ooh, but what do we see now:
> >
> > sub foo { ... }
> >
> > Eek, foo was actually a subroutine, and we mean print(foo(bar())); need to
> > redo our parse tree. That's when the lookbehind comes into play.
>
>That's quite a nasty example for a number of reasons. Firstly you
>might have to back up and reparse a very large amount of code as the
>subroutine definition could be a very long way away from the print
>statement.

Luckily that particular case can probably be dealt with without reparsing 
source--we'd just have to tromp back through the syntax tree and change the 
attribute of a node somewhere. (Or leave it until runtime, I suppose)

It's also something that will be dealt with by the bytecode compiler rather 
than the parser, and it'll have a fully-enough parsed program to deal with 
things appropriately.

(I suppose it would mean that if someone did this in a BEGIN block that the 
code in the BEGIN would treat foo as a filehandle but the rest of the 
program would treat it as a sub call, which could be an issue)

>Secondly in order to know that you needed to back up you'd have to
>remember that you had had to guess that foo was a filehandle but
>that it might also be a subroutine and it raises a whole series of
>questions about what other similar things you might need to remember.

Well, we do have the syntax tree, and can make whatever notes we want in 
the stash of the interpreter we're dealing with. Maybe an "indeterminate" 
on the foo slot or something.
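A toy C sketch of the "make a note and fix it up later" approach, under the assumption that each guessed bareword gets a syntax-tree node tagged as indeterminate and chained onto a pending list. The types and functions (node, parse_notes, note_guess, resolve_sub, finish_guesses) are hypothetical and far simpler than a real stash or op tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { SYM_INDETERMINATE, SYM_FILEHANDLE, SYM_SUB } sym_kind;

/* Hypothetical syntax-tree node for a bareword after print. */
typedef struct node {
    const char *name;
    sym_kind kind;
    struct node *next_pending;   /* chain of still-unresolved guesses */
} node;

typedef struct {
    node *pending;               /* nodes whose meaning was guessed */
} parse_notes;

/* Record that we had to guess what this bareword means. */
void note_guess(parse_notes *notes, node *n, const char *name)
{
    n->name = name;
    n->kind = SYM_INDETERMINATE;
    n->next_pending = notes->pending;
    notes->pending = n;
}

/* On seeing "sub NAME": retag matching nodes in place, with no
 * reparse of source. Returns the number of nodes changed. */
int resolve_sub(parse_notes *notes, const char *name)
{
    int changed = 0;
    node **link = &notes->pending;
    while (*link != NULL) {
        node *n = *link;
        if (strcmp(n->name, name) == 0) {
            n->kind = SYM_SUB;
            *link = n->next_pending;     /* unlink from pending list */
            n->next_pending = NULL;
            changed++;
        } else {
            link = &n->next_pending;
        }
    }
    return changed;
}

/* At end of input, whatever is still pending keeps the filehandle
 * reading. Returns the number of nodes settled this way. */
int finish_guesses(parse_notes *notes)
{
    int settled = 0;
    while (notes->pending != NULL) {
        node *n = notes->pending;
        notes->pending = n->next_pending;
        n->next_pending = NULL;
        n->kind = SYM_FILEHANDLE;
        settled++;
    }
    return settled;
}
```

This only changes an attribute on existing nodes; it does not address the BEGIN-block ordering concern raised above.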

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: The external interface for the parser piece

2000-11-28 Thread Steve Fink

Dan Sugalski wrote:
> 
>    int perl6_parse(PerlInterp *interp,
>                    void *source,
>                    int flags,
>                    void *extra_pointer);

Given that other things may want to be streamable in similar fashion (eg
the regular expression engine), why not have a PerlDataSource union or
somesuch that encapsulates all of the possibilities of the final three
arguments? Or put all possibilities into a PerlIO*? That gives direct
support for compressed source, source streamed over a network socket,
etc., with a more common framework than PERL_GENERATED_SOURCE.

Things like PERL_CHAR_SOURCE meaning nul-terminated char* sound
unnecessarily specific.

Also, you gave two options: nul-terminated and length-first. What about
a "chunked" encoding, where you get multiple length-first chunks of the
input (as in HTTP/1.1's Transfer-Encoding: chunked, for one example of
many)? Or are nuls explicitly forbidden in source code?

And, in a related question, the above interface appears that you call
perl6_parse once. Will this be good enough, or do you want to have a
PerlParseState* in/out parameter that allows restarting a parse once you
get more of the input available? (With this, you don't need an explicit
chunked encoding, since the caller can deal with that without being
required to buffer the whole thing in memory before calling
perl6_parse.) Or would that go into the PerlInterp too?
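To make the restartable-parse idea concrete, here is a deliberately tiny C sketch. It assumes a hypothetical parse_state that the caller keeps between calls, and a toy "grammar" in which a statement is anything terminated by a semicolon. Nothing here reflects a real perl6_parse design; the point is only that state carried across calls lets the caller feed input in chunks, each with an explicit length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state carried between calls by the caller. */
typedef struct {
    size_t statements;    /* complete statements seen so far */
    int in_statement;     /* nonzero while a statement is still open */
} parse_state;

void parse_state_init(parse_state *ps)
{
    ps->statements = 0;
    ps->in_statement = 0;
}

/* Feed one chunk of input. The length is explicit, so a chunk may
 * contain nul bytes. Returns 1 if the parse stopped in the middle of
 * a statement (more input needed), 0 if it stopped at a boundary. */
int parse_chunk(parse_state *ps, const char *chunk, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        char c = chunk[i];
        if (c == ';') {
            if (ps->in_statement)
                ps->statements++;
            ps->in_statement = 0;
        } else if (c != ' ' && c != '\n' && c != '\t') {
            ps->in_statement = 1;
        }
    }
    return ps->in_statement;
}
```

With this shape the caller can handle chunked transfer itself and never has to buffer the whole source before the first call.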

And finally, how do I get the output out of the PerlInterp? Is it stored
under some variable name, or does the PerlInterp start out empty and
gains the parsed syntax tree as its only syntax tree, or ? (The latter
sounds messy if the PerlInterp is also running code, code that wants to
call some standard utility functions implemented in Perl.) Maybe I'm not
making sense.



Re: The external interface for the parser piece

2000-11-28 Thread Tim Jenness

On Mon, 27 Nov 2000, Dan Sugalski wrote:

> ---
> 
>    int perl6_parse(PerlInterp *interp,
>                    void *source,
>                    int flags,
>                    void *extra_pointer);
> 

> The third parameter is the flags parameter, and it's optional. If omitted 
> or set to PERL_CHAR_SOURCE, the second parameter is treated as a standard 
> null-terminated string. If set to PERL_COUNTED_SOURCE, the second parameter 
> is treated as if it points to a stream of bytes, where the first four are 
> the length of the source to be read followed by the source. If set to 

Since you have a fourth argument couldn't that be used for the length
of the byte stream rather than embedding that length into the byte stream
itself? Makes more sense to me to separate the bytes from the length.


-- 
Tim Jenness
JCMT software engineer/Support scientist
http://www.jach.hawaii.edu/~timj





Re: The external interface for the parser piece

2000-11-28 Thread Dan Sugalski

At 07:03 PM 11/28/00 +, Tom Hughes wrote:
>In message <[EMAIL PROTECTED]>
>   Dan Sugalski <[EMAIL PROTECTED]> wrote:
>
> > The third parameter is the flags parameter, and it's optional. If omitted
> > or set to PERL_CHAR_SOURCE, the second parameter is treated as a standard
> > null-terminated string. If set to PERL_COUNTED_SOURCE, the second parameter
> > is treated as if it points to a stream of bytes, where the first four are
> > the length of the source to be read followed by the source. If set to
> > PERL_FILE_SOURCE it's assumed to be a FILE *, while if set to
> > PERL_GENERATED_SOURCE it's assumed to be a pointer to a function that
> > returns a char pointer. If it's OR'd with PERL_UTF8_SOURCE then the stream
> > is assumed to be in UTF-8 format instead of platform native.
>
>This all seems a bit horrible to me. That kind of overloading of
>multiple meanings onto an argument is often a sign of a bad design
>that could be improved.

Sure, that's distinctly possible. I'm shooting for extreme simplicity in 
the standard case here, but that doesn't mean I'm hitting it. (Or anything 
else for that matter)

>Applying the maxim that any software design problem can be solved
>with sufficient levels of abstraction I'd suggest that passing some
>sort of abstract stream pointer would be better. Then there could
>be different sorts of streams that provided the source from a string
>or a file or whatever other wonderful data source somebody comes up
>with.

Right, and I called my abstract stream "void *source". :)

>The common case of parsing a string could of course be simplified
>with a small wrapper function that created a string based stream
>and then called the main parser entry point.

That means another function in the API. I suppose perl_parse_string() and 
perl_parse_file() are valid options. I'd rather keep the API that embedders 
will be using as small as possible, but two functions with simple names and 
parameters may be better than one function with mildly odd parameters in the 
non-trivial case.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: The external interface for the parser piece

2000-11-28 Thread Dan Sugalski

At 01:25 PM 11/28/00 -0600, Jarkko Hietaniemi wrote:
>On Tue, Nov 28, 2000 at 07:03:49PM +, Tom Hughes wrote:
> > Applying the maxim that any software design problem can be solved
> > with sufficient levels of abstraction I'd suggest that passing some
>
>A related warning sign is trying to cram different semantic levels or
>types into same data.  (C's "string model" being perhaps the most
>obvious example, getchar() having to be an int is another, "0 but true"
>a third...I want a "1 but false" :-)

Which ways is that one being violated? (I can think of a couple 
personally... :)

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: The external interface for the parser piece

2000-11-28 Thread Dan Sugalski

At 09:10 AM 11/28/00 -1000, Tim Jenness wrote:
>On Mon, 27 Nov 2000, Dan Sugalski wrote:
>
> > ---
> >
> >    int perl6_parse(PerlInterp *interp,
> >                    void *source,
> >                    int flags,
> >                    void *extra_pointer);
> >
>
> > The third parameter is the flags parameter, and it's optional. If omitted
> > or set to PERL_CHAR_SOURCE, the second parameter is treated as a standard
> > null-terminated string. If set to PERL_COUNTED_SOURCE, the second parameter
> > is treated as if it points to a stream of bytes, where the first four are
> > the length of the source to be read followed by the source. If set to
>
>Since you have a fourth argument couldn't that be used for the length
>of the byte stream rather than embedding that length into the byte stream
>itself? Makes more sense to me to separate the bytes from the length.

I'd rather the stream be self-contained, rather than needing an extra 
argument for the length. Counted strings aren't uncommon outside of C, and 
there's no reason a Fortran or COBOL (or Java, or...) program can't embed perl.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: The external interface for the parser piece

2000-11-28 Thread Nicholas Clark

On Tue, Nov 28, 2000 at 03:35:37PM -0500, Dan Sugalski wrote:
> > > is treated as if it points to a stream of bytes, where the first four are
   

I spy magic number.

> > > the length of the source to be read followed by the source. If set to
> >
> >Since you have a fourth argument couldn't that be used for the length
> >of the byte stream rather than embedding that length into the byte stream
> >itself? Makes more sense to me to separate the bytes from the length.
> 
> I'd rather the stream be self-contained, rather than needing an extra 
> argument for the length. Counted strings aren't uncommon outside of C, and 
> there's no reason a Fortran or COBOL (or Java, or...) program can't embed perl.


Why four? Surely that's imposing an arbitrary binary structure. If it's a
parameter then it's (probably) a machine register and certainly a "natural"
quantity for whatever's running the code (and automatically the correct
endian-ness just in case perl is running in some (oddball partial)
binary emulation environment). Erm. Or something like that.

I forget the source of the quote, but it was to the effect of
C is the only language where not just the binaries but also the source is
not portable.

Say you'd said 2 not 4.

struct counted_file {
  short count;
  struct {
    char bytes[1];
  } file;
};


erm. can't have bytes[0]; because that's not portable.
Can't really be short because who said that that was 2 bytes?
For that matter I know of one compiler which doesn't have any type
with sizeof 2, and sizeof (struct counted_file) is 8 here on this arm machine
:-) Weird but ANSI compliant alignment constraints.
[yes, I forced that one using the second struct inside the first]

Nicholas Clark



Re: To get things started...

2000-11-28 Thread Nick Ing-Simmons

Bart Lateur <[EMAIL PROTECTED]> writes:
>
>But what if you choose wrong, forgot a really important one, and this
>instruction gets a multibyte representation? We're stuck with it
>forever...?
>
>I have had some thoughts on "dynamic opcodes", where the meaning of
>opcode bytes needn't be fixed, but can be dynamically assigned,
>depending on how often they occur (for example). A bit like how a
>Huffman compressor may choose shorter representations for the most
>occurring byte patterns.

This is just like HW processor opcodes.  x86 has lasted so well
because the initial guess at the short/common opcodes was not too bad.
But the escape bytes are getting out of hand now...
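A small C sketch of the frequency-based assignment Bart describes, assuming a toy instruction set of four operations and a simple rule: rank operations by how often they occur and give the smallest code numbers to the most frequent. NUM_OPS and assign_codes are invented for illustration; a real scheme would also need to ship the mapping with the bytecode.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_OPS 4    /* size of the toy instruction set (an assumption) */

/* Fill code[op] so that the most frequently used op gets code 0, the
 * next most frequent gets code 1, and so on. Ties are broken in
 * favour of the lower op index. */
void assign_codes(const unsigned long count[NUM_OPS], unsigned code[NUM_OPS])
{
    int taken[NUM_OPS] = { 0 };
    unsigned next;

    for (next = 0; next < NUM_OPS; next++) {
        int best = -1;
        int op;
        for (op = 0; op < NUM_OPS; op++) {
            if (!taken[op] && (best < 0 || count[op] > count[best]))
                best = op;
        }
        taken[best] = 1;
        code[best] = next;
    }
}
```

Low code numbers would then map to the single-byte encodings, with the rarely used tail pushed behind an escape byte.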

-- 
Nick Ing-Simmons <[EMAIL PROTECTED]>
Via, but not speaking for: Texas Instruments Ltd.




Re: The external interface for the parser piece

2000-11-28 Thread Jarkko Hietaniemi

On Tue, Nov 28, 2000 at 03:34:22PM -0500, Dan Sugalski wrote:
> At 01:25 PM 11/28/00 -0600, Jarkko Hietaniemi wrote:
> >On Tue, Nov 28, 2000 at 07:03:49PM +, Tom Hughes wrote:
> > > Applying the maxim that any software design problem can be solved
> > > with sufficient levels of abstraction I'd suggest that passing some
> >
> >A related warning sign is trying to cram different semantic levels or
> >types into same data.  (C's "string model" being perhaps the most
> >obvious example, getchar() having to be an int is another, "0 but true"
> >a third...I want a "1 but false" :-)
> 
> Which ways is that one being violated? (I can think of a couple 
> personally... :)

Embedding the (fixed-length) length into the data.  As Nicholas points
out, that is naughty.  Remember:

sizeof(char) >= sizeof(short) >= sizeof(int) >= sizeof(long)
sizeof(char) == 1

(IIRC) are the only guarantees you get.  No structure alignment/padding
guarantees.  Let's pick a platform that would have difficulties:
Cray C-series (nowadays called SV-series, I think).  There's *no*
integer data type four bytes wide (or two bytes, for that matter).
It's either 1 (char), or 8.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: The external interface for the parser piece

2000-11-28 Thread Chaim Frenkel

Err, this seems a little too Swiss Army Knife. 

This reads like a utility function. (i.e. A function that handles the
most common scenario.)

Shouldn't a set of lower-level API functions be visible? One that seems
to pop out at me is some way of actually parsing a piece of code and
ending up with a handle on a syntax tree. And ways of adding and removing
these pieces.

These are abstract functions that would be needed on the interior of the
parser, but a bottom up approach may be more appropriate here.

I also like the suggestion that rather than supply flags, we should
follow the lead and supply a Perl* something that would return an
appropriate bunch of text to the parser.



> "DS" == Dan Sugalski <[EMAIL PROTECTED]> writes:

DS> While I'm not sure of the structure of the internals of the parsing piece 
DS> of perl at the moment (and, unfortunately, language parsers aren't one of 
DS> my strong points), I am reasonably certain of the interface we'll present 
DS> to the rest of the world and the other pieces of perl. So... comments?

DS> ---

DS>    int perl6_parse(PerlInterp *interp,
DS>                    void *source,
DS>                    int flags,
DS>                    void *extra_pointer);

DS> The first parameter is a pointer to a perl interpreter--this'll be used if 
DS> any code needs to be executed, as well as being a repository for any 
DS> variables that compiled code may set. (Standard stash stuff) The syntax 
DS> tree the parser generates will also be embedded here. (One fewer parameter 
DS> to deal with, and one fewer thing for an embedding program to track)

DS> The second parameter is a pointer to the source to be compiled. This is 
DS> generally a char pointer, but it may also be a FILE * or a pointer to a 
DS> function that returns a char pointer.

DS> The third parameter is the flags parameter, and it's optional. If omitted 
DS> or set to PERL_CHAR_SOURCE, the second parameter is treated as a standard 
DS> null-terminated string. If set to PERL_COUNTED_SOURCE, the second parameter 
DS> is treated as if it points to a stream of bytes, where the first four are 
DS> the length of the source to be read followed by the source. If set to 
DS> PERL_FILE_SOURCE it's assumed to be a FILE *, while if set to 
DS> PERL_GENERATED_SOURCE it's assumed to be a pointer to a function that 
DS> returns a char pointer. If it's OR'd with PERL_UTF8_SOURCE then the stream 
DS> is assumed to be in UTF-8 format instead of platform native.

DS> The fourth parameter is only used if the flags are set to 
DS> PERL_GENERATED_SOURCE, in which case it is passed back to the function 
DS> whose pointer we got as parameter two.

DS> Dan

DS> --"it's like this"---
DS> Dan Sugalski  even samurai
DS> [EMAIL PROTECTED] have teddy bears and even
DS>   teddy bears get drunk




-- 
Chaim Frenkel        Nonlinear Knowledge, Inc.
[EMAIL PROTECTED]   +1-718-236-0183



Re: The external interface for the parser piece

2000-11-28 Thread Jarkko Hietaniemi

On Tue, Nov 28, 2000 at 03:15:35PM -0600, Jarkko Hietaniemi wrote:
> On Tue, Nov 28, 2000 at 03:34:22PM -0500, Dan Sugalski wrote:
> > At 01:25 PM 11/28/00 -0600, Jarkko Hietaniemi wrote:
> > >On Tue, Nov 28, 2000 at 07:03:49PM +, Tom Hughes wrote:
> > > > Applying the maxim that any software design problem can be solved
> > > > with sufficient levels of abstraction I'd suggest that passing some
> > >
> > >A related warning sign is trying to cram different semantic levels or
> > >types into same data.  (C's "string model" being perhaps the most
> > >obvious example, getchar() having to be an int is another, "0 but true"
> > >a third...I want a "1 but false" :-)
> > 
> > Which ways is that one being violated? (I can think of a couple 
> > personally... :)
> 
> Embedding the (fixed-length) length into the data.  As Nicholas points
> out, that is naughty.  Remember:
> 
>   sizeof(char) >= sizeof(short) >= sizeof(int) >= sizeof(long)

I think it's time for me to go home for today.  Please reverse the >
signs as you read :-)

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: The external interface for the parser piece

2000-11-28 Thread Dan Sugalski

At 03:15 PM 11/28/00 -0600, Jarkko Hietaniemi wrote:
>On Tue, Nov 28, 2000 at 03:34:22PM -0500, Dan Sugalski wrote:
> > At 01:25 PM 11/28/00 -0600, Jarkko Hietaniemi wrote:
> > >On Tue, Nov 28, 2000 at 07:03:49PM +, Tom Hughes wrote:
> > > > Applying the maxim that any software design problem can be solved
> > > > with sufficient levels of abstraction I'd suggest that passing some
> > >
> > >A related warning sign is trying to cram different semantic levels or
> > >types into same data.  (C's "string model" being perhaps the most
> > >obvious example, getchar() having to be an int is another, "0 but true"
> > >a third...I want a "1 but false" :-)
> >
> > Which ways is that one being violated? (I can think of a couple
> > personally... :)
>
>Embedding the (fixed-length) length into the data.  As Nicholas points
>out, that is naughty.  Remember:
>
> sizeof(char) >= sizeof(short) >= sizeof(int) >= sizeof(long)
> sizeof(char) == 1
>
>(IIRC) are the only guarantees you get.  No structure alignment/padding
>guarantees.  Let's pick a platform that would have difficulties:
>Cray C-series (nowadays called SV-series, I think).  There's *no*
>integer data type four bytes wide (or two bytes, for that matter).
>It's either 1 (char), or 8.

There's always:

   length = (getc() * 256) + getc()) * 256) + getc()) * 256) + getc()

give or take a few parens...

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: The external interface for the parser piece

2000-11-28 Thread Dan Sugalski

At 09:05 PM 11/28/00 +, Nicholas Clark wrote:
>On Tue, Nov 28, 2000 at 03:35:37PM -0500, Dan Sugalski wrote:
> > > > is treated as if it points to a stream of bytes, where the first four are
>
>
>I spy magic number.

Nah. 32-bit length. If someone needs to pass us more than 4G of source 
code, I do *not* want to know about it. :)

> > > > the length of the source to be read followed by the source. If set to
> > >
> > >Since you have a fourth argument couldn't that be used for the length
> > >of the byte stream rather than embedding that length into the byte stream
> > >itself? Makes more sense to me to separate the bytes from the length.
> >
> > I'd rather the stream be self-contained, rather than needing an extra
> > argument for the length. Counted strings aren't uncommon outside of C, and
> > there's no reason a Fortran or COBOL (or Java, or...) program can't embed perl.
>
>
>Why four? Surely that's imposing an arbitrary binary structure. If it's a
>parameter then it's (probably) a machine register and certainly a "natural"
>quantity for whatever's running the code (and automatically the correct
>endian-ness just in case perl is running in some (oddball partial)
>binary emulation environment). Erm. Or something like that.

It's not necessarily in a register. In at least some of the languages I 
named (and you can add BASIC and pascal to the list as well), a string 
consists of a length and data pointer pair, usually together. What's handy 
is a pointer to the data structure, not the length and a pointer to the buffer.

Of course, for some of those languages the lengths are 16-bit quantities. Damn.

>I forget the source of the quote, but it was to the effect of
>C is the only language where not just the binaries but also the source is
>not portable.
>
>Say you'd said 2 not 4.
>
>struct counted_file {
>   short count;
>   struct  {
> char  bytes[1];
>   } file;
>};
>
>
>erm. can't have bytes[0]; because that's not portable.

That'd probably be:

   struct counted_string {
     int length;
     char data[];
   };

which is legal ANSI C. Not that it helps with the size of an int issue, though.

>Can't really be short because who said that that was 2 bytes?
>For that matter I know of one compiler which doesn't have any type
>with sizeof 2, and sizeof (struct counted_file) is 8 here on this arm machine
>:-) Weird but ANSI compliant alignment constraints.
>[yes, I forced that one using the second struct inside the first]

Y'know, I really loathe C. Really, really, loathe it.

Anyway, regardless of the platform, there is *some* way to force this to 
work--if there weren't, then implementing things like a TCP stack would be 
pretty much impossible.

Counted strings should probably just have either a platform-native int in 
front, or a 32-bit int in network format, both of which should be doable on 
any platform that perl deals with.
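The network-format option can be sketched in C without relying on any integer type being exactly four bytes wide. The length is read and written one octet at a time, most significant first, so the result does not depend on the host's endianness or struct padding. The helper names (counted_length, counted_put_length, counted_data) are hypothetical, and the sketch assumes each char in the buffer holds one 8-bit octet of the prefix.

```c
#include <assert.h>
#include <stddef.h>

/* Decode a 4-octet big-endian (network order) length prefix.
 * unsigned long is guaranteed to hold at least 32 bits. */
unsigned long counted_length(const unsigned char *p)
{
    return ((unsigned long)(p[0] & 0xFF) << 24)
         | ((unsigned long)(p[1] & 0xFF) << 16)
         | ((unsigned long)(p[2] & 0xFF) << 8)
         |  (unsigned long)(p[3] & 0xFF);
}

/* Encode a length (up to 32 bits) as a 4-octet big-endian prefix. */
void counted_put_length(unsigned char *p, unsigned long len)
{
    p[0] = (unsigned char)((len >> 24) & 0xFF);
    p[1] = (unsigned char)((len >> 16) & 0xFF);
    p[2] = (unsigned char)((len >> 8) & 0xFF);
    p[3] = (unsigned char)(len & 0xFF);
}

/* The source text starts immediately after the 4-octet prefix. */
const unsigned char *counted_data(const unsigned char *p)
{
    return p + 4;
}
```

Because only shifts and masks on unsigned long are used, the same code should behave identically on a machine with no 2-byte or 4-byte integer type.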

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: To get things started...

2000-11-28 Thread Nick Ing-Simmons

Nicholas Clark <[EMAIL PROTECTED]> writes:
>On Mon, Nov 27, 2000 at 05:17:47PM +, Nicholas Clark wrote:
>> On Mon, Nov 27, 2000 at 11:09:03AM -0500, Chaim Frenkel wrote:
>> > > "ST" == Sam Tregar <[EMAIL PROTECTED]> writes:
>> 
>> > Look through the RFCs; this was one of Damian Conway's.
>> > 
>> >  =~ /RFC/
>> 
>> http://dev.perl.org/rfc/93.html
>> 
>> I know I read it, I just don't remember reading it.
>> 
>>   IMPLEMENTATION
>> 
>>   Dammit, Jim, I'm a doctor, not a magician! 
>> 
>>   Probably needs to be integrated with IO disciplines too. 
>> 
>> He's right, but Nick's intending to implement an unread() (rather than just
>> ungetc()) so there should be enough rope for people to implement whatever
>> knots take their fancy (including the Jack Ketch knot)
>> 
>> Hugo makes some comments about implementation of this in:
>> http:[EMAIL PROTECTED]/msg00459.html
>
>Bah. meant to add that it might be logical for
>
> =~ /RFC/
>
>to seek to the beginning of the file before it starts
>
> =~ /\GRFC/gc
>
>carries on from the previous position and doesn't seek back to the beginning
>(or otherwise throw all the buffered data away)
>
>Which effectively makes pos analogous to seek/tell.

I was musing on how to make "layers" visible to perl code.
And using pos() to point at the current position in the buffer
(note the _buffer_ not the _file_) was one idea I came up with.

>So do we get rid of pos and seek() our scalars? :-)

Keep pos() and lose seek ;-)

>It also allows the possibility of pos on file handles being fsetpos/fgetpos
>Maybe that should have been an rfc 3 months ago, and really doesn't even
>matter if perlio obsoletes stdio and internalises stdio's distinction between
>text and binary streams.
>
>BTW I am serious about needing a /gc not to chuck the buffered data.
>
>It makes something like
>
>   @found =  =~ /RFC +(\d+)/;
>
>not spend time stacking a lot of data back that's only about to be discarded.
>
>But this isn't internals really, is it? I'm miles off topic.
>
>Nicholas Clark
-- 
Nick Ing-Simmons <[EMAIL PROTECTED]>
Via, but not speaking for: Texas Instruments Ltd.




Re: The external interface for the parser piece

2000-11-28 Thread Dan Sugalski

At 04:23 PM 11/28/00 -0500, Chaim Frenkel wrote:
>Err, this seems a little too Swiss Army Knife.
>
>This reads like a utility function. (i.e. A function that handles the
>most common scenario.)

What it's supposed to be is the highest-level interface to the parser, and 
so it's supposed to handle all the common cases without requiring whoever's 
using it to read more than half a page of documentation, total.

>Shouldn't a set of lower level visible API be visible? One that seems
>to pop out at me is some way of actually parsing a piece of code and
>ending up with a handle on a syntax tree. And ways of adding and removing
>these pieces.

Sure. That would be the internal API bit. Nobody's put anything solid 
forward yet for that bit.

Anyone? Anyone? Bueller?

>These are abstract functions that would be needed on the interior of the
>parser, but a bottom up approach may be more appropriate here.

Sure. Suggestions?

>I also like the suggestion that rather than supply flags, we should
>follow the lead and supply a Perl* something that would return an
>appropriate bunch of text to the parser.

I'd really rather not, since that would place the burden of knowing too 
much about the guts of perl on whoever's using it. I don't want to do that.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: Backtracking through the source

2000-11-28 Thread raptor

> Is there any reasonable case where we would need to backtrack over
> successfully parsed source and redo the parsing? I'm not talking about the
> case where regular expressions run over text and ultimately fail, but
> rather cases where we need to chuck out part of what we have and restart?

]- I think we should have this possibility. Of course, if something can be
solved without backtracking, it will be solved that way. The perl parser
will parse not only Perl but also many other target languages, and one more
feature (backtracking, score-based decisions, lookbehind, etc.) will make
it better, not worse.
=
iVAN
[EMAIL PROTECTED]
=




Re: The external interface for the parser piece

2000-11-28 Thread Dan Sugalski

At 09:48 AM 11/28/00 -0800, Steve Fink wrote:
>Dan Sugalski wrote:
> >
> >int perl6_parse(PerlInterp *interp,
> >void *source,
> >int flags,
> >void *extra_pointer);
>
>Given that other things may want to be streamable in similar fashion (eg
>the regular expression engine), why not have a PerlDataSource union or
>somesuch that encapsulates all of the possibilities of the final three
>arguments? Or put all possibilities into a PerlIO*? That gives direct
>support for compressed source, source streamed over a network socket,
>etc., with a more common framework than PERL_GENERATED_SOURCE.

Embedding is the big reason. This interface should be simple for embedding 
programs, most of which will either pass in a C filehandle or a plain char* 
with source in it. That's why there's no fancy structures or anything that 
go in. (Well, besides the perlinterp structure, but that's pretty much a 
magic cookie as far as programs are concerned)

>Things like PERL_CHAR_SOURCE meaning nul-terminated char* sound
>unnecessarily specific.

Well, it is the most common type of string that perl's going to see, which 
is why it's in there. UTF-8's the next most likely one, hence the flag.

>Also, you gave two options: nul-terminated and length-first. What about
>a "chunked" encoding, where you get multiple length-first chunks of the
>input (as in HTTP/1.1's Transfer-Encoding: chunked, for one example of
>many)? Or are nuls explicitly forbidden in source code?

Nulls aren't explicitly forbidden, but they're real inconvenient in C-style 
strings, hence the length option. (Plus we might be able to do Clever 
Things if we know the length) I'm not sure how UTF-8 jammed into C strings 
works either, since IIRC there can be null bytes in a UTF-8 data stream.

Nulls are OK in the source on disk, though they're still annoying inside a 
C program. (Like, say, perl... :)
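
On the UTF-8 point: going by the encoding rules, every byte of a multi-byte UTF-8 sequence has its high bit set, so a zero byte should only ever show up as the encoding of U+0000 itself. A small sketch that checks this; the function names are made up, and the encoder does not bother to reject surrogates:

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point (0..0x10FFFF) as UTF-8 into out[4].
   Returns the number of bytes written, or 0 if cp is out of range. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000UL) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Returns 1 if the UTF-8 encoding of cp contains a zero byte, else 0. */
int utf8_has_nul(unsigned long cp)
{
    unsigned char bytes[4];
    size_t i, n = utf8_encode(cp, bytes);

    assert(n > 0);   /* caller must pass a code point in range */
    for (i = 0; i < n; i++)
        if (bytes[i] == 0)
            return 1;
    return 0;
}
```

So a nul-terminated char* can carry UTF-8 source as long as the source itself contains no U+0000; the length option is still what you need when it does.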

>And, in a related question, the above interface appears that you call
>perl6_parse once. Will this be good enough, or do you want to have a
>PerlParseState* in/out parameter that allows restarting a parse once you
>get more of the input available? (With this, you don't need an explicit
>chunked encoding, since the caller can deal with that without being
>required to buffer the whole thing in memory before calling
>perl6_parse.) Or would that go into the PerlInterp too?

What I was thinking, but didn't say, is that for the PERL_GENERATED_SOURCE 
case we'd just call the function provided over and over until it returns 
NULL, at which point we assume it's all done. So for the chunked text case, 
each call to the function would return a chunk, and the function would 
return NULL when it's run out of chunks.
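
A minimal sketch of that calling convention, with a stand-in for the parser's read loop. The callback type, the state struct and drain_source are all hypothetical names, and the chunks here are assumed to be nul-terminated for brevity:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: returns the next chunk of source text,
   or NULL once there is nothing left. */
typedef const char *(*source_fn)(void *extra);

/* State handed to the callback through the extra pointer. */
struct chunk_state {
    const char **chunks;
    size_t next;
    size_t count;
};

const char *next_chunk(void *extra)
{
    struct chunk_state *st = extra;

    if (st->next >= st->count)
        return NULL;
    return st->chunks[st->next++];
}

/* Stand-in for the parser: call fn until it returns NULL, concatenating
   the chunks into out. Returns the total length, or (size_t)-1 if out
   is too small to hold everything plus a terminator. */
size_t drain_source(source_fn fn, void *extra, char *out, size_t outsize)
{
    size_t used = 0;
    const char *chunk;

    assert(fn != NULL && out != NULL && outsize > 0);
    while ((chunk = fn(extra)) != NULL) {
        size_t n = strlen(chunk);
        if (n >= outsize - used)
            return (size_t)-1;
        memcpy(out + used, chunk, n);
        used += n;
    }
    out[used] = '\0';
    return used;
}
```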

>And finally, how do I get the output out of the PerlInterp? Is it stored
>under some variable name, or does the PerlInterp start out empty and
>gains the parsed syntax tree as its only syntax tree, or ? (The latter
>sounds messy if the PerlInterp is also running code, code that wants to
>call some standard utility functions implemented in Perl.) Maybe I'm not
>making sense.

It's stored in the PerlInterp structure. Where I don't know, but that can 
be put off for later.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Basic embedding [was: Re: The external interface for the parser piece]

2000-11-28 Thread Benjamin Stuhl


--- Steve Fink <[EMAIL PROTECTED]> wrote:
> Dan Sugalski wrote:
> > 
> >int perl6_parse(PerlInterp *interp,
> >void *source,
> >int flags,
> >void *extra_pointer);
> 
> Given that other things may want to be streamable in
> similar fashion (eg
> the regular expression engine), why not have a
> PerlDataSource union or
> somesuch that encapsulates all of the possibilities of
> the final three
> arguments? Or put all possibilities into a PerlIO*? That
> gives direct
> support for compressed source, source streamed over a
> network socket,
> etc., with a more common framework than
> PERL_GENERATED_SOURCE.

Hear, hear! This is almost an embedding issue, though
(cc-ing perl6-internals-api-embed): How much of the
standard perl RTL is _required_ (I.e., PerlIO, perl malloc,
etc.). Offhand, I think that there is a very strong case to
require at least the basic PerlIO, since without it, perl6
can't count on having a non-bugridden I/O library, and also
can't take advantage of PerlIO's non-std. features
(whatever they end up being).
 
> Things like PERL_CHAR_SOURCE meaning nul-terminated char*
> sound
> unnecessarily specific.
> 
> Also, you gave two options: nul-terminated and
> length-first. What about
> a "chunked" encoding, where you get multiple length-first
> chunks of the
> input (as in HTTP/1.1's Transfer-Encoding: chunked, for
> one example of
> many)? Or are nuls explicitly forbidden in source code?
> 
> And, in a related question, the above interface appears
> that you call
> perl6_parse once. Will this be good enough, or do you
> want to have a
> PerlParseState* in/out parameter that allows restarting a
> parse once you
> get more of the input available? (With this, you don't
> need an explicit
> chunked encoding, since the caller can deal with that
> without being
> required to buffer the whole thing in memory before
> calling
> perl6_parse.) Or would that go into the PerlInterp too?
> 
> And finally, how do I get the output out of the
> PerlInterp? Is it stored
> under some variable name, or does the PerlInterp start
> out empty and
> gains the parsed syntax tree as its only syntax tree, or
> ? (The latter
> sounds messy if the PerlInterp is also running code, code
> that wants to
> call some standard utility functions implemented in
> Perl.) Maybe I'm not
> making sense.

This sort of leads into an idea I've been having about what
defines an interpreter. I've sort of been musing on the
following embedding interface:

/* inits subsystems: PerlIO,memory,etc. call once at start
of program */
int perl_boot(); 

/* subsystem shutdown - call at program shutdown */
int perl_shutdown();

/* a perl6 interpreter - defines complete interpreter*/
typedef struct _perl_interp perl_interpreter;
typedef struct _perl_thread perl_thread;
struct _perl_interp {
   perl_thread *thread_list;
   perl_thread *root_thread; /* "top-level" thread - used
to parse the primary script (or provide an arbitrary
perl_thread for embedders) */
   HV *shared_stash;  /* subroutines are global to an
interpreter */
   HV *subroutine_stash;
...
};
/* a thread of execution in a perl_interpreter - contains the
thread's stash and stacks */
struct _perl_thread {
   perl_interpreter *threads_interp;
   OP *pc;
   SV *sp;
   HV *thread_stash;
   void *save_stack;
   RE_context *RE_data;
   perl_parser_state *parser;
...
};

/* creates an interpreter */
perl_interpreter * perl_create_interp(int flags);

/* ... ways of calling in (parse command line, call code,
etc.)... */

/* destroy and free an interpreter */
void perl_delete_interp(perl_interpreter *);

/* the embedder is expected to provide the following */
/* get this OS thread's current perl_thread */
perl_thread* perl_fetch_thread();
/* set this OS thread's current perl_thread (called in
Thread->new &co.) */
perl_thread* perl_set_thread();
/* get the perl_thread who will handle signals */
perl_thread* perl_get_sig_thread();

The idea behind the perl_interpreter/perl_thread separation
is that perl6 internal calls will actually pass a
perl_thread * around, since that is the basic unit of
execution, and if bytecode/optree is to be shared between
threads (as I devoutly hope it will be), there needs to be
something to aggregate a group of perl_threads. 

To go back to parser API design, I think that
perl6_parse_perl should take a perl_thread* to provide
context for sub {} declarations, parse errors, &co.
Top-level code would be treated as either the top-level
script, or an eval'', depending on the flags. 

-- BKS




Re: The external interface for the parser piece

2000-11-28 Thread Steve Fink

Dan Sugalski wrote:
> 
> Sure. Suggestions?

int perl6_parse(PerlInterp* interp, PerlIO* input);
PerlIO* make_memory_stream(char* buf, ssize_t length); // length=-1 for nul-terminated
int close_stream(PerlIO* stream);

then if you read further, you'll eventually see:

PerlIO* make_callback_stream(int (*f)(char* buf, int space, void* other),
                             void* other);

Or maybe the first thing you see is just:

int perl6_parse_string(PerlInterp* interp, char* buf, ssize_t length);
// length=-1 for nul-term

if that really is 95% of the cases.

I guess I just think that when discussing perl6_parse, it's less effort
to mention the existence of make_memory_stream and close_stream than it
is to explain the meaning of three mystery parameters. Especially if
that knowledge can be reused for half a dozen other API calls. 

perl6_Scalar* perl6_eval_scalar(PerlInterp*,PerlIO*);
perl6_List* perl6_eval_list(PerlInterp*,PerlIO*);
// Checks magic number for .pm vs .pmc, BOM, gzip...
int perl6_load_module(PerlInterp*, PerlIO*);



Re: The external interface for the parser piece

2000-11-28 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
  Dan Sugalski <[EMAIL PROTECTED]> wrote:

> Right, and I called my abstract stream "void *source". :)

It isn't really abstract though as it only understands types of streams
that the parser author had thought of. An abstract stream would have a
vtable or something so that the parser didn't have to know anything
about where the data was coming from, thus decoupling the parser more
from the text it is parsing. It would also be typesafe.
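
A rough sketch of what I mean by a vtable, with one string-backed implementation. All of the names here are invented for illustration, and the "parser" is reduced to a loop that just counts bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct perl_stream perl_stream;   /* hypothetical name */

struct perl_stream_vtable {
    /* read up to n bytes into buf; returns bytes read, 0 at end of input */
    size_t (*read)(perl_stream *self, char *buf, size_t n);
    void   (*close)(perl_stream *self);
};

struct perl_stream {
    const struct perl_stream_vtable *vt;
};

/* One concrete stream type: source held in memory. The base struct is
   the first member, so a pointer to it can be cast back safely. */
struct string_stream {
    perl_stream base;
    const char *data;
    size_t len;
    size_t pos;
};

static size_t string_read(perl_stream *self, char *buf, size_t n)
{
    struct string_stream *s = (struct string_stream *)self;
    size_t left = s->len - s->pos;

    if (n > left)
        n = left;
    memcpy(buf, s->data + s->pos, n);
    s->pos += n;
    return n;
}

static void string_close(perl_stream *self)
{
    (void)self;   /* nothing to release for an in-memory stream */
}

static const struct perl_stream_vtable string_vt = { string_read, string_close };

void string_stream_init(struct string_stream *s, const char *data, size_t len)
{
    s->base.vt = &string_vt;
    s->data = data;
    s->len = len;
    s->pos = 0;
}

/* Stand-in for the parser: it only ever touches the vtable, so it does
   not know or care where the bytes come from. */
size_t stream_count_bytes(perl_stream *in)
{
    char buf[4];
    size_t total = 0, got;

    assert(in != NULL && in->vt != NULL);
    while ((got = in->vt->read(in, buf, sizeof buf)) > 0)
        total += got;
    in->vt->close(in);
    return total;
}
```

A file, socket or callback stream would supply its own read and close functions and the consumer would not change.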

> That means another function in the API. I suppose perl_parse_string() and
> perl_parse_file() are valid options. I'd rather keep the API that embedders
> will be using as small as possible, but two functions with simple names and
> parameters may be better than one function with mildly odd parameters in the
> non-trivial case.

I would probably suggest something like this:

  int perl_parse(PerlInterp *interp, PerlStream *source)
  {
      ...
  }

  int perl_parse_string(PerlInterp *interp, const char *source)
  {
      PerlStream *stream = new_string_stream(source);

      return perl_parse(interp, stream);
  }

  int perl_parse_file(PerlInterp *interp, const char *filename)
  {
      PerlStream *stream = new_file_stream(filename, "r");

      return perl_parse(interp, stream);
  }

You are of course quite right that it adds functions to the API but
is the number of functions in the API critical? I would have thought
that the above provides a good trade off between simplicity for most
people and power for those that need it whilst still maintaining
type safety and maximum extensibility for things we haven't thought
of yet.

We might also still want a flags word to each of those routine for
things like your UTF8 flag.

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
...F u cn rd ths u cnt spl wrth a dm!




Re: Backtracking through the source

2000-11-28 Thread Tom Hughes

In message <[EMAIL PROTECTED]>
  Simon Cozens <[EMAIL PROTECTED]> wrote:

> Parsing Perl is not easy. :)

You can say that again ;-)

> At some points, you have to say, well, heck, I don't *know* what this token
> is. At the moment, perl guesses, and it guesses reasonably well. But
> guessing something wrongly which you could have got right if you'd read the
> next line strikes me as a little anti-DWIM.

Quite likely you're right. I can't say I have much experience of
parsers that do this but we can always blaze a new trail in our
efforts to parse perl.

> In a sense, though, you're right; this is a general problem. I'm currently
> trying to work out a design for a tokeniser, and it seems to me that
> there's going to be a lot of communicating of "hints" between the
> tokeniser, the lexer and the parser.

You have to be very careful about downward communication from the
parser to the lexer if there's any lookahead involved as you can
find that you're trying to affect the lexing of tokens which are
already in the lookahead buffer of the parser.

Backtracking may well be better than lookahead here as you can always
jump back a bit after you change the lexer's state ;-)

> Parsing Perl is hard. Trust me. :)

Oh, you did say it again...

Parsing Fortran is fun as well. Whoever decided to allow spaces in
identifiers needs their head read...

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
...Who's on first?




Re: Backtracking through the source

2000-11-28 Thread Steve Fink

Tom Hughes wrote:
> 
> In message <[EMAIL PROTECTED]>
>   Simon Cozens <[EMAIL PROTECTED]> wrote:
> 
> > In a sense, though, you're right; this is a general problem. I'm currently
> > trying to work out a design for a tokeniser, and it seems to me that
> > there's going to be a lot of communicating of "hints" between the
> > tokeniser, the lexer and the parser.
> 
> You have to be very careful about downward communication from the
> parser to the lexer if there's any lookahead involved as you can
> find that you're trying to affect the lexing of tokens which are
> already in the lookahead buffer of the parser.
> 
> Backtracking may well be better than lookahead here as you can always
> jump back a bit after you change the lexer's state ;-)

The difference may be more illusory than real. In either case, you have
to undo something: your recursion stack for backtracking, some parser
state for lookahead. And in both cases, undoing those things is a hell
of a lot easier than undoing the code that was run when you recognized
(or thought you recognized) some chunk of tokens as an anonymous sub or
whatever. For example, say you stuck an entry for the subroutine into
the package's symbol table. You'd better have kept the original, in case
you were wrong -- but you might not want to keep all originals, or
you'll blow your memory. Perhaps we can avoid doing anything significant
during parsing (when will BEGIN{} run?), but perhaps not.

Handling the parser's state can be done in a backtracking DFA-like or a
direct NFA-like way. The NFA way is to keep track of all possible parse
states and advance each one in parallel based on the next token. The DFA
way is recursive descent, backing out of blind alleys and trying again,
keeping a single working hypothesis alive at a time. The DFA approach is
probably easier to undo user code in, because in the NFA case you have
to consider each token under the assumptions of all possible parses up
to that point. The NFA case has the advantage that you never have to
back up, so you can permanently forget about a token as soon as it
whizzes by.
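
The single-hypothesis, back-out-and-retry style can be sketched on a toy grammar (S := X 'b' | X 'c', with X := 'a'+). This is only an illustration with made-up names; note that rewinding the input position is the easy part, and nothing here has side effects that would need undoing:

```c
#include <assert.h>
#include <stddef.h>

struct parser {
    const char *src;   /* nul-terminated input */
    size_t pos;        /* current read position */
};

/* Consume c if it is the next character; returns 1 on success. */
static int eat(struct parser *p, char c)
{
    if (p->src[p->pos] == c) {
        p->pos++;
        return 1;
    }
    return 0;
}

/* X := 'a'+ */
static int parse_x(struct parser *p)
{
    if (!eat(p, 'a'))
        return 0;
    while (eat(p, 'a'))
        ;
    return 1;
}

/* S := X 'b' | X 'c'
   Returns 1 or 2 for the alternative that matched, 0 if neither did.
   On failure the position is restored to where it started. */
int parse_s(struct parser *p)
{
    size_t mark;

    assert(p != NULL && p->src != NULL);
    mark = p->pos;
    if (parse_x(p) && eat(p, 'b'))
        return 1;
    p->pos = mark;           /* blind alley: back out and try again */
    if (parse_x(p) && eat(p, 'c'))
        return 2;
    p->pos = mark;
    return 0;
}
```

The parallel approach would instead carry both alternatives forward over the run of 'a's and drop whichever one the next character rules out.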

Perl5 is parseable with a single token of lookahead and lots of
parser/lexer communication. Sort of. It would be nice to prevent it from
getting any worse. We could pretend to support full DWIMmery by telling
the user when it fails:

10 print foo bar();
13 sub foo { ... }

DWIMmery badness 1: Sorry, but I screwed up by assuming 'foo' was a
direct object in line 10, and only found out on line 13. Would you mind
predeclaring 'foo' somewhere before line 10?

...but that would be weird.

print foo bar();
eval "sub foo { $code }";
print foo bar();



Re: The external interface for the parser piece

2000-11-28 Thread Bryan C. Warnock

On Tue, 28 Nov 2000, Dan Sugalski wrote:
> >I also like the suggestion that rather than supply flags, we should
> >follow the lead and supply a Perl* something that would return an
> >appropriate bunch of text to the parser.
> 
> I'd really rather not, since that would place the burden of knowing too 
> much about the guts of perl on whoever's using it. I don't want to do that.

You're going to need knowledge in either case - whether you're directly
setting flags or have a PerlFlags object (with its own limited interface,
I suppose) that you pass in.   The advantage of the object is that you aren't
limited to just flags down the road, which may cut down on the number of
overall API calls that exist.

Of course, if you've got a couple dozen actual flags, you may want to combine
the two:

PerlFlags *flags_and_such;

PAPI_set_flags(flags_and_such, PL_DONT_CRASH | PL_RUN_FASTER | PL_DWIM);

PAPI_set_malloc_arena(flags_and_such, *malloc_func, *arena); 

return_code = perl6_parse(interp, source, flags_and_such, NULLP);


 -- 
Bryan C. Warnock
bwarnock@(gtemail.net|capita.com)