[PHP-DEV] Need two simple string funcs for parsing

2004-07-08 Thread jtl_phpdotnet
Hello,

My name is Joe Lapp, and I have written high-speed portal-side parsers in
Java for XML, HTML, and various other XML-related syntaxes (e.g. XQL).

I am planning a series of new parsing technologies that I'd like to 
implement in PHP.  To allow my parsers to perform with high efficiency in 
PHP, I need two new string functions.  One is identical to strpbrk() but 
would also take a starting-offset parameter.

Here are the two new functions:

/* strpbrk -- Returns the offset into a string of the first occurrence of 
any character found in a list of provided characters, optionally scanning 
the string starting from a provided string offset. */

strpbrk(string haystack, string char_list [, int starting_offset])

/* strnpbrk -- Returns the offset into a string of the first occurrence of a
character NOT found in a list of provided characters, optionally scanning 
the string starting from a provided string offset. */

strnpbrk(string haystack, string char_list [, int starting_offset])

In other words, strpbrk() would function as it does currently, but it would 
take a starting_offset.  strnpbrk() would be almost identical to this new 
strpbrk(), except that it skips over characters found in the provided 
character list and returns the position of the first character that is not 
in the list.

(BTW, I'm not really fond of C-lib style cryptic names.  I'd much prefer
string functions with readable names that are also good mnemonics.
Maybe scan_for_char() and skip_over_chars() would be better names.)
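
To make the intended behavior concrete, here is a minimal userland sketch of
the two functions under the readable names floated above.  It assumes
byte-oriented (single-byte) strings and leans on the existing strcspn() and
strspn() functions, which accept a start offset.  Returning false when
nothing is found is an assumption, not part of the proposal.

```php
<?php
// Sketch only: scan_for_char() and skip_over_chars() are proposed names,
// not built-in PHP functions.

// Offset of the first byte at or after $offset that IS in $char_list,
// or false if there is none.
function scan_for_char(string $haystack, string $char_list, int $offset = 0)
{
    $len = strlen($haystack);
    if ($offset < 0 || $offset >= $len) {
        return false;
    }
    // strcspn() counts the run of bytes NOT in $char_list.
    $pos = $offset + strcspn($haystack, $char_list, $offset);
    return $pos < $len ? $pos : false;
}

// Offset of the first byte at or after $offset that is NOT in $char_list,
// or false if every remaining byte is in the list.
function skip_over_chars(string $haystack, string $char_list, int $offset = 0)
{
    $len = strlen($haystack);
    if ($offset < 0 || $offset >= $len) {
        return false;
    }
    // strspn() counts the run of bytes that ARE in $char_list.
    $pos = $offset + strspn($haystack, $char_list, $offset);
    return $pos < $len ? $pos : false;
}
```

For instance, scan_for_char("abc<def>", "<>", 4) skips the first bracket and
reports the closing one.  The sketch does not handle the character ranges or
Unicode references discussed next.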

Ideally, these functions would also support a way to specify characters by 
their Unicode values and a way to specify a range of characters.  For 
example, "#8230;A-Z<>" would name the ellipsis character ("#8230;"), the 
characters from A to Z, and the angle bracket characters.
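
One possible reading of that character-list syntax can be sketched as a
small expander that turns a spec into a list of code points.  The helper
name expand_char_spec() is hypothetical, and the sketch assumes the literal
characters in the spec are ASCII, that "#NNNN;" is a decimal code point, and
that "X-Y" is an inclusive range.

```php
<?php
// Hypothetical helper: expand a spec such as "#8230;A-Z<>" into a sorted
// list of integer code points.  Literal characters are assumed to be ASCII.
function expand_char_spec(string $spec): array
{
    $points = [];
    $len = strlen($spec);
    $i = 0;
    while ($i < $len) {
        // "#NNNN;" names one character by its decimal code point.
        if ($spec[$i] === '#' && preg_match('/\G#(\d+);/', $spec, $m, 0, $i)) {
            $points[] = (int) $m[1];
            $i += strlen($m[0]);
            continue;
        }
        $lo = ord($spec[$i]);
        // "X-Y" names every character from X to Y inclusive.
        if ($i + 2 < $len && $spec[$i + 1] === '-') {
            foreach (range($lo, ord($spec[$i + 2])) as $cp) {
                $points[] = $cp;
            }
            $i += 3;
            continue;
        }
        // Anything else is a single literal character.
        $points[] = $lo;
        $i++;
    }
    $points = array_values(array_unique($points));
    sort($points);
    return $points;
}
```

A native implementation would more likely build a lookup table once per
call than materialize a list, but the parsing rules would be the same.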

The significance of these functions is purely processing speed.  They would 
allow me to create high-speed parsers and distribute them as uncompiled PHP.  
If the functions are implemented properly, using them should produce much 
faster code than the equivalent compiled PHP.  The starting offset is
necessary to avoid creating a proliferation of substrings that would 
significantly slow down parsing speed.
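
The role of the starting offset can be illustrated with a scanning loop
that advances an integer position through the document instead of
repeatedly slicing off the unparsed remainder with substr().  The function
name find_delimiters() is made up for this sketch, and it uses the existing
strcspn() as a stand-in for the proposed function.

```php
<?php
// Sketch: collect the offsets of every delimiter byte in $doc by moving an
// offset forward.  No substring of $doc is ever created.
function find_delimiters(string $doc, string $delims = '<>'): array
{
    $offsets = [];
    $len = strlen($doc);
    $pos = 0;
    while ($pos < $len) {
        // Jump over the run of non-delimiter bytes starting at $pos.
        $pos += strcspn($doc, $delims, $pos);
        if ($pos >= $len) {
            break; // reached the end without finding another delimiter
        }
        $offsets[] = $pos;
        $pos++; // resume scanning just past the delimiter
    }
    return $offsets;
}
```

Without an offset parameter, each iteration would need something like
$doc = substr($doc, $n), which copies the remainder of the document every
time a delimiter is found.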

What are the odds that we can get such functions into PHP 5?  I am planning 
a high-speed XML filtering technology for XML-replication servers in PHP.  I 
want to release this engine for free, along with a particular application of 
it that I think could create a whole new mode of using the net.  Speed
is very important because of the amount of XML being processed.  I cannot
use existing XML processors for the filtering function I have in mind.  In
any case, these two new functions would allow people to easily create any
sort of high-speed parser.

I fear that without these functions, I'd have to distribute this new server 
as compiled PHP and perhaps require faster server hardware (more clock cycles 
available to the user per unit time) than most users currently have.  Maybe
that's not a problem, except perhaps for my wallet.  I don't know what sort
of Zend license I'd require to be able to distribute free pre-compiled code.

I am also an experienced C/C++ programmer and can write these functions 
myself.  Before doing so, though, I'd like to know if I should bother.  
Would they make it into PHP 5?

Thanks for your help!
~joe

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



[PHP-DEV] Lamenting PHP's streaming support...

2004-08-01 Thread jtl_phpdotnet
Hi everyone,

I'm trying to write some serious parsing applications in PHP.  I find myself 
frequently lamenting the 4GL-like support for buffered streams.  I'd rather have a 
full-fledged streaming API with stream handles (or objects) like you get in mature 
3GLs such as C and Java.

I'm making do with the single character-stream buffer available to me in the "output 
buffer."  I wrap this stream in classes that emulate distinct character streams by 
saving the current output buffer, clearing the output buffer for the new virtual 
stream, and then restoring the original output buffer when the virtual stream is 
closed.
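
The save/clear/restore scheme described above can be sketched roughly as
follows.  The class name VirtualStream and its close() method are
hypothetical, and the sketch assumes an output buffer is already active
(ob_start() has been called).

```php
<?php
// Sketch of a "virtual stream" layered on the single output buffer.
// Assumes output buffering is already active.
class VirtualStream
{
    /** @var string Contents the output buffer held before this stream opened. */
    private $saved;

    public function __construct()
    {
        // Save what has been written so far, then empty the buffer so the
        // new virtual stream starts clean.
        $this->saved = ob_get_contents();
        ob_clean();
    }

    // Return what was written to this virtual stream and put the original
    // buffer contents back.
    public function close(): string
    {
        $contents = ob_get_contents();
        ob_clean();
        echo $this->saved; // rewrite the old contents into the buffer
        return $contents;
    }
}
```

Each open and close copies the saved contents into a new string and then
writes it back.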

This works, but it costs in overhead and requires repeatedly creating string objects 
to store old buffers and then rewriting those objects back to the output buffer.  This 
is less than ideal from both a performance standpoint and a complexity standpoint (and 
it increases the potential for weird errors).

I'm not too concerned about the performance issues of these virtual buffers because I 
can architect the application so that it minimizes these switches.  However, I find 
myself (so far) unable to architect around another serious performance issue.

I'm having to create a new string for each character sequence that I write to the 
output buffer.  I'd rather just copy the substring of the document being parsed 
directly to the output buffer.  Object creation is an expensive activity when 
thousands of objects need to be created for a single page hit.

All I need to deal with this problem is a new PHP function:

ob_write($string, $start, $length)

This would write the characters in substr($string, $start, $length) to the output 
buffer without creating an intermediate string object.
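
For reference, the intended semantics can be written in userland as below.
ob_write() is the proposed function, not an existing one, and this
emulation still creates the intermediate substring, so it shows only the
behavior, not the performance benefit being requested.

```php
<?php
// Userland stand-in for the proposed ob_write().  A native version would
// copy bytes straight from $string into the output buffer; this one cannot
// avoid the temporary string returned by substr().
function ob_write(string $string, int $start, int $length): void
{
    echo substr($string, $start, $length);
}
```

A call such as ob_write($doc, 6, 5) would append five bytes of $doc,
starting at offset 6, to whatever the output buffer already holds.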

Is there anything on the horizon that would give me the kind of streaming support I'm 
looking for?

Thanks for your help!
~joe




[PHP-DEV] Substring writes and buffered char streams

2004-08-07 Thread jtl_phpdotnet
Hello PHP gurus,

The php-general list does not believe that PHP allows me to do either of the following:

(1) Writing an arbitrary substring of a string directly to a stream without first 
creating a string object for the substring.  That is, there is no print($string, 
$start, $length) or fwrite($resource, $string, $length, $start).

(2) Creating multiple independent buffered character streams.  It appears that stdout 
is the only instance available.

I need to be able to do the first to prevent a costly proliferation of string objects 
when parsing an input string and producing a new output string from its substrings.  
My experience writing Java parsers for business portals clearly demonstrates that 
object creation, and particularly string creation, is a limiting factor to throughput.  
Fixing this problem in PHP seems easy: we just need to add an optional start-offset 
parameter to fwrite().

I don't think I absolutely need the second feature, as I'm emulating multiple buffered 
character streams by saving and restoring the contents of stdout (via the output 
buffer) when switching between instances.  I just have to keep the application smart 
about switching so that I can minimize the switch costs.  However, it's possible that 
this could become an issue.

I can't create or use a PHP module since my customers generally only have FTP access 
to their web sites -- not even telnet, much less the ability to customize their PHP 
configuration.

So, is it truly impossible to do these things?  If so, when could I hope to see these 
features -- or at least feature (1) -- ship with core PHP?

Thank you for your help!
~joe




Re: [PHP-DEV] Substring writes and buffered char streams

2004-08-07 Thread jtl_phpdotnet
Hi Wez.  Here are some clarifications...

On 8/7/2004 Wez. wrote:
>I suppose we could add that. Keep in mind that strings in PHP aren't
>hugely expensive unless you are doing something wrong (tm) like using
>10MB strings.

Strings are cheap in Java too.  The issue is object creation and cleanup.  When the 
strings are very large or very numerous, we could be talking about thousands of 
substrings per page hit.  This increases the strain on both the CPU and the memory 
of the host machine.  Theory aside, in Java I can get as much as a tenfold improvement 
in throughput by avoiding these allocations.

>$fp = fopen(...) ?
>$fp = tmpfile() ?

Right, I should have mentioned this possibility.  The main reason for taking these 
precautions is throughput.  I need in-memory streams.  Can I create a memory-only file?
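
For readers on later PHP releases (5.1 and up), the php://memory stream
wrapper provides this kind of memory-only file.  The sketch below opens two
independent in-memory streams; the variable names are illustrative only.

```php
<?php
// Two independent in-memory streams; nothing touches the disk.
$a = fopen('php://memory', 'w+');
$b = fopen('php://memory', 'w+');

fwrite($a, 'first stream');
fwrite($b, 'second stream');

// Rewind before reading back what was written.
rewind($a);
rewind($b);
$fromA = stream_get_contents($a);
$fromB = stream_get_contents($b);

fclose($a);
fclose($b);
```

The related php://temp wrapper behaves the same way but spills to a
temporary file once the data grows past a size threshold.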

Thanks for your help!
~joe
