Re: RegExp and XML

R. Joseph Newton Fri, 21 Feb 2003 17:06:24 -0800

Vincent O' Keeffe wrote:

> <?xml version="1.0"?>
> ...
>
> </Order>
> So, I need to remove everything before the opening <?xml string and, again, 
> everything after the closing </Order> tag. I thought about stripping out the first 4 
> and last 4 lines of the file but the messages sometimes arrive clean, and sometimes 
> with this extra info.


Hi Vincent,

Reread what you wrote above.  You have your algorithm there.

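my $line = <RAW_FILE>;
# Skip wrapper lines until the XML declaration; stop at end-of-file
while (defined $line and $line !~ /<\?xml/i) {$line = <RAW_FILE>}
# Copy lines up to, but not including, the closing Order tag
while (defined $line and $line !~ /<\/Order/) {
   print CLEAN_FILE $line;
   $line = <RAW_FILE>;
}
print CLEAN_FILE $line if defined $line;  #prints closing Order tag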


Of course, if you have more than one Order tag then you may have to look for other 
characteristic indications that you have moved into the wrapper material.  How do you 
tell, when looking at it, that you have passed the end of the xml?

I have never ventured into xml myself, but in HTML I always close with an </html> tag.  
That makes the job pretty easy.  If you don't have any such clear indication of the 
close of the xml, then you may need to use a buffer array to hold lines accrued while 
awaiting some further indication of whether the material should be included in the 
body.

Maybe buffer every line that arrives once all open blocks have been closed, until you 
reach either a new opening tag or end-of-file.  If you find a new opening tag, print 
all the lines in the buffer out to the clean file, then keep printing lines until the 
block thus opened is closed.  Then back to buffering ...  At end-of-file, whatever is 
still sitting in the buffer is wrapper material and can be dropped.
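
Here is a rough, untested-against-your-data sketch of that buffering idea.  The sub 
name clean_lines is made up for illustration, and it assumes that Order is the 
top-level element, that Order blocks are not nested, and that each opening or closing 
Order tag sits on a single line.  Adjust the patterns if your messages differ.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the raw lines of a message and returns only the
# lines from the <?xml declaration through the last closing </Order> tag.
sub clean_lines {
    my @raw = @_;
    my (@clean, @buffer);
    my $started  = 0;   # have we seen the <?xml declaration yet?
    my $in_block = 0;   # are we inside an open <Order> block?

    for my $line (@raw) {
        if (!$started) {
            next unless $line =~ /<\?xml/i;   # skip leading wrapper lines
            $started = 1;
        }
        if ($in_block) {
            push @clean, $line;
            $in_block = 0 if $line =~ m{</Order\s*>};
        }
        elsif ($line =~ /<Order\b/) {
            # A new opening tag: the buffered lines belong to the body after all
            push @clean, @buffer, $line;
            @buffer = ();
            $in_block = 1 unless $line =~ m{</Order\s*>};
        }
        else {
            push @buffer, $line;   # undecided: body or trailing wrapper?
        }
    }
    return @clean;   # lines still in @buffer are treated as wrapper and dropped
}
```

With the filehandles from the loop above, something like 
print CLEAN_FILE clean_lines(<RAW_FILE>); should do it, at the cost of reading the 
whole message into memory first.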

Joseph


