Paul Tremblay wrote:
> 
> I just finished my first version of a script that converts rtf to
> xml and was wondering if I went about writing it the wrong way.
> 
> My method was to read in one line at a time and split the lines
> into tokens, and then to read one token at a time. I used this
> line to split up the text:
> 
> @tokens = split(/({\\[^\s\n\}{]+)|(\\[^\s\n\\}]+)|(\\\\)|(})|(\\})/,$line);
> 
> Splitting up the text on my test file of 1.8 megabytes took 25
> seconds. The entire script took 50 seconds.

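A few points.  \n is already included in the \s character class, and
braces don't have to be backslashed inside a character class.  Put the
longer alternative before the shorter one whenever both could match at
the same position.  (Your (})|(\\}) is safe as written, because \\}
gets its chance at the backslash, before } is tried at the brace.)  The
(\\\\) and (\\}) alternatives can be merged into \\[\\}].  And finally,
you don't need parens around each alternation.  In split, every
capturing group returns a field for each separator, and the groups that
didn't match return undef, so five groups give you four undef fields per
token.  So your split could be simplified to: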

my @tokens = split /({\\[^\s}{]+|\\[^\s\\}]+|\\[\\}]|})/, $line;
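
As a rough sketch of how that split behaves: the sub name
tokenize_line and the sample line below are made up for illustration
and are not from your script.  split leaves an empty string wherever
two tokens are adjacent or the line starts with a token, so this
version filters those out:

```perl
use strict;
use warnings;

# Hypothetical helper: split one line of RTF into tokens with the
# simplified pattern.  The single pair of capturing parens makes
# split return the delimiters as well as the text between them.
sub tokenize_line {
    my ($line) = @_;
    my @tokens = split /({\\[^\s}{]+|\\[^\s\\}]+|\\[\\}]|})/, $line;
    # Drop the empty strings split produces between adjacent tokens
    # and before a token at the very start of the line.
    return grep { length } @tokens;
}

# Made-up sample line, not taken from a real RTF file.
my @tokens = tokenize_line('{\b Hello} world\par');
print join('|', @tokens), "\n";
# prints: {\b| Hello|}| world|\par
```

Whether this speeds up your 1.8 megabyte file is something you would
have to measure, for example with the Benchmark module.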


HTH

John
-- 
use Perl;
program
fulfillment
