Paul Tremblay wrote:
>
> I just finished my first version of a script that converts rtf to
> xml and was wondering if I went about writing it the wrong way.
>
> My method was to read in one line at a time and split the lines
> into tokens, and then to read one token at a time. I used this
> line to split up the text:
>
> @tokens = split(/({\\[^\s\n\}{]+)|(\\[^\s\n\\}]+)|(\\\\)|(})|(\\})/,$line);
>
> Splitting up the text on my test file of 1.8 megabytes took 25
> seconds. The entire script took 50 seconds.

A few points. \n is already included in the \s character class. Braces
don't have to be backslashed inside a character class. When one
alternative is a prefix of another, put the longer one first, because
Perl takes the first alternative that matches at a given position. Your
(})|(\\}) pair happens to be safe as written, since the two start with
different characters, but the habit is worth keeping. And finally, you
don't need parens around each alternative: with five capture groups,
split returns an undef for every group that didn't match. So your split
could be simplified to:

my @tokens = split /({\\[^\s}{]+|\\[^\s\\}]+|\\[\\}]|})/, $line;
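
As a rough sketch of how that split behaves, here it is run on a
made-up one-line RTF fragment (the sample line and the @raw name are
only for illustration, not from your file). Because the pattern is
captured, split returns the separators along with the text between
them, including empty strings where two separators touch, so the sketch
filters those out:

```perl
use strict;
use warnings;

# Hypothetical sample line, not taken from the original test file.
my $line = '{\b Hello \}there\par}';

# One capture group around the whole alternation. The leading "{" is
# written "\{" here as a precaution: escaping it is always safe, and
# some perl versions complain about unescaped literal left braces in
# certain positions.
my @raw = split /(\{\\[^\s}{]+|\\[^\s\\}]+|\\[\\}]|})/, $line;

# Drop the empty fields split produces at the start of the line and
# between adjacent separators.
my @tokens = grep { length } @raw;

print "[$_]\n" for @tokens;
# prints:
# [{\b]
# [ Hello ]
# [\}]
# [there]
# [\par]
# [}]
```

Whether you want to keep the whitespace in text tokens like " Hello "
depends on what the rest of your script expects.
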
HTH
John
--
use Perl;
program
fulfillment