I'm not exactly sure what the problem is; however, here are a couple of things to try:

1.) If you don't need to save the value of each subexpression, tell perl so by writing (?:...) instead of (...), i.e. by putting ?: right after each opening paren.

2.) Usually alternation is much slower than running separate regexes. In your case, however, separating the regexes seems to be impossible.

3.) Have you tried m//g and processing one token at a time instead of saving them all into an array?
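Here is a rough sketch of how 1.) and 3.) might fit together. It is untested against real RTF and is not a drop-in replacement for your script. One thing worth knowing: when the split pattern contains capturing parens, split returns an extra field for every group at every separator (undef for the groups that did not match), so five groups mean a lot of mostly empty fields. In the sketch, the names $token_re and each_token are made up for illustration, and the last two branches of the pattern (plain text and the single-character fallback) are my additions so that m//g walks the whole line.

```perl
use strict;
use warnings;

# Same alternatives as the split() pattern, but with no capturing groups
# inside. The last two branches are additions made for this sketch.
my $token_re = qr/
      \{\\[^\s{}]+    # open brace followed by a control word
    | \\[^\s\\}]+     # control word
    | \\\\            # escaped backslash
    | \\\}            # escaped close brace
    | \}              # close brace
    | [^\\{}]+        # run of plain text, an addition
    | .               # any other single character, an addition
/xs;

# Hypothetical helper: hand each token to a callback as soon as it is
# matched, instead of building an array of tokens first.
sub each_token {
    my ($line, $handler) = @_;
    while ($line =~ /($token_re)/g) {
        my $token = $1;    # copy, in case the handler runs its own regexes
        $handler->($token);
    }
    return;
}

# Usage: print every token of a sample line, one per output line.
each_token('{\rtf1 Hello \b bold\b0 }', sub {
    my ($token) = @_;
    print "<$token>\n";
});
```

Whether this is actually faster than the split on your 1.8 megabyte file is something you would have to measure, for example with the Benchmark module.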
Good luck!

Tanton

----- Original Message -----
From: "Paul Tremblay" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, July 13, 2002 10:35 PM
Subject: script too slow?

> I just finished my first version of a script that converts rtf to
> xml and was wondering if I went about writing it the wrong way.
>
> My method was to read in one line at a time and split the lines
> into tokens, and then to read one token at a time. I used this
> line to split up the text:
>
> @tokens = split(/({\\[^\s\n\}{]+)|(\\[^\s\n\\}]+)|(\\\\)|(})|(\\})/,$line);
>
> Splitting up the text on my test file of 1.8 megabytes took 25
> seconds. The entire script took 50 seconds.
>
> I had written a previous uncompleted version in which I relied on
> regular expressions rather than tokens, and this script took only
> 10 seconds to run. I gave up on this method because it seemed
> there would always be an exception that would require another
> regexp.
>
> So why does splitting a text into tokens take so long? Has
> anybody done something similar to what I am trying, and do you
> have any advice?
>
> The good news is that relatively speaking, perl is very, very
> fast. I tried a similar script in python using a lexer called
> plex, and the 1.8 megabyte file took 12 minutes to parse!
>
> In case you are wondering why I'm seemingly obsessed with speed,
> I would like to make this script available to anyone. Right now
> the only free utilities for converting rtf to xml are a java
> utility called majix, which deletes your footnotes and only allows
> for 9 user-defined styles. If my perl script is too slow, it won't be
> very useful.
>
> Thanks
>
> Paul
>
> --
>
> ************************
> *Paul Tremblay         *
> *[EMAIL PROTECTED]*
> ************************