I just finished my first version of a script that converts rtf to
xml and was wondering if I went about writing it the wrong way.

My method was to read in one line at a time and split the lines
into tokens, and then to read one token at a time. I used this
line to split up the text:
 
@tokens = split(/({\\[^\s\n\}{]+)|(\\[^\s\n\\}]+)|(\\\\)|(})|(\\})/,$line);
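For comparison, here is a minimal sketch of the same split written with one capturing group instead of five. With five groups, split returns five extra fields for every delimiter it finds, and the groups that did not match come back as undef. The sub name tokenize_line and the sample line are made up for illustration, and I have not measured whether this is any faster on the 1.8 megabyte file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): split one line
# of RTF into tokens using the same alternatives as above, but inside a
# single capturing group so each delimiter yields exactly one field.
sub tokenize_line {
    my ($line) = @_;
    my @fields = split /(\{\\[^\s\n\}{]+|\\[^\s\n\\}]+|\\\\|\}|\\\})/, $line;
    # Drop the empty strings split leaves between adjacent delimiters.
    # length() is used rather than truthiness so a token of "0" survives.
    return grep { defined($_) && length($_) } @fields;
}

# Made-up sample line, for illustration only.
my @tokens = tokenize_line('{\rtf1\ansi Hello \b world\b0}');
print join('|', @tokens), "\n";
```

Note that the first alternative allows backslashes inside the token, so a run such as {\rtf1\ansi comes back as one token, exactly as it does with the original pattern.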

Splitting up the text on my test file of 1.8 megabytes took 25
seconds. The entire script took 50 seconds. 
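A rough sketch of how the split phase can be timed on its own, using the core Time::HiRes module. The sub name time_phase, the generated stand-in lines, and the simplified pattern are assumptions for illustration; they are not the actual script or test file.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run a code reference and return the elapsed
# wall-clock time in seconds.
sub time_phase {
    my ($code) = @_;
    my $start = time();
    $code->();
    return time() - $start;
}

# Stand-in data; a real run would read the lines of the RTF file instead.
my @lines = map { "{\\rtf1\\ansi line $_ \\b bold\\b0}" } 1 .. 1000;

my $count = 0;
my $elapsed = time_phase(sub {
    for my $line (@lines) {
        # Simplified pattern, for illustration only.
        my @tokens = split /(\\[^\s\\}]+|\})/, $line;
        $count += @tokens;
    }
});

printf "split phase: %.3f s, %d fields\n", $elapsed, $count;
```

Wrapping the token-processing loop in the same helper would show how the remaining time is spent.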

I had written a previous uncompleted version in which I relied on
regular expressions rather than tokens, and this script took only
10 seconds to run. I gave up on this method because it seemed
there would always be an exception that would require another
regexp.

So why does splitting a text into tokens take so long? Has
anybody done something similar to what I am trying, and do you
have any advice? 

The good news is that, relatively speaking, perl is very, very
fast. I tried a similar script in python using a lexer called
plex, and the 1.8 megabyte file took 12 minutes to parse!

In case you are wondering why I'm seemingly obsessed with speed,
I would like to make this script available to anyone. Right now
the only free utility for converting rtf to xml is a java
utility called majix, which deletes your footnotes and allows
only 9 user-defined styles. If my perl script is too slow, it won't be
very useful.

Thanks

Paul


-- 

************************
*Paul Tremblay         *
*[EMAIL PROTECTED]*
************************
