Maybe you should check these out:
http://search.cpan.org/search?dist=RTF-Tokenizer
http://search.cpan.org/search?dist=RTF-Parser
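
If you would rather keep your own tokenizer, here is a rough sketch of
one alternative to the split line quoted below. When a split pattern
contains several capturing groups, every match adds one field per group
to the result, and the groups that did not take part come back as undef,
so the token list fills up with entries that have to be skipped later.
A single global match in list context returns only the matched tokens.
The name tokenize_rtf and the token classes are my own assumptions for
illustration, not taken from your script or from the CPAN modules above,
and I have not timed it against your 1.8 megabyte file.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): return the RTF
# tokens found in a string, using one global match instead of split().
sub tokenize_rtf {
    my ($text) = @_;
    my @tokens = $text =~ m{
          \\[a-zA-Z]+-?\d*[ ]?   # control word, optional number, optional delimiter space
        | \\'[0-9a-fA-F]{2}      # hex-escaped character
        | \\[^a-zA-Z]            # control symbol such as an escaped backslash or brace
        | [{}]                   # group open or close
        | [^\\{}]+               # run of plain text
    }gx;
    return @tokens;
}

my @tokens = tokenize_rtf('{\rtf1\ansi{\b Hello} world\par}');
print join('|', @tokens), "\n";
# prints: {|\rtf1|\ansi|{|\b |Hello|}| world|\par|}
```

Note that the delimiter space after a control word stays attached to
that token in this sketch; strip it if the rest of the script expects
bare control words.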

On Sat, 2002-07-13 at 19:35, Paul Tremblay wrote:
> I just finished my first version of a script that converts rtf to
> xml and was wondering if I went about writing it the wrong way.
> 
> My method was to read in one line at a time and split the lines
> into tokens, and then to read one token at a time. I used this
> line to split up the text:
>  
> @tokens = split(/({\\[^\s\n\}{]+)|(\\[^\s\n\\}]+)|(\\\\)|(})|(\\})/,$line);
> 
> Splitting up the text on my test file of 1.8 megabytes took 25
> seconds. The entire script took 50 seconds. 
> 
> I had written a previous uncompleted version in which I relied on
> regular expressions rather than tokens, and this script took only
> 10 seconds to run. I gave up on this method because it seemed
> there would always be an exception that would require another
> regexp.
> 
> So why does splitting a text into tokens take so long? Has
> anybody done something similar to what I am trying, and do you
> have any advice? 
> 
> The good news is that relatively speaking, perl is very, very
> fast. I tried a similar script in python using a lexer called
> plex, and the 1.8 megabyte file took 12 minutes to parse!
> 
> In case you are wondering why I'm seemingly obsessed with speed,
> I would like to make this script available to anyone. Right now
> the only free utility for converting rtf to xml is a java
> utility called majix, which deletes your footnotes and only allows
> for 9 user-defined styles. If my perl script is too slow, it won't be
> very useful.
> 
> Thanks
> 
> Paul
> 
> 
> -- 
> 
> ************************
> *Paul Tremblay         *
> *[EMAIL PROTECTED]*
> ************************
> 
> -- 
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]



