Maybe you should check these out:
http://search.cpan.org/search?dist=RTF-Tokenizer
http://search.cpan.org/search?dist=RTF-Parser
On Sat, 2002-07-13 at 19:35, Paul Tremblay wrote:
> I just finished my first version of a script that converts RTF to
> XML and was wondering if I went about writing it the wrong way.
>
> My method was to read in one line at a time and split the lines
> into tokens, and then to read one token at a time. I used this
> line to split up the text:
>
> @tokens = split(/({\\[^\s\n\}{]+)|(\\[^\s\n\\}]+)|(\\\\)|(})|(\\})/,$line);
>
> Splitting up the text of my 1.8 megabyte test file took 25
> seconds. The entire script took 50 seconds.
>
> I had written an earlier, unfinished version in which I relied on
> regular expressions rather than tokens, and that script took only
> 10 seconds to run. I gave up on this method because it seemed
> there would always be an exception that would require another
> regexp.
>
> So why does splitting a text into tokens take so long? Has
> anybody done something similar to what I am trying, and do you
> have any advice?
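
One likely contributor, offered as a guess rather than a measurement: when
the pattern given to split contains capturing groups, split adds one list
element per group for every separator it finds, and a group that did not
take part in the match comes back as undef. With five groups, the list is
therefore several times longer than the number of real tokens and has to
be filtered afterwards. A minimal sketch of an alternative is a single
scanning loop built on \G and the /gc modifiers. The sub name tokenize_rtf
is hypothetical, and the patterns cover only a simplified subset of RTF
(braces, escaped characters, control words with an optional numeric
parameter, control symbols, plain text):

```perl
use strict;
use warnings;

# Hypothetical helper: turn one chunk of RTF into a flat list of tokens.
# Simplified on purpose; raw newlines are kept inside text tokens here.
sub tokenize_rtf {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
    while (pos($text) < length $text) {
        if ($text =~ /\G([{}])/gc) {                   # group open or close
            push @tokens, $1;
        }
        elsif ($text =~ /\G(\\[\\{}])/gc) {            # escaped \, { or }
            push @tokens, $1;
        }
        elsif ($text =~ /\G(\\[a-zA-Z]+-?\d*) ?/gc) {  # control word; one trailing
            push @tokens, $1;                          # space is only a delimiter
        }
        elsif ($text =~ /\G(\\.)/gcs) {                # control symbol, e.g. \~ or \*
            push @tokens, $1;
        }
        elsif ($text =~ /\G([^\\{}]+)/gc) {            # run of plain text
            push @tokens, $1;
        }
        else {                                         # lone backslash at end of input
            push @tokens, substr($text, pos($text));
            last;
        }
    }
    return @tokens;
}

my @tokens = tokenize_rtf('{\rtf1\ansi{\b Hello} \{world\}}');
print join('|', @tokens), "\n";
```

Each pass through the loop consumes exactly one token starting at the
current position, so no empty or undef elements are produced. Whether this
is actually faster on a 1.8 megabyte file would need to be timed, for
example with the Benchmark module.
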
>
> The good news is that, relatively speaking, Perl is very, very
> fast. I tried a similar script in Python using a lexer called
> Plex, and the 1.8 megabyte file took 12 minutes to parse!
>
> In case you are wondering why I'm seemingly obsessed with speed,
> I would like to make this script available to anyone. Right now
> the only free utility for converting RTF to XML is a Java
> utility called Majix, which deletes your footnotes and only allows
> for 9 user-defined styles. If my Perl script is too slow, it won't be
> very useful.
>
> Thanks
>
> Paul
>
>
> --
>
> ************************
> *Paul Tremblay *
> *[EMAIL PROTECTED]*
> ************************
>
> --
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]