On Sat, Jul 13, 2002 at 10:57:04PM -0400, Tanton Gibbs wrote:
>
> 
> I'm not exactly sure what the problems are; however, here are a couple of
> things to try
> 1.) If you don't need to save the value of each of the subexpressions, then
> tell perl so by using ?: after each opening paren.

Once I tokenize the text, I don't use regular expressions at all.

Here is an example of an rtf line:


\pard \s1\fi720 \ldblquote Big guy, I didn\rquote t expect to see you so early,
\rdblquote  Joe said. \par

Here are the tokens:

'\pard':'\s1':'\fi720':'\ldblquote':' Big guy, I didn':
'\rquote':'t expect to see you so early,':'\rdblquote':
'  Joe said. ':'\par'

Each of the escaped sequences represents some type of info that I
have to decide what to do with. I use the substr function to
determine the nature of the token.

Actually, I have simplified the list of tokens. My split
function actually produced 31 empty ('') tokens for this one
line. So perl is doing a lot of searching.
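
For anyone following along, here is a rough sketch of that kind of
tokenizing. It is not my actual code: the split pattern (backslash,
letters, optional number) and the variable names are assumptions made
up for the example. It shows where the empty tokens come from and how
substr can sort control words from text.

```perl
use strict;
use warnings;

my $line = '\pard \s1\fi720 \ldblquote Big guy, I didn\rquote t expect to see you so early,\rdblquote  Joe said. \par';

# The capturing parentheses make split return the control words too.
# Pattern is an assumption: backslash, letters, optional signed number.
my @raw = split /(\\[a-z]+-?\d*)/, $line;

# Adjacent control words (and one at the very start) leave empty
# strings in the list; drop them.
my @tokens = grep { length } @raw;

for my $token (@tokens) {
    # A token that begins with a backslash is a control word.
    my $kind = substr($token, 0, 1) eq '\\' ? 'control' : 'text';
    print "$kind: '$token'\n";
}
```

The grep is cheap, but every empty string it throws away is still
work that split had to do first.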


> 2.) Usually alternation is much slower than doing separate
> regexes...however, in your case separating the regexes is seemingly
> impossible.

I'm not sure what alternation is. But now I am thinking that
regexes are really not at all impossible. Perhaps they require a
little more thought. Difficulty is not what stopped me from using
them. I thought that as I encountered more complex rtf, from
different (and insidious) versions of Word, I would have to
tweak my code so much that I wouldn't be able to maintain it.
However, on second thought, I don't think the problem is that
complicated.

Let's take a look at the line above. It starts with "\pard". This
means "start a paragraph with a new style." The style information
is stored in the escaped sequences afterwards. So the style here is
"\s1" (style 1) plus "\fi720". The fi means "first indent," here
by 36 pts (720 twips at 20 twips per point).
There are a zillion other tokens, and I don't understand all of
them. What I need to know is when the text starts. I could
just look for non-escaped text. But the '\ldblquote' actually
marks the start of the text because it means "left quote."

You can start to see some of the complexities and why I thought
it better to handle one token at a time. However, I was just
playing around with perl. I substituted every instance of
\ldblquote and 4 other control sequences (right quote, em-dash,
tab, and right curly). That only took 4 seconds for a 1.8
megabyte document.

So I am thinking of doing the simple substitutions first, and
then proceeding. For example, if I substitute

s/\\ldblquote/<lft_quote\/>/g;
s/\\rdblquote/<rt_quote\/>/g;

then my line looks like this:

\pard \s1\fi720 <lft_quote/> Big guy, I didn\rquote t expect to see you so early,
<rt_quote/>  Joe said. \par 
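
A sketch of how these simple substitutions might be driven from one
table rather than one statement per control word. The hash name and
layout are assumptions for illustration; only the two replacements
above come from my test. The "|" that joins the names inside the
pattern is, as far as I can tell, the alternation mentioned earlier.

```perl
use strict;
use warnings;

# Control word (without the backslash) => replacement text.
# %replace is a hypothetical name; extend the table as needed.
my %replace = (
    ldblquote => '<lft_quote/>',
    rdblquote => '<rt_quote/>',
);

# Build "ldblquote|rdblquote" once, outside the substitution.
my $alternatives = join '|', map { quotemeta } sort keys %replace;

my $line = '\pard \s1\fi720 \ldblquote Big guy, I didn\rquote t expect to see you so early,\rdblquote  Joe said. \par';

# \b keeps a short name from matching the front of a longer one.
$line =~ s/\\($alternatives)\b/$replace{$1}/g;
print "$line\n";
```

With a table like this the text is scanned once, however many control
words are listed.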

Now I can substitute:

s/\\pard(.*?)\s(?=[^\\])/<para style="$1">/;
    # \pard, then the style codes, then a space followed by
    # any character that is not a backslash (the lookahead
    # leaves that character in the text)
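
A small test of the paragraph substitution on a shortened version of
the sample line. Two things in it are my own assumptions: the \s*
after \pard, which keeps the leading space out of the style
attribute, and the lookahead, which checks the character after the
space without swallowing it.

```perl
use strict;
use warnings;

my $line = '\pard \s1\fi720 <lft_quote/> Big guy, I didn\rquote t expect';

# (.*?) lazily collects the style codes. \s(?=[^\\]) is a space
# followed by a non-backslash character that is checked, not consumed.
$line =~ s/\\pard\s*(.*?)\s(?=[^\\])/<para style="$1">/;
print "$line\n";
```

The style attribute still holds raw control words at this point;
turning them into a readable style name would be a later step.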


The most difficult part will be dealing with footnotes. They look
something like this:

{\footnote \pard \fi720 {\i italics word} text {\b bold words}}

This line contains a nested structure, and I have to determine
when it ends, because the paragraph styles in a footnote are
independent of the styles in the main body. For this I will have
to use //g as you suggested, and keep a running count of the open
and closed brackets until the count returns to zero.
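
Here is a minimal sketch of that bracket counting with //g. The
subroutine name balanced_group and the surrounding "before"/"after"
text are made up for the example. It also assumes that \{, \} and \\
are literal characters in rtf and skips them.

```perl
use strict;
use warnings;

my $text = 'before {\footnote \pard \fi720 {\i italics word} text {\b bold words}} after';

# Return the brace-balanced group that starts at offset $start,
# or nothing if the braces never balance.
sub balanced_group {
    my ($str, $start) = @_;
    my $depth = 0;
    pos($str) = $start;                       # begin scanning at the open brace
    while ($str =~ /\\[{}\\]|([{}])/g) {
        next unless defined $1;               # escaped character, not a real brace
        $depth += $1 eq '{' ? 1 : -1;
        return substr($str, $start, pos($str) - $start) if $depth == 0;
    }
    return;
}

my $start    = index($text, '{\footnote');
my $footnote = balanced_group($text, $start);
print "$footnote\n";
```

Everything inside the returned string can then be handled with its
own paragraph styles, separately from the main body.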

One last note on why I think I can change my strategy. An rtf
line can look like this:

\pard He was reading {
\i The Sun Also Rises} when he heard the dog bark.\par

This line should look like 

\pard He was reading {\i The Sun Also Rises} ...

In other words, rtf is so screwed up that it even splits groups
across lines. However, I just read the Perl Cookbook and realized
I can do this:

$/ = "\\par";       # input record separator: one record per \par
while (<>) {
    s/\n//g;        # get rid of line endings. This will work. The only
                    # line ending should come at the \par delimiter
    # ... process the record ...
}
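
A self-contained sketch of reading by paragraph. One caution: the
record separator is a plain string, not a regex, so "\\par" on its
own also matches the first four characters of \pard. The sketch
therefore assumes every \par sits at the end of its line and uses
"\\par\n" as the separator. The sample text and variable names are
made up, and an in-memory filehandle stands in for the real file.

```perl
use strict;
use warnings;

my $rtf = "\\pard He was reading {\n"
        . "\\i The Sun Also Rises} when he heard the dog bark.\\par\n"
        . "\\pard Next paragraph.\\par\n";

my @paragraphs;
{
    local $/ = "\\par\n";    # assumption: \par always ends its line
    open my $fh, '<', \$rtf or die "cannot open in-memory file: $!";
    while (my $record = <$fh>) {
        $record =~ s/\n//g;            # join lines split inside the paragraph
        push @paragraphs, $record if length $record;
    }
    close $fh;
}
print "$_\n" for @paragraphs;
```

If \par can be followed by other text on the same line, this
separator will not work, and reading the whole file and splitting
with a regex would be the safer route.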

Also, rtf does this 

\pard {\i The Sun Also Rises \par
}

I have to read the whole file in and switch it so it reads:

\pard {\i The Sun Also Rises} \par
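
Something like the following is what I mean. It is a sketch on a
one-paragraph string rather than a whole file, and the pattern
assumes the stray bracket is the first character on the line right
after \par.

```perl
use strict;
use warnings;

my $rtf = "\\pard {\\i The Sun Also Rises \\par\n}\n";

# Pull a closing bracket that sits alone after \par back in front
# of it, dropping the line break in between.
$rtf =~ s/\s*\\par[ \t]*\n\}/} \\par/g;
print $rtf;
```

On a real file the text would first be slurped into one string, for
example by setting $/ to undef with local before reading.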

I tried this on my big document, and it took only 4 tenths of a
second.

In sum, I am thinking that regexes are so fast in perl
that if I choose carefully what to substitute first, I can parse
my document much faster.

Thanks!


-- 

************************
*Paul Tremblay         *
*[EMAIL PROTECTED]*
************************
