Re: HTML to Text

JupiterHost.Net Wed, 03 May 2006 13:37:50 -0700

And why not post an example of your catch to illustrate it for the benefit of the list?
Because I was busy and I knew you would do it ;-)


Hee hee, yeah true enough :)

But if you know "this exact block of HTML", how about:

my @strings = ( "string 1", "string 2", ... );
Because most likely the string he is trying to grab will be changing; why else would he be trying to parse them out?



Right.  So in fact you don't know "this exact block of HTML".  Now,
based on what we do know your solution will probably work.  But we don't
really know too much.  What if "this exact block of HTML" contained

  <p>h<!--</p>-->a</p>

for example?  Yeah, I know, that'll never happen.

No, it probably would happen, so with the regex as is you'd get 'h<!--' in that case.
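
For the casual reader, here is a minimal sketch of that failure. The regex under discussion is not quoted in this message, so the non-greedy pattern below is only my assumption of what it looked like:

```perl
use strict;
use warnings;

# The awkward block from above: a </p> hidden inside an HTML comment.
my $html = '<p>h<!--</p>-->a</p>';

# Assumed pattern (not quoted in this thread): grab everything between
# the first <p> and the nearest </p>.
my ($text) = $html =~ m{<p>(.*?)</p>};

# The match stops at the </p> inside the comment, so only part of the
# paragraph is captured.
printf "captured: %s\n", $text;
```

The regex has no idea that the first </p> sits inside a comment, which is why it stops early.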

The thing is, the OP seemed to be sure that the <p></p> would be one pair on a single line (according to readmind()).

That's where his posting a complete model of his needs would have been infinitely more workable and have avoided such wanton recommendations ;)

So I guess the moral is that it will work as expected assuming you can guarantee that your PP will always be on a line by itself with no P in it :)

Otherwise, you need to use some HTML parsing module, since it will get complex quickly.
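
As one possible illustration of the module route, here is a sketch using HTML::Parser from CPAN (not a core module, so it has to be installed). The helper name paragraph_texts is made up for this example, and it deliberately ignores nested or unclosed <p> tags:

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, not part of core Perl

# Hypothetical helper: return the text of each <p> element.
sub paragraph_texts {
    my ($html) = @_;
    my @paragraphs;
    my $in_p = 0;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start a new, empty paragraph whenever a <p> opens.
        start_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'p';
            $in_p = 1;
            push @paragraphs, '';
        }, 'tagname' ],
        # Stop collecting when the matching </p> is seen.
        end_h => [ sub {
            my ($tag) = @_;
            $in_p = 0 if $tag eq 'p';
        }, 'tagname' ],
        # Only real text arrives here; comments are reported through a
        # separate comment handler, which this sketch does not install.
        text_h => [ sub {
            my ($text) = @_;
            $paragraphs[-1] .= $text if $in_p;
        }, 'dtext' ],
    );

    $parser->parse($html);
    $parser->eof;
    return @paragraphs;
}

print "$_\n" for paragraph_texts('<p>h<!--</p>-->a</p>');
```

Because the parser treats the comment as a comment, the </p> inside it should not end the paragraph the way it does with a bare regex.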

Or how about a solution involving "links -dump" ?


ATTN casual readers: *That is the worst idea ever* don't do it!


I'm not sure it's quite that bad.  I might have suggested using Java.


LOL, nice, I like you Paul you're a funny guy!

a) it's not Perl, it's a system command
b) it's not portable by any means (what if "links" is not in their path? what if "links" isn't even installed? what if "links" should have been "lynx"? what if the -dump flag on your OS's links needs to be --grab on their OS's links, etc. etc.?)
c) how does that help you get the string between the p tags in any useful form? (i.e. you still have to get that data out of the output of that command)
d) Hypothetical unknown behavior: what if it creates a temp file and is unable to remove it and it gets run a million times? Now you've potentially filled up the user's quota, potentially filled up a partition, etc. etc.
But what if chucking the output into a file does exactly what you want?

Then chuck it to a file (that was one reason I labeled it "hypothetical"). The point is that if a single regex or existing module will do what you need, use it instead of unnecessarily using external stuff that would *still* most likely need to be parsed into a usable data structure and could introduce all sorts of unknown problems.
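
To give a feel for the extra care an external call needs, here is a rough sketch of a defensive wrapper. The helper name dump_text is hypothetical, and the output would still have to be parsed afterwards:

```perl
use strict;
use warnings;

# Hypothetical helper: run an external command and return its output,
# or undef if it could not be started or exited with a non-zero status.
sub dump_text {
    my @cmd = @_;

    # List-form pipe open bypasses the shell, so the arguments are
    # passed to the command as-is. It fails if the command is missing.
    open(my $fh, '-|', @cmd) or return;

    local $/;            # slurp mode: read all output at once
    my $out = <$fh>;

    # close() on a pipe returns false if the command exited non-zero.
    close($fh) or return;

    return $out;
}

# Example call (commented out; assumes "links" is installed and that
# its -dump flag behaves as described in this thread):
# my $text = dump_text('links', '-dump', 'page.html');
# die "links failed or is not installed\n" unless defined $text;
```

Even with those checks in place you only have a blob of text, not the string between the p tags.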

  Slavish adherence to portability concerns shouldn't get in the way of
your getting your job done. (perlfaq5)

I wouldn't call the use of a built-in tool (regex or modules) slavish when the alternative is an external call with all its potential problems, especially if it's not just a one-time command.

And all because you didn't use Perl's most fundamental tool: regexes or one of the zillions of HTML parsing modules to get what you want into a data structure that is native to the script you want to use the data in.
In general parsing HTML with a regular expression is going to bite you.

Absolutely, but the OP insisted it was consistently like that and was so vague about it that the regex would do it, AFAI understood his needs.

That's also why I kept mentioning modules to do the parsing for you if it was anything but super basic.

You might find situations where it works, and I've even done so myself
(with XML rather than HTML) but I don't think anyone could call it
robust whilst keeping a straight face.


You're right, I tried it and now have the giggles, thanks a lot Paul! :)

