Re: [R] regex - extracting src url

Martin Morgan Tue, 22 Mar 2016 03:29:58 -0700


On 03/22/2016 12:44 AM, Omar André Gonzáles Díaz wrote:

Hi, I have a DF with a column containing HTML, like this:

<IMG SRC="
https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?";
BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">


I need to get this:


https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?


I've got this so far:


https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\";
BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement


This is the code I've used:

carreras_normal$Impression.Tag..image. <-
  gsub("<img.+?src=[\"'](.*?)[\"'].*?>", "\\1",
       carreras_normal$Impression.Tag..image., ignore.case = TRUE)



*But I still need to get rid of this part:*


https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=
?*\" BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement*


Thank you for your help.


You're querying an XML string, so use XPath, e.g., via the XML package:

> library(XML)
> as.character(xmlParse(y)[["//IMG/@SRC"]])

[1] "https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"

`xmlParse()` translates the character string into an XML document. `[[` subsets the document to extract a single element. "//IMG/@SRC" follows the XPath specification (this section https://www.w3.org/TR/xpath-31/#abbrev of the specification provides a quick guide) to find, starting from the 'root' of the document, a node, at any depth, labeled IMG containing an attribute labeled SRC.
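
For anyone who wants to try the one-liner, here is a minimal sketch. The value of `y` is not shown in this thread, so the string below is a made-up stand-in (example.com URL, self-closed tag, no stray `;` after the SRC value), chosen to be well-formed because `xmlParse()` expects well-formed XML. It also assumes the XML package is installed.

```r
library(XML)  # assumption: the XML package is installed

# Hypothetical, well-formed stand-in for the tag in the question.
y <- paste0(
  '<IMG SRC="https://example.com/track;ord=[timestamp];dc_lat=?" ',
  'BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement"/>'
)

# asText = TRUE makes it explicit that y holds XML text, not a file name.
doc <- xmlParse(y, asText = TRUE)

# `[[` with an XPath string returns the first match; here, the SRC attribute.
src <- as.character(doc[["//IMG/@SRC"]])
print(src)
```

Passing `asText = TRUE` is optional when the string starts with `<`, but it avoids any ambiguity between a file name and inline XML.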


A variation, if there were several IMG tags to be extracted, would be

  xpathSApply(xmlParse(y), "//IMG/@SRC", as.character)
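
A small sketch of that variation, under the same assumptions as before: the XML package is installed, and the input is well-formed. The document below is invented for illustration; the `<ads>` root element and the example.com URLs are not from the original data.

```r
library(XML)  # assumption: the XML package is installed

# Hypothetical document with two IMG tags under a made-up root element.
y <- '<ads>
  <IMG SRC="https://example.com/a.gif" ALT="first"/>
  <IMG SRC="https://example.com/b.gif" ALT="second"/>
</ads>'

doc <- xmlParse(y, asText = TRUE)

# Apply as.character() to every SRC attribute matched by the XPath query.
srcs <- xpathSApply(doc, "//IMG/@SRC", as.character)

# Drop any names carried over from the attribute nodes.
srcs <- unname(srcs)
print(srcs)
```

The result has one entry per matching IMG tag, in document order.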


Omar Gonzáles.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.