See the parse_url function in the httr package. It does all this and more.

On Mar 6, 2014 2:45 PM, "Sarah Goslee" <sarah.gos...@gmail.com> wrote:
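For concreteness, a minimal sketch of what httr::parse_url gives you (not from the thread; assumes the httr package is installed, with field names as in httr's documentation):

```r
# Illustrative sketch, not from the thread -- assumes httr is installed.
library(httr)

u <- parse_url("http://www.mdd.com/food/pizza/index.html")
u$hostname  # host component, e.g. "www.mdd.com"
u$path      # path component of the URL
```

parse_url splits a URL into a named list (scheme, hostname, path, query, and so on), so no hand-rolled regular expressions are needed.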
> There are many ways to do this. Here's a simple version and a slightly
> fancier version:
>
> url = c("http://www.mdd.com/food/pizza/index.html",
>         "http://www.mdd.com/build-your-own/index.html",
>         "http://www.mdd.com/special-deals.html",
>         "http://www.genius.com/find-a-location.html",
>         "http://www.google.com/hello.html")
>
> url2 = c("http://www.mdd.com/food/pizza/index.html",
>          "https://www.mdd.com/build-your-own/index.html",
>          "http://www.mdd.edu/special-deals.html",
>          "http://www.genius.gov/find-a-location.html",
>          "http://www.google.com/hello.html")
>
> parse1 <- function(x) {
>     # will work for https as well as http
>     x <- sub("^http[s]*:\\/\\/", "", x)
>     x <- sub("^www\\.", "", x)
>     strsplit(x, "/")[[1]][1]
> }
>
> parse2 <- function(x) {
>     # if you're sure it will always be .com
>     strsplit(x, "\\.com")[[1]][2]
> }
>
> parse2a <- function(x) {
>     # one way to split at any three-letter extension
>     # assumes !S! won't appear in the URLs
>     x <- sub("\\.[a-z]{3,3}\\/", "!S!\\/", x)
>     strsplit(x, "!S!")[[1]][2]
> }
>
> sapply(url, parse1)
> sapply(url, parse2)
>
> sapply(url2, parse1)
> sapply(url2, parse2a)
>
> Sarah
>
> On Thu, Mar 6, 2014 at 12:23 PM, Abraham Mathew <abmathe...@gmail.com> wrote:
> > Let's say that I have the following character vector with a series of url
> > strings. I'm interested in extracting some information from each string.
> >
> > url = c("http://www.mdd.com/food/pizza/index.html",
> >         "http://www.mdd.com/build-your-own/index.html",
> >         "http://www.mdd.com/special-deals.html",
> >         "http://www.genius.com/find-a-location.html",
> >         "http://www.google.com/hello.html")
> >
> > - First, I want to extract the domain name followed by .com. After
> > struggling with this for a while, reading some regular expression
> > tutorials, and reading through Stack Overflow, I came up with the
> > following solution. Perfect!
> >
> > > parser <- function(x) gsub("www\\.", "", sapply(strsplit(gsub("http://", "", x), "/"), "[[", 1))
> > > parser(url)
> > [1] "mdd.com" "mdd.com" "mdd.com" "genius.com" "google.com"
> >
> > - Second, I want to extract everything after .com in the original url.
> > Unfortunately, I don't know the proper regular expression to use in
> > order to get the desired result. Can anyone help?
> >
> > Output should be:
> > /food/pizza/index.html
> > /build-your-own/index.html
> > /special-deals.html
> >
> > If anyone has a solution using the stringr package, that'd be of interest
> > also.
> >
> > Thanks.
>
> --
> Sarah Goslee
> http://www.functionaldiversity.org
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
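The stringr solution the original poster asked about goes unanswered in this exchange; a hedged sketch using stringr::str_match (assumes the stringr package is installed; one capture group for the domain, one for the path):

```r
# Illustrative sketch, not from the thread -- assumes stringr is installed.
library(stringr)

url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.google.com/hello.html")

# full match lands in column 1, domain in column 2, path in column 3
m <- str_match(url, "^https?://(?:www\\.)?([^/]+)(/.*)$")
domains <- m[, 2]  # e.g. "mdd.com"
paths   <- m[, 3]  # e.g. "/food/pizza/index.html"
```

str_match returns a character matrix, so one pattern yields both pieces at once; unlike the .com-specific parse2 above, it handles any top-level domain, and the optional non-capturing (?:www\.)? group drops the "www." prefix when present.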