See the parse_url function in the httr package. It does all this and more.
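For example, here's a minimal sketch of what I mean (assuming httr is
installed; as far as I know parse_url() handles one URL at a time, so wrap
it in sapply() for a vector):

library(httr)  # for parse_url()

url <- c("http://www.mdd.com/food/pizza/index.html",
         "http://www.mdd.com/build-your-own/index.html",
         "http://www.mdd.com/special-deals.html",
         "http://www.genius.com/find-a-location.html",
         "http://www.google.com/hello.html")

# hostname component, e.g. "www.mdd.com"
sapply(url, function(u) parse_url(u)$hostname)

# path component, e.g. "food/pizza/index.html" (no leading slash, if I
# remember right)
sapply(url, function(u) parse_url(u)$path)
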
On Mar 6, 2014 2:45 PM, "Sarah Goslee" <sarah.gos...@gmail.com> wrote:

> There are many ways to do this. Here's a simple version and a slightly
> fancier version:
>
>
> url = c("http://www.mdd.com/food/pizza/index.html";,
> "http://www.mdd.com/build-your-own/index.html";,
> "http://www.mdd.com/special-deals.html";,
> "http://www.genius.com/find-a-location.html";,
> "http://www.google.com/hello.html";)
>
>
> url2 = c("http://www.mdd.com/food/pizza/index.html";,
> "https://www.mdd.com/build-your-own/index.html";,
> "http://www.mdd.edu/special-deals.html";,
> "http://www.genius.gov/find-a-location.html";,
> "http://www.google.com/hello.html";)
>
>
> parse1 <- function(x) {
>     # will work for https as well as http
>     x <- sub("^http[s]*:\\/\\/", "", x)
>     x <- sub("^www\\.", "", x)
>     strsplit(x, "/")[[1]][1]
> }
>
> parse2 <- function(x) {
>     # if you're sure it will always be .com
>     strsplit(x, "\\.com")[[1]][2]
> }
>
> parse2a <- function(x) {
>     # one way to split at any three-letter extension
>     # assumes !S! won't appear in the URLs
>     x <- sub("\\.[a-z]{3}/", "!S!/", x)
>     strsplit(x, "!S!")[[1]][2]
> }
>
> sapply(url, parse1)
> sapply(url, parse2)
>
> sapply(url2, parse1)
> sapply(url2, parse2a)
>
>
> Sarah
>
> On Thu, Mar 6, 2014 at 12:23 PM, Abraham Mathew <abmathe...@gmail.com>
> wrote:
> > Let's say that I have the following character vector with a series of url
> > strings. I'm interested in extracting some information from each string.
> >
> > url = c("http://www.mdd.com/food/pizza/index.html";, "
> > http://www.mdd.com/build-your-own/index.html";,
> >         "http://www.mdd.com/special-deals.html";, "
> > http://www.genius.com/find-a-location.html";,
> >         "http://www.google.com/hello.html";)
> >
> > - First, I want to extract the domain name (ending in .com). After
> > struggling with this for a while, reading some regular expression
> > tutorials, and reading through Stack Overflow, I came up with the
> > following solution. Perfect!
> >
> >> parser <- function(x) gsub("www\\.", "", sapply(strsplit(gsub("http://", "", x), "/"), "[[", 1))
> >> parser(url)
> > [1] "mdd.com"    "mdd.com"    "mdd.com"    "genius.com" "google.com"
> >
> > - Second, I want to extract everything after .com in the original url.
> > Unfortunately, I don't know the proper regular expression to get the
> > desired result. Can anyone help?
> >
> > The output should be:
> > /food/pizza/index.html
> > /build-your-own/index.html
> > /special-deals.html
> >
> > If anyone has a solution using the stringr package, that'd be of interest
> > also.
> >
> >
> > Thanks.
> >
>
> --
> Sarah Goslee
> http://www.functionaldiversity.org
>
