Please search for yourself first! Searching "scrape JSON from web" at the rseek.org site produced what appeared to be several relevant hits, especially this CRAN task view: https://cran.r-project.org/web/views/WebTechnologies.html
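To your question of whether having the API makes scraping unnecessary: for the comment text it generally should. Below is a minimal sketch of an API call using httr and jsonlite (both covered in that task view). Caveat: the endpoint, the parameter names (dktid, dct, rpp, po), the "documents" field, and the per-page limit are my assumptions from the regulations.gov v3 API of that era, so verify them against the developer documentation before relying on them.

    ## A minimal sketch, assuming the regulations.gov v3 REST API;
    ## check the endpoint and parameter names against the developer docs.
    library(httr)
    library(jsonlite)

    api_key <- "YOUR_KEY_HERE"  # the key you are waiting to have activated

    ## Ask for public submissions (comments) in the Title IX docket,
    ## 25 results per page, starting at page offset 0.
    resp <- GET(
      "https://api.data.gov/regulations/v3/documents.json",
      query = list(
        api_key = api_key,
        dktid   = "ED-2018-OCR-0064",  # the docket ID from your URL
        dct     = "PS",                # PS = public submission
        rpp     = 25,
        po      = 0
      )
    )
    stop_for_status(resp)

    ## JSON is just structured text; fromJSON() turns it into R lists
    ## and data frames, so there is no HTML to parse at all.
    docs <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
    str(docs, max.level = 2)  # inspect what came back

    ## To tabulate all ~11,000 comments, page through by incrementing
    ## 'po' and bind the results. The "$documents" field name and the
    ## 1000-per-page limit are assumptions -- confirm them first.
    fetch_page <- function(po) {
      r <- GET("https://api.data.gov/regulations/v3/documents.json",
               query = list(api_key = api_key, dktid = "ED-2018-OCR-0064",
                            dct = "PS", rpp = 1000, po = po))
      stop_for_status(r)
      fromJSON(content(r, as = "text", encoding = "UTF-8"))$documents
    }
    pages <- lapply(seq(0, 11000, by = 1000), fetch_page)
    all_comments <- do.call(rbind, pages)
    write.csv(all_comments, "titleIX_comments.csv", row.names = FALSE)

Start with a single small request and str() the result before attempting the full loop; once you can see the structure, deciding which columns belong in your csv is straightforward.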
Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Tue, Feb 19, 2019 at 3:07 PM Drake Gossi <drake.go...@gmail.com> wrote: > Hello everyone, > > I will be using R to manipulate this data > < > https://www.regulations.gov/docketBrowser?rpp=25&so=DESC&sb=commentDueDate&po=0&dct=PS&D=ED-2018-OCR-0064 > >. > Specifically, it's proposed changes to Title IX--over 11,000 publicly > available comments. So, the end goal is for me to tabulate each of these > 11,000 comments in a csv file, so I can begin to manipulate and visualize > the data. > > But I'm not there yet. I just put in for an API key and, while I have one, > I'm waiting for it to be activated. After that, though, I'm a little lost. > Do I need to scrape the comments from the site? Or does having the API > render that unnecessary? There is this interface > <https://regulationsgov.github.io/developers/console/> that works with the > API, but I don't know if, though it, I can get the data I need. I'm still > trying to figure out what JSON is. > > Or, if I have to scrape the comments, can I do that with R? I can't get a > straight answer from the python people. I can't tell if I need to do this > through beautiful soup or through scrapy (or even if I need to do it at > all, as I said...). The trouble with the comments is, they are each on > their own URL, so--and again this is assuming that I will have to scrape > them--I don't know how to code in order to grab all of the comments from > all of the URLs. > > I also am trying to figure out how to isolate the essence of the comments > in the html. From the python people, I've heard the following: > > scrapy fetch 'url' > will download the raw page you are interested in. And you can look at > the raw source code. Important to appreciate that what you see in the > browser is often processed in your browser before you see it. > > Of course, a scraper can do the same processing, but it's complicated. > So, start by looking at the raw source code. Maybe you can grab what you > need with simple parsing like Beautiful Soup does. Maybe you need to do > more. Scrapy is your friend. > > Beautiful soup is your friend here. It can analyze the data within > the html tags on your scraped page. But often javascript is used on > 'modern' web pages so the page is actually not just html, but > javascript that changes the html. For this you need another tool -- i > think one is called scrapy. Others here probably have experience with > that. > > I think part of my problem relates to that yellow part. I was saying things > like > > I think what I might be looking for is a div class = GIY1LSJIXD, since > that's where the hierarchy seems to taper off in the html for the comment > I'm looking to scrape. > > > What I'm trying to do here is, locate the comment in the html so I can tell > the request function to extract it. > > Any help anyone could offer here would be much appreciated. I'm very lost. > > Drake > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. 
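P.S. If some comment text turns out not to be available through the API and you do have to scrape, you can do that in R with rvest (also in the task view). A minimal sketch, with heavy caveats: the per-comment URLs below are hypothetical placeholders, and "div.GIY1LSJIXD" is just the class you spotted -- machine-generated names like that tend to change, and if the page is assembled by javascript the comment text may not be in the raw HTML at all (in which case look at RSelenium, also listed in the task view). Verify both assumptions first, e.g. by reading one URL's raw source.

    ## A minimal rvest sketch, assuming (1) you have a vector of
    ## per-comment URLs and (2) the comment text is present in the raw
    ## HTML under that div class. Both are assumptions to check first.
    library(rvest)

    comment_urls <- c(  # hypothetical; build this list from the API results
      "https://www.regulations.gov/document?D=ED-2018-OCR-0064-0001",
      "https://www.regulations.gov/document?D=ED-2018-OCR-0064-0002"
    )

    get_comment <- function(url) {
      Sys.sleep(1)                 # be polite to the server
      page <- read_html(url)
      ## the selector is the class you identified; it may well be unstable
      node <- html_node(page, "div.GIY1LSJIXD")
      html_text(node, trim = TRUE)
    }

    comments <- vapply(comment_urls, get_comment, character(1))
    write.csv(data.frame(url = comment_urls, comment = comments),
              "titleIX_comments_scraped.csv", row.names = FALSE)

Either way, test on two or three comments before looping over all 11,000.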