I have been following this thread, but many aspects of it are unclear to me. Who are the publishers? Who are the users? What is the problem? I have a vague sense of some of these, but it seems to me that one valuable starting place would be a document that clarifies these points. It is easier to tackle a concrete problem (e.g., agree on a standard numerical representation of dates and times à la ISO 8601) than something diffuse (e.g., information overload).
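To show how small a concrete problem of that kind can be, here is a minimal Python sketch that normalizes a few timestamp spellings to ISO 8601. The function name `to_iso8601`, the list of accepted input formats, and the assumption that every timestamp is in UTC are purely illustrative; none of them come from the databases mentioned in this thread.

```python
from datetime import datetime, timezone

# Hypothetical input formats that different publishers might use.
# The list is an assumption for illustration, not a survey of real sources.
KNOWN_FORMATS = (
    "%Y-%m-%d %H:%M:%S",  # e.g. 2012-01-14 10:21:23
    "%d/%m/%Y %H:%M",     # e.g. 14/01/2012 10:02
    "%Y%m%d",             # e.g. 20120114
)


def to_iso8601(text):
    """Return `text` as an ISO 8601 string, assuming the time is UTC."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # this format did not match, try the next one
        # Assumption: the source gives no time zone, so we label it UTC.
        return parsed.replace(tzinfo=timezone.utc).isoformat()
    raise ValueError("unrecognized date format: %r" % text)


if __name__ == "__main__":
    print(to_iso8601("14/01/2012 10:02"))
    print(to_iso8601("2012-01-14 10:21:23"))
```

Once everyone agrees on the target representation, the remaining work is only the per-source list of input formats, which is the sort of thing a shared document could pin down.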
Good luck,

Josh

On Sat, Jan 14, 2012 at 10:02 AM, Benjamin Weber <m...@bwe.im> wrote:
> Mike
>
> We see that the publishers are aware of the problem. They don't think
> that the raw data is usable for the user. Consequently they are
> recognizing this fact with the proprietary formats. Yes, they resign
> themselves to the information overload. That's pathetic.
>
> It is not a question of *which* data format, it is a question about
> the general concept. Where do publisher and user meet? There has to be
> one *defined* point which all parties agree on. I disagree with your
> statement that the publisher should just publish csv or cook his own
> API. That leads to fragmentation and inaccessibility of data. We want
> data to be accessible.
>
> A more pragmatic approach is needed to revolutionize the way we go
> about raw data.
>
> Benjamin
>
> On 14 January 2012 22:17, Mike Marchywka <marchy...@hotmail.com> wrote:
>>
>> LOL, I remember posting about this in the past. The US gov agencies vary but
>> most are quite good. The big problem appears to be people who push
>> proprietary or commercial "standards" for which only one effective source
>> exists. Some formats, like Excel and PDF, come to mind and there is a
>> disturbing trend towards their adoption in some places where raw data is
>> needed by many. The best thing to do is contact the information provider and
>> let them know you want raw data, not images or stuff that works in limited
>> commercial software packages. Often data sources are valuable and the revenue
>> model impacts availability.
>>
>> If you are just arguing over different open formats, it is usually easy for
>> someone to write some conversion code and publish it; CSV to JSON would not
>> be a problem, for example. Data of course are quite variable and there is
>> nothing wrong with giving the provider his choice.
>>
>> ----------------------------------------
>>> Date: Sat, 14 Jan 2012 10:21:23 -0500
>>> From: ja...@rampaginggeek.com
>>> To: r-help@r-project.org
>>> Subject: Re: [R] The Future of R | API to Public Databases
>>>
>>> Web services are only part of the problem. In essence, there are at
>>> least two facets:
>>> 1. downloading the data using some protocol
>>> 2. mapping the data to a common model
>>>
>>> Having #1 makes the import/download easier, but it really becomes useful
>>> when both are included. I think #2 is the harder problem to address.
>>> Software can usually be written to handle #1 by making a useful
>>> abstraction layer. #2 means that data has consistent names and meanings,
>>> and this requires people to agree on common definitions and a common
>>> naming convention.
>>>
>>> RDF (Resource Description Framework) and its related technologies
>>> (SPARQL, OWL, etc.) are one of the many attempts to address this.
>>> While this effort would benefit R, I think it's best if it's part of a
>>> larger effort.
>>>
>>> Services such as DBpedia and Freebase are trying to unify many data sets
>>> using RDF.
>>>
>>> The task view and package ideas are great ideas. I'm just adding another
>>> perspective.
>>>
>>> Jason
>>>
>>> On 01/13/2012 05:18 PM, Roy Mendelssohn wrote:
>>> > Hi Benjamin:
>>> >
>>> > What would make this easier is if these sites used standardized web
>>> > services, so it would only require writing once. data.gov is the worst
>>> > example, they spun their own, weak service.
>>> >
>>> > There is a lot of environmental data available through OPeNDAP, and that
>>> > is supported in the ncdf4 package.
>>> > My own group has a service called
>>> > ERDDAP that is entirely RESTful, see:
>>> >
>>> > http://coastwatch.pfel.noaa.gov/erddap
>>> >
>>> > and
>>> >
>>> > http://upwell.pfeg.noaa.gov/erddap
>>> >
>>> > We provide R (and Matlab) scripts that automate the extract for certain
>>> > cases, see:
>>> >
>>> > http://coastwatch.pfeg.noaa.gov/xtracto/
>>> >
>>> > We also have a tool called the Environmental Data Connector (EDC) that
>>> > provides a GUI from within R (and ArcGIS, Matlab and Excel) that allows you
>>> > to subset data that is served by OPeNDAP, ERDDAP, certain Sensor
>>> > Observation Service (SOS) servers, and have it read directly into R. It
>>> > is freely available at:
>>> >
>>> > http://www.pfeg.noaa.gov/products/EDC/
>>> >
>>> > We can write such tools because the service is either standardized
>>> > (OPeNDAP, SOS) or is easy to implement (ERDDAP).
>>> >
>>> > -Roy
>>> >
>>> >
>>> > On Jan 13, 2012, at 1:14 PM, Benjamin Weber wrote:
>>> >
>>> >> Dear R Users -
>>> >>
>>> >> R is a wonderful software package. CRAN provides a variety of tools to
>>> >> work on your data. But R is not apt to utilize all the public
>>> >> databases in an efficient manner.
>>> >> I observed the most tedious part with R is searching and downloading
>>> >> the data from public databases and putting it into the right format. I
>>> >> could not find a package on CRAN which offers exactly this fundamental
>>> >> capability.
>>> >> Imagine R is the unified interface to access (and analyze) all public
>>> >> data in the easiest way possible. That would create a real impact,
>>> >> would put R a big leap forward and would enable us to see the world
>>> >> with different eyes.
>>> >>
>>> >> There is a lack of a direct connection to the API of these databases,
>>> >> to name a few:
>>> >>
>>> >> - Eurostat
>>> >> - OECD
>>> >> - IMF
>>> >> - Worldbank
>>> >> - UN
>>> >> - FAO
>>> >> - data.gov
>>> >> - ...
>>> >>
>>> >> The ease of access to the data is the key of information processing with
>>> >> R.
>>> >>
>>> >> How can we handle the flow of information noise? R has to give an
>>> >> answer to that with an extensive API to public databases.
>>> >>
>>> >> I would love your comments and ideas as a contribution in a vital
>>> >> discussion.
>>> >>
>>> >> Benjamin
>>> >>
>>> >> ______________________________________________
>>> >> R-help@r-project.org mailing list
>>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>>> >> PLEASE do read the posting guide
>>> >> http://www.R-project.org/posting-guide.html
>>> >> and provide commented, minimal, self-contained, reproducible code.
>>> > **********************
>>> > "The contents of this message do not reflect any position of the U.S.
>>> > Government or NOAA."
>>> > **********************
>>> > Roy Mendelssohn
>>> > Supervisory Operations Research Analyst
>>> > NOAA/NMFS
>>> > Environmental Research Division
>>> > Southwest Fisheries Science Center
>>> > 1352 Lighthouse Avenue
>>> > Pacific Grove, CA 93950-2097
>>> >
>>> > e-mail: roy.mendelss...@noaa.gov (Note new e-mail address)
>>> > voice: (831)-648-9029
>>> > fax: (831)-648-8440
>>> > www: http://www.pfeg.noaa.gov/
>>> >
>>> > "Old age and treachery will overcome youth and skill."
>>> > "From those who have been given much, much will be expected"
>>> > "the arc of the moral universe is long, but it bends toward justice" -MLK
>>> > Jr.
--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/