On 3 June 2015 at 19:42, Benjamin Francis <bfran...@mozilla.com> wrote:

> This is what I'd really like to get more of, particularly usage data.
>

I've reached out to a few people at Yahoo, Google and a couple of universities and have managed to turn up a few studies with useful data [1][2][3][4].

My conclusions so far are:

- Microformats are used on a large number of web sites, but are limited by their case-by-case syntax and more fixed vocabulary, and are less formally defined.
- Microdata and RDFa are vocabulary agnostic, which makes them inherently more extensible. They're increasing in popularity due to schema.org and consumption by major search engines, whilst the use of Microformats has remained relatively constant over time.
- Microdata is a bit more concise than RDFa but doesn't allow for the mixing of vocabularies.
- Open Graph is a simplistic form of RDFa with a limited vocabulary and limited usefulness in comparison to other formats, but is very widely used due to Facebook and Twitter being major consumers.
- Microformats are used by more websites (domains), but Microdata is used by more web pages (more URLs, more typed entities and more triples) and is growing the fastest. Microformats have the breadth, but Microdata has the depth. In our case I think what we care about is the latter - the amount of pinnable content.
- JSON-LD is the newest format, the main difference being that it isn't intended to be embedded in the HTML markup, but is included separately in a script tag. It's also useful as a canonical JSON-based format to represent all of the other formats.

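To make the JSON-LD and Open Graph points a bit more concrete, here is a rough sketch in Python (not Gecko code, and not a proposal for the actual implementation). It assumes that JSON-LD can be picked up by looking for script tags with type "application/ld+json", and that Open Graph properties can be represented as a JSON-LD object using the "og" prefix for http://ogp.me/ns#. The names StructuredDataParser and get_structured_data are made up for this illustration, and Microdata and full RDFa are left out entirely.

```python
import json
from html.parser import HTMLParser


class StructuredDataParser(HTMLParser):
    """Collects JSON-LD script blocks and Open Graph <meta> properties."""

    def __init__(self):
        super().__init__()
        self.json_ld = []      # parsed JSON-LD documents found in script tags
        self.open_graph = {}   # e.g. {"og:title": "..."}; last value wins
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        script_type = (attrs.get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            # Start buffering the script body; it is plain JSON.
            self._in_json_ld = True
            self._buffer = []
        elif tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content") is not None:
                self.open_graph[prop] = attrs["content"]

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except ValueError:
                pass  # skip malformed blocks rather than fail the whole page


def get_structured_data(html):
    """Return a list of JSON-LD style objects found in the given HTML string."""
    parser = StructuredDataParser()
    parser.feed(html)
    parser.close()
    documents = list(parser.json_ld)
    if parser.open_graph:
        # Assumption: expose Open Graph as one object with an "og" prefix.
        doc = {"@context": {"og": "http://ogp.me/ns#"}}
        doc.update(parser.open_graph)
        documents.append(doc)
    return documents


if __name__ == "__main__":
    sample = """<html><head>
    <meta property="og:title" content="The Rock" />
    <script type="application/ld+json">
    {"@context": "http://schema.org", "@type": "Movie", "name": "The Rock"}
    </script>
    </head><body></body></html>"""
    print(json.dumps(get_structured_data(sample), indent=2))
```

Note that this sketch keeps only the last value of a repeated Open Graph property (for example several og:image tags); a real implementation would presumably need to collect those into an array.
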
That leads me to recommend that we do the following:

- Parse Microdata and RDFa (including Open Graph) from web pages in Gecko
- Expose all of this data to Gaia via a single getLinkedData() or getStructuredData() method on the Browser API which returns a Promise that resolves with the data in a canonical JSON-LD format
- Also consider supporting JSON-LD directly as no parsing is required, we just need to detect a script tag

If anyone finds any more usage data, or has a different interpretation of the data below, then please do share.

Thanks

Ben

1. Web Data Commons website based on Common Crawl corpus (2009-2014)
   http://webdatacommons.org/
2. Web Data Commons paper based on Common Crawl corpus (2009-2012)
   http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-2.pdf
3. Yahoo post based on Yahoo corpus (2011)
   https://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/
4. Yahoo paper based on Bing corpus (2012)
   http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-1.pdf

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform