On 3 June 2015 at 19:42, Benjamin Francis <bfran...@mozilla.com> wrote:

> This is what I'd really like to get more of, particularly usage data.
>

I've reached out to a few people at Yahoo, Google and a couple of
universities and have managed to turn up a few studies with useful data
[1][2][3][4].

My conclusions so far are:

   - Microformats are used on a large number of web sites, but they are
   limited by their case-by-case syntax and more fixed vocabulary, and they
   are less formally defined.
   - Microdata and RDFa are vocabulary agnostic, which makes them inherently
   more extensible. They're increasing in popularity due to schema.org and
   consumption by major search engines, whilst the use of Microformats has
   remained relatively constant over time.
   - Microdata is a bit more concise than RDFa but doesn't allow for the
   mixing of vocabularies.
   - Open Graph is a simple form of RDFa with a limited vocabulary and
   limited usefulness in comparison to other formats, but it is very widely
   used due to Facebook and Twitter being major consumers.
   - Microformats are used by more websites (domains), but Microdata is used
   by more web pages (more URLs, more typed entities and more triples) and is
   growing the fastest. Microformats have the breadth, but Microdata has the
   depth. In our case I think what we care about is the latter - the amount of
   pinnable content.
   - JSON-LD is the newest format. The main difference is that it isn't
   intended to be embedded within HTML markup, but is included separately in
   a script tag. It's also useful as a canonical JSON-based format to
   represent all of the other formats.
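
To make the last point concrete, here is a minimal sketch of how a Microdata
item embedded in markup could be represented as JSON-LD. It is only an
illustration, not a proposal for the Gecko implementation: the class name
FlatMicrodataParser is hypothetical, and it assumes flat items only (no nested
itemscope, no multi-valued properties, no itemref), with the vocabulary taken
from everything before the last "/" of the itemtype URL.

```python
from html.parser import HTMLParser


class FlatMicrodataParser(HTMLParser):
    """Hypothetical helper: maps flat Microdata items to JSON-LD-style dicts.

    Assumptions: items are not nested, each property has one value, and an
    item is never explicitly closed (properties are attributed to the most
    recently opened itemscope).
    """

    def __init__(self):
        super().__init__()
        self.items = []        # JSON-LD-style dicts, one per itemscope
        self._current = None   # item currently being filled in
        self._prop = None      # itemprop whose text content is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "itemscope" in a:
            # Split "http://schema.org/Person" into vocabulary and type name.
            vocab, _, type_name = a.get("itemtype", "").rpartition("/")
            self._current = {"@context": vocab, "@type": type_name}
            self.items.append(self._current)
        elif "itemprop" in a and self._current is not None:
            # Prefer explicit values (content, href) over element text.
            value = a.get("content") or a.get("href")
            if value is not None:
                self._current[a["itemprop"]] = value
            else:
                self._prop, self._text = a["itemprop"], []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prop is not None:
            # The first closing tag ends the property's text value.
            self._current[self._prop] = "".join(self._text).strip()
            self._prop = None


parser = FlatMicrodataParser()
parser.feed(
    '<div itemscope itemtype="http://schema.org/Person">'
    '<span itemprop="name">Ben</span>'
    '<a itemprop="url" href="http://example.com/">site</a>'
    "</div>"
)
print(parser.items)
```

A real parser would also need to handle nesting, itemref and the per-element
value rules in the Microdata specification; the sketch only shows the shape of
the mapping.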

That leads me to recommend that we do the following:

   - Parse Microdata and RDFa (including Open Graph) from web pages in Gecko
   - Expose all of this data to Gaia via a single getLinkedData() or
   getStructuredData() method on the Browser API which returns a Promise that
   resolves with the data in a canonical JSON-LD format
   - Also consider supporting JSON-LD directly, as no markup parsing is
   required; we just need to detect a script tag
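
As a rough sketch of what such a method might return, the following collects
JSON-LD script blocks and Open Graph meta tags from a page and returns them as
a single list of JSON-LD-style objects. Everything here is an assumption for
illustration: get_structured_data() is a synchronous Python stand-in for the
proposed Promise-based method, the result shape is not a decided format, and
Microdata and full RDFa are left out for brevity.

```python
import json
from html.parser import HTMLParser


class _StructuredDataParser(HTMLParser):
    """Hypothetical collector for JSON-LD blocks and Open Graph meta tags."""

    def __init__(self):
        super().__init__()
        self.json_ld = []      # parsed contents of JSON-LD script tags
        self.open_graph = {}   # "og:*" property -> content
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and a.get("type") == "application/ld+json":
            # Detecting the script tag is all that JSON-LD needs.
            self._in_json_ld, self._buffer = True, []
        elif tag == "meta" and (a.get("property") or "").startswith("og:"):
            self.open_graph[a["property"]] = a.get("content") or ""

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except ValueError:
                pass  # skip malformed JSON-LD blocks


def get_structured_data(html):
    """Return all structured data found in html as JSON-LD-style dicts."""
    parser = _StructuredDataParser()
    parser.feed(html)
    results = list(parser.json_ld)
    if parser.open_graph:
        # Open Graph properties become compact IRIs under the "og" prefix.
        results.append(
            {"@context": {"og": "http://ogp.me/ns#"}, **parser.open_graph}
        )
    return results


page = (
    '<head><meta property="og:title" content="Hello">'
    '<script type="application/ld+json">'
    '{"@context": "http://schema.org", "@type": "Person", "name": "Ben"}'
    "</script></head>"
)
print(json.dumps(get_structured_data(page), indent=2))
```

Returning one list regardless of the source format is the point of the
sketch: the caller does not need to know which syntax the page used.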

If anyone finds any more usage data, or has a different interpretation of
the data below, then please do share.

Thanks

Ben

   1. Web Data Commons website based on Common Crawl corpus (2009-2014)
   http://webdatacommons.org/
   2. Web Data Commons paper based on Common Crawl corpus (2009-2012)
   http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-2.pdf
   3. Yahoo post based on Yahoo corpus (2011)
   https://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/
   4. Yahoo paper based on Bing corpus (2012)
   http://events.linkeddata.org/ldow2012/papers/ldow2012-inv-paper-1.pdf
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
