Re: Updated LOD Cloud Diagram - what is the message?

Christian Bizer Mon, 18 Aug 2014 02:11:19 -0700

Hi Giovanni and all,


Our goals in creating the new diagram were mostly empirical, as we wanted to 
know ourselves and report how the Linked Open Data cloud has evolved since 2011.

 

So we don’t plan to push a specific message with the diagram, but I agree with 
you that the release of the diagram could be a good occasion for the community 
to discuss the possible messages/conclusions that one could draw from it and I 
would be happy if more people would comment on this.

 

My two cents to this discussion:

 

I think that it is hard to draw a single conclusion from the diagram, but that 
the conclusion depends on the specific requirements of each data consumer. 

 

If you build an application that requires DBpedia/YAGO/Freebase/UMBEL/Cyc-style 
general knowledge about entities, or you build an application that requires 
geographic, life science, or linguistic data, the datasets can be quite useful 
for you, and the fact that they are partly interlinked can save you quite some 
work as you need to invest less effort into integrating them yourself.

 

On the other hand, if you expect complete coverage of all datasets that are 
relevant to your domain of interest or perfect data quality and currency, the 
Web of Linked Data obviously does not deliver this yet and the question is of 
course if it will deliver this in the future.

 

Personally, I think it is quite interesting to compare the deployment of 
Microdata/RDFa/Microformats and Linked Data on the Web. We also investigated 
the deployment of Microdata/RDFa/Microformats  [1][2] and the comparison 
currently looks like this:

 

1.       The overall number of websites publishing Microdata/RDFa/Microformats 
is three orders of magnitude larger than the number of websites publishing 
Linked Data.

2.       Topic-wise, Microdata/RDFa/Microformats markup covers products, 
reviews, businesses, addresses, events, people, job postings and recipes, while 
Linked Data covers much more specific data from domains such as e-government, 
libraries, life science, linguistics or geography. So there is not too much 
overlap between the data that is published using the two technologies.

3.       In the context of Microdata/RDFa/Microformats, data providers do not 
set links pointing at data items in other datasets. In the Linked Data context, 
data providers do set such links to a certain extent. Not setting links of 
course reduces the effort required from data publishers (you just need to add 
some semantic markup to the PHP template that renders your website and you are 
done). On the other hand, without such links, using the data within 
applications is much more painful. For an example of how much effort it took to 
integrate some Microdata describing products from different websites, see [3] 
(we needed sophisticated information extraction techniques to generate features 
from the product names and descriptions and then sophisticated identity 
resolution techniques to guess which descriptions refer to the same product).

4.       The Microdata/RDFa/Microformats data is very shallow, with usually 
only 3 or 4 attributes used to describe an entity and the most interesting 
semantics only provided as free text (long product or job descriptions as 
text). In contrast, the data that is published as Linked Data is often much 
more structured (e-government, life science data, general-purpose KBs), its 
entities are described with more attributes (having kind of well-defined 
semantics), and it is thus likely to enable more sophisticated applications.
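
To make the contrast in point 3 concrete, here is a minimal sketch. All 
records, names, URIs and the 0.8 threshold are invented for illustration; it is 
not the method used in [3], which relied on far more sophisticated extraction 
and identity resolution. It only shows that publisher-set links turn matching 
into an exact join, while data without links leaves the consumer guessing from 
strings.

```python
from difflib import SequenceMatcher

# Hypothetical product records from two shops (invented data).
# "same_as" stands in for a link the publisher set to a shared URI.
shop_a = [
    {"id": "a1", "name": "Acme X200 Smartphone 32GB",
     "same_as": "http://example.org/product/x200"},
    {"id": "a2", "name": "Acme T10 Tablet",
     "same_as": "http://example.org/product/t10"},
]
shop_b = [
    {"id": "b1", "name": "ACME x200 smartphone (32 GB)",
     "same_as": "http://example.org/product/x200"},
    {"id": "b2", "name": "Bolt Z5 Laptop",
     "same_as": "http://example.org/product/z5"},
]


def match_by_link(left, right):
    """With links: records match when they point at the same URI."""
    by_uri = {r["same_as"]: r["id"] for r in right}
    return [(l["id"], by_uri[l["same_as"]])
            for l in left if l["same_as"] in by_uri]


def match_by_name(left, right, threshold=0.8):
    """Without links: guess matches from the similarity of the names."""
    pairs = []
    for l in left:
        for r in right:
            score = SequenceMatcher(
                None, l["name"].lower(), r["name"].lower()).ratio()
            if score >= threshold:
                pairs.append((l["id"], r["id"], round(score, 2)))
    return pairs


linked = match_by_link(shop_a, shop_b)    # exact, no tuning needed
guessed = match_by_name(shop_a, shop_b)   # depends on the threshold
print(linked)
print(guessed)
```

The link-based join needs no tuning, whereas the result of the name-based 
guess changes with the chosen threshold and with how each shop words its 
product names.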

 

Looking at this comparison, I think the empirical results nicely reflect the 
strengths of both technologies. Microdata/RDFa/Microformats aim at being a 
simple technology for annotating webpages that demands very little effort from 
webmasters in order to achieve widespread deployment (Guha made this point 
rather clear in his LDOW2014 keynote [4]). 

Linked Data on the other hand is a technology for sharing the data integration 
effort between data publishers and data consumers (the more effort publishers 
put into setting RDF links, the easier it becomes for data consumers to use the 
data). 

 

Thus, it makes sense that we see Linked Data adoption within communities that 
have an interest in making their data easy to use and thus are willing to 
invest effort into this, like libraries, government and science (with life 
science and language processing being the first communities adopting the 
technologies) and social networking.

 

On the other hand, it makes sense that we see widespread adoption of 
Microdata/RDFa/Microformats by communities that mostly want to push their data 
into Google applications in order to get more traffic/turnover for their 
sites/businesses and are thus not interested in linking to others (which are 
also likely their competitors).

 

Concerning your question of who published the datasets in the cloud (the data 
producers themselves or some third parties like interested hackers and other 
data enthusiasts), we did not investigate this in detail and I would be very 
happy if somebody else would do this. But my general feeling is that, compared 
to 2011, more datasets are published by the actual data producers or parties 
close to them (for instance in the domains of e-government, libraries, or 
cross-domain knowledge bases).

 

These are my two cents to the overall discussion and I would be very happy to 
hear what others think about the message that can be drawn from the new diagram.

 

Cheers,

 

Chris

 

 

[1] 
http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/Bizer-etal-DeploymentRDFaMicrodataMicroformats-ISWC-InUse-2013.pdf

[2] 
http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/Meusel-etal-TheWDCMicrodataRdfaMicroformatsDataSeries-ISWC2014-rbds.pdf

[3] 
http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/petrovski_bryl_bizer_deos2014.pdf

[4] 
http://events.linkeddata.org/ldow2014/slides/ldow2014_keynote_guha_schema_org.pdf

 

 

 

 

From: Giovanni Tummarello [mailto:[email protected]] 
Sent: Sunday, 17 August 2014 16:43
To: Christian Bizer
Cc: Linking Open Data
Subject: Updated LOD Cloud Diagram - what is the message?

 

Chris hi, 

 

i would be interested in discussing what message will accompany this new 
version.

 

If i am not wrong there appear to be more bubbles than "last time here" so i 
wonder is the message that's going out with this diagram that  "adoption has 
increased" (e.g. as there were 200 and now there are 500)? 

 

if so, i do wonder if that is not misleading, based on this diagram alone.

 

For example how many of these are published by independent individuals or 
organizations (some IP technique might be handy here also)?  

 

That statusnet, gov.uk, bio2rdf etc have gone a bit more industrial and 
published plenty of datasets is good, but is that significant in evaluating a 
general data publishing technology? 

 

More interesting would be: how many of these are private companies, not in 
the context of a publicly funded research project? are there many that are 
just created by "hackers" or students just making a point? 

 

So many of the old datasets seem to have disappeared, what happened to them? 

 

Are those that stayed really alive and used? (i see http://revyu.com whose 
biggest tag is "good beers from 2007", the year when it was used by people at 
the banff conference)

Is the usage really significant? (i see "apache" "o'reilly" - really?) 

 

So. bottom line. 

 

Sure one can say "hey we gave a definition and we're following it to create 
this diagram, everything else is out of the question".  

 

.. and sure it doesn't have to be YOU answering all those questions above. (i 
guess your list of sites is public for others to investigate?).

 

I would however think it important that the message sent with this new diagram 
did its best to avoid being possibly misleading :) 

 

What are your thoughts?

Gio

 

 

 

On Fri, Aug 15, 2014 at 9:07 AM, Christian Bizer <[email protected]> wrote:

Hi all,

on July 24th, we published a Linked Open Data (LOD) Cloud diagram containing
"crawlable" linked datasets and asked the community to point us at further
datasets that our crawler has missed [1].

Many thanks to everybody who responded to our call and entered
missing datasets into the DataHub catalog [2].

Based on your feedback, we have now drawn a draft version of the LOD cloud
containing:
1.      the datasets that our crawler discovered
2.      the datasets that did not allow crawling
3.      the datasets you pointed us at.

The new version of the cloud altogether contains 558 linked datasets which
are connected by altogether 2883 link sets. As we were pointed at quite a
number of linguistic datasets [3], we added linguistic data as a new
category to the diagram.

The current draft version of the LOD Cloud diagram is found at:

http://data.dws.informatik.uni-mannheim.de/lodcloud/2014/ISWC-RDB/extendedLODCloud/extendedCloud.png

Please note that we only included datasets that are accessible via
dereferenceable URIs and are interlinked with other datasets.

It would be great if you could check if we correctly included your datasets
into the diagram and whether we missed some link sets pointing from your
datasets to other datasets.

If we did miss something, it would be great if you could point us at what we
have missed and update your entry in the DataHub catalog [2] accordingly.

Please send us feedback by August 20th. Afterwards, we will finalize the
diagram and publish the final August 2014 version.

Cheers,

Chris, Max and Heiko

--
Prof. Dr. Christian Bizer
Data and Web Science Research Group
Universität Mannheim, Germany 
[email protected]
www.bizer.de
