Hi All,

If you attended Open Repositories 2015 (or followed along remotely), 
you may have heard about the "Indexing Repositories: Pitfalls and Best 
Practices" talk given by Anurag Acharya (co-creator of Google Scholar).

If you haven't yet seen the talk, the slides are available at:
http://www.or2015.net/wp-content/uploads/2015/06/or-2015-anurag-google-scholar.pdf

The video should be available from the OR15 website in the coming weeks.

One of the common indexing "pitfalls" Anurag mentioned was 
automatically inserting cover pages into PDFs. From what I can 
recall, there are a few reasons this can be problematic:

1. Google Scholar (and possibly other search engines) attempts to 
extract metadata from the text of the PDF (using some language 
processing and format identification techniques). This includes 
automatically extracting the title, abstract and author information. 
Unfortunately, adding a PDF cover page often breaks this metadata 
extraction, which may result in the document not appearing in 
Google Scholar.

2. If all the PDF cover pages on your site look nearly identical (or 
completely identical), the Google Scholar indexer (and again possibly 
others) may wrongly flag the site for "cloaking" [1]. Essentially, it 
detects that something is "fishy" because all the documents look very 
similar. This may result in the removal of the entire site from Google 
Scholar.

So, to get to my question: in DSpace 5.0, we added a basic PDF Cover 
Page capability (which was requested by DCAT and others):
https://wiki.duraspace.org/display/DSDOC5x/PDF+Citation+Cover+Page

As this may have strong implications for inclusion in Google Scholar, 
should we consider removing this functionality from DSpace?

For the time being, I've placed warnings in the documentation for this 
feature to try to dissuade institutions from enabling it if Google 
Scholar inclusion is of high importance.

This isn't really a technical issue (as we can easily remove the code), 
but I am interested in feedback from repository managers and users of 
DSpace to better inform our decisions on this feature going forward.

Thanks,

Tim

[1] More on "cloaking", which can be a spamming technique to trick 
search engines (and is therefore actively blocked by many search 
engines): https://en.wikipedia.org/wiki/Cloaking

-- 
Tim Donohue
Technical Lead for DSpace & DSpaceDirect
DuraSpace.org | DSpace.org | DSpaceDirect.org
