Hi All,

If you attended Open Repositories 2015 (or followed along remotely), you may have heard about the "Indexing Repositories: Pitfalls and Best Practices" talk given by Anurag Acharya (co-creator of Google Scholar).
If you haven't yet seen the talk, the slides are available at:
http://www.or2015.net/wp-content/uploads/2015/06/or-2015-anurag-google-scholar.pdf
The video should be available from the OR15 website in the coming weeks.

One of the common indexing "pitfalls" Anurag mentioned was automatically inserting cover pages into PDFs. From what I can recall, there are a few reasons this can be problematic:

1. Google Scholar (and possibly other search engines) attempts to extract metadata from the text of the PDF (using some language processing and format identification techniques). This includes auto-extracting title, abstract and author information. Unfortunately, the addition of a PDF cover page often breaks this metadata extraction, which may result in the document not appearing in Google Scholar.

2. If all the PDF cover pages in your site look nearly (or completely) identical, the Google Scholar indexer (and again possibly others) may wrongly flag the site for "cloaking" [1]. Essentially, it detects that something is "fishy" because all the documents look very similar. This may result in the removal of the entire site from Google Scholar.

So, to get to my question. In DSpace 5.0, we actually added a basic PDF Cover Page capability (which was requested by DCAT and others):
https://wiki.duraspace.org/display/DSDOC5x/PDF+Citation+Cover+Page

As this may have strong implications for inclusion in Google Scholar, should we consider removing this functionality from DSpace? For the time being, I've placed warnings in the documentation for this feature to try to dissuade institutions from enabling it if Google Scholar inclusion is of high importance.

This isn't really a technical issue (as we can easily remove code). But I am interested in feedback from repository managers and users of DSpace to better inform our decisions on this feature going forward.

Thanks,
Tim

[1] More on "cloaking", which can be a spamming technique to trick search engines (and is therefore actively blocked by many search engines): https://en.wikipedia.org/wiki/Cloaking

--
Tim Donohue
Technical Lead for DSpace & DSpaceDirect
DuraSpace.org | DSpace.org | DSpaceDirect.org