Hi All,

As a follow-up to this thread, the video of Anurag Acharya's "Indexing Repositories: pitfalls and best practices" talk at Open Repositories 2015 is now available at:
https://media.dlib.indiana.edu/media_objects/avalon:16122

His discussion of the problems with PDF Cover Pages starts around 34 minutes in, and he touches on them again around 37 minutes in. But, I would recommend watching the entire talk if Google / Google Scholar inclusion is of high interest to you at your institution.

- Tim

On 6/19/2015 9:17 AM, Tim Donohue wrote:
> Hi All,
>
> First off, I just wanted to thank everyone for their thoughts, ideas,
> etc.
>
> It's obvious that this is a very "hot button" topic in the DSpace
> community. My goal was to get the discussion started now, so that we
> can determine a way forward.
>
> So, I'd encourage additional feedback on this topic. As of yet, there
> is no decision to remove this feature from DSpace. My goal in bringing
> this up is to ensure we are making a "well informed" decision on the
> benefits & detriments of PDF Cover Pages (and ensuring all of us are
> aware of both sides of the argument here).
>
> While Google Scholar is not the only scholarly search engine out
> there, it is one of the ones I hear about most frequently from
> researchers and repository managers. At the very least, we should take
> into consideration this feedback from Google Scholar, as it definitely
> could have an effect on the visibility of DSpace PDFs in GS. So, at a
> minimum, this should help us to provide more informative warnings
> about some of the possible detriments of PDF cover pages.
>
> If DCAT (DSpace Community Advisory Team) is interested in re-visiting
> this, it also may make for a good discussion at one of your monthly
> calls.
>
> Thanks again all. Please do feel free to keep sending feedback!
>
> - Tim
>
> On 6/18/2015 11:23 AM, Tim Donohue wrote:
>> Hi All,
>>
>> If you attended Open Repositories 2015 (or followed along remotely),
>> you may have heard about the "Indexing Repositories: Pitfalls and Best
>> Practices" talk given by Anurag Acharya (co-creator of Google Scholar).
>>
>> If you haven't yet seen the talk, the slides are available at:
>> http://www.or2015.net/wp-content/uploads/2015/06/or-2015-anurag-google-scholar.pdf
>>
>> The video should be available from the OR15 website in the coming weeks.
>>
>> One of the common indexing "pitfalls" mentioned by Anurag was
>> automatically inserting PDF Cover Pages into PDFs. From what I can
>> recall, there are a few reasons this can be problematic:
>>
>> 1. Google Scholar (and possibly other search engines) attempts to
>> extract metadata from the text of the PDF (using some language processing
>> and format identification techniques). This metadata includes
>> auto-extracted title, abstract and author information from PDFs.
>> Unfortunately, the addition of a PDF cover page often breaks this
>> metadata extraction, which may result in the document not appearing in
>> Google Scholar.
>>
>> 2. If all the PDF cover pages in your site look nearly identical (or
>> completely identical), the Google Scholar indexer (and again possibly
>> others) may wrongly flag the site for "cloaking" [1]. Essentially, it
>> detects something is "fishy" as all the documents look very similar.
>> This may result in the removal of the entire site from Google Scholar.
>>
>> So, to get to my question. In DSpace 5.0, we actually added a basic PDF
>> Cover Page capability (which was requested by DCAT and others):
>> https://wiki.duraspace.org/display/DSDOC5x/PDF+Citation+Cover+Page
>>
>> As this may have strong implications for inclusion in Google Scholar,
>> should we consider removing this functionality from DSpace?
>>
>> For the time being, I've placed warnings in the Documentation for this
>> feature to try to dissuade institutions from enabling it if Google
>> Scholar inclusion is of high importance.
>>
>> This isn't really a technical issue (as we can easily remove code).
>> But, I am interested in feedback from repository managers and users of DSpace
>> to better inform our decisions on this feature going forward.
>>
>> Thanks,
>>
>> Tim
>>
>> [1] More on "cloaking", which can be a spamming technique to trick
>> search engines (and is therefore actively blocked by many search
>> engines): https://en.wikipedia.org/wiki/Cloaking

_______________________________________________
Dspace-general mailing list
Dspace-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-general