Hi All,

As a follow-up to this thread, the video of Anurag Acharya's "Indexing 
Repositories: pitfalls and best practices" talk at Open Repositories 
2015 is now available at:

https://media.dlib.indiana.edu/media_objects/avalon:16122

His discussion of the problems around PDF Cover Pages starts around 34 
minutes in, and he touches on it again around 37 minutes in. But, I would 
recommend watching the entire talk if Google / Google Scholar inclusion 
is of high interest to you at your institution.

- Tim

On 6/19/2015 9:17 AM, Tim Donohue wrote:
> Hi All,
>
> First off, I just wanted to thank everyone for their thoughts, ideas, 
> etc.
>
> It's obvious that this is a very "hot button" topic in the DSpace 
> community. My goal was to get the discussion started now, so that we 
> can determine a way forward.
>
> So, I'd encourage additional feedback on this topic. As of yet, there 
> is no decision to remove this feature from DSpace. My goal in bringing 
> this up is to ensure we are making a "well informed" decision on the 
> benefits & detriments of PDF Cover Pages (and ensuring all of us are 
> aware of both sides of the argument here).
>
> While Google Scholar is not the only scholarly search engine out 
> there, it is one of the ones I hear about most frequently from 
> researchers and repository managers. At the very least, we should take 
> into consideration this feedback from Google Scholar, as it definitely 
> could have an effect on the visibility of DSpace PDFs in GS. So, at a 
> minimum, this should help us to provide more informative warnings 
> about some of the possible detriments of PDF cover pages.
>
> If DCAT (DSpace Community Advisory Team) is interested in re-visiting 
> this, it also may make for a good discussion at one of your monthly 
> calls.
>
> Thanks again all. Please do feel free to keep sending feedback!
>
> - Tim
>
> On 6/18/2015 11:23 AM, Tim Donohue wrote:
>> Hi All,
>>
>> If you attended Open Repositories 2015 (or followed along remotely),
>> you may have heard about the "Indexing Repositories: Pitfalls and Best
>> Practices" talk given by Anurag Acharya (co-creator of Google Scholar).
>>
>> If you haven't yet seen the talk, the slides are available at:
>> http://www.or2015.net/wp-content/uploads/2015/06/or-2015-anurag-google-scholar.pdf
>>
>> The video should be available from the OR15 website in the coming weeks.
>>
>> One of the common indexing "pitfalls" mentioned by Anurag was
>> automatically inserting PDF Cover Pages into PDFs. From what I can
>> recall, there are a few reasons this can be problematic:
>>
>> 1. Google Scholar (and possibly other search engines) attempts to
>> extract metadata from the text of a PDF (using some language processing
>> and format identification techniques). This includes auto-extracting
>> title, abstract and author information from PDFs.
>> Unfortunately, the addition of a PDF coverpage often breaks this
>> metadata extraction, which may result in the document not appearing in
>> Google Scholar.
>>
>> 2. If all the PDF cover pages in your site look nearly identical (or
>> completely identical), the Google Scholar indexer (and again possibly
>> others) may wrongly flag the site for "cloaking" [1]. Essentially, it
>> detects something is "fishy" as all the documents look very similar.
>> This may result in the removal of the entire site from Google Scholar.
>>
>> So, to get to my question. In DSpace 5.0, we actually added a basic PDF
>> Cover Page capability (which was requested by DCAT and others):
>> https://wiki.duraspace.org/display/DSDOC5x/PDF+Citation+Cover+Page
>>
>> As this may have strong implications for inclusion in Google Scholar,
>> should we consider removing this functionality from DSpace?
>>
>> For the time being, I've placed warnings in the Documentation for this
>> feature to try to dissuade institutions from enabling it if Google
>> Scholar inclusion is of high importance.
>>
>> This isn't really a technical issue (as we can easily remove code). But,
>> I am interested in feedback from repository managers and users of DSpace
>> to better inform our decisions on this feature going forward.
>>
>> Thanks,
>>
>> Tim
>>
>> [1] More on "cloaking", which can be a spamming technique to trick
>> search engines (and is therefore actively blocked by many search
>> engines): https://en.wikipedia.org/wiki/Cloaking
>>


_______________________________________________
Dspace-general mailing list
Dspace-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-general
