Re: [Dspace-general] PDF Cover Pages & Google Scholar - Search Engine inclusion implications

Mark H. Wood Fri, 19 Jun 2015 08:30:38 -0700

On Fri, Jun 19, 2015 at 07:52:40AM -0700, Mark Diggory wrote:
> Putting all this "tail wagging the dog" aside, I think it would be very
> good to get the appropriate "metadata" added to the PDF.
> 
> I wanted to contribute that we recently had a "non-coverpage" case where
> the title of a paper was correct in the first page of the pdf and in the
> DSpace metadata, but the PDF had the incorrect title in its internal
> metadata. This caused Google Scholar to show the incorrect title in its
> search results, which caused much confusion for the owner of that document.
> Changing the metadata resulted in the GS record changing. From this point,
> it is clear that GS is leaning heavily on PDF internal metadata as its
> primary source for its records.
> 
> I think that if the appropriate metadata were populated in the pdf process,
> it would take precedence over the cover page in GS.


Hear, hear.  Having correct, complete machine-readable metadata in the
document itself is a Good Thing.

Researcher:  if you do this yourself, it's in your interest to ensure
that you do it well.  If you have an assistant to take care of such
things, it's in your interest to ensure that your assistant knows how
to do it well.  If you depend on Google Scholar or something like it,
you (all) get out of it what you (all) put into it.

The notion of a repository doing this automatically, whether
machine-readably or by generated cover pages, leads to some
interesting corner cases.  If the title page, repo. metadata, and
document metadata disagree, which one is correct?  If the document
contains poor-quality metadata, but it does contain them, then should
the repo. *replace* them with corrected values?  On the other end of
the ingestion process, what if we *extract* metadata from the document
and then have to correct them?  Do we fix the document?  And regardless
of how much we trust our own process, will search engines trust our
repo., the document metadata, or their own heuristic fishing in the
first page?
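
To make the "which one is correct?" case concrete, here is a minimal
sketch in Python.  It assumes the three titles have already been
pulled out as plain strings by some other means; the function names
(normalize, compare_titles) are made up for illustration and are not
part of DSpace or any PDF library.  It does not try to pick a winner.
It only groups the sources that agree, so that a person can look at
the ones that do not.

```python
import unicodedata


def normalize(title):
    """Fold Unicode form, case, and whitespace so trivial differences are ignored."""
    if title is None:
        return None
    folded = unicodedata.normalize("NFKC", title).casefold()
    # Collapse runs of whitespace; treat an empty result as "no title present".
    return " ".join(folded.split()) or None


def compare_titles(title_page, repo_metadata, document_metadata):
    """Group the three (hypothetical) title sources by their normalized value."""
    sources = {
        "title_page": title_page,
        "repo_metadata": repo_metadata,
        "document_metadata": document_metadata,
    }
    groups = {}
    for name, value in sources.items():
        groups.setdefault(normalize(value), []).append(name)
    # Sources with no usable title are reported separately.
    missing = groups.pop(None, [])
    return {
        "agree": len(groups) <= 1 and not missing,
        "groups": groups,
        "missing": missing,
    }
```

Called with the situation Mark Diggory describes (title page and repo.
metadata match, the PDF's internal title does not), the result has
"agree" set to False and the internal title sitting in a group by
itself, which is the point at which someone has to decide what to fix.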

To gather some ideas, we might want to see what commercial publishers
do about these issues.  (Oh, boy: what if an academic repo. and a
publisher make *different* adjustments to document metadata?  Can we
get repo.s, publishers, and researchers to agree on priorities and a
process for polishing and harmonizing document metadata?)

I think that, in the end, all parties want "the best we can reasonably
do."  But how do we get there?

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu

_______________________________________________
Dspace-general mailing list
Dspace-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-general