Re: [Dspace-general] PDF Cover Pages & Google Scholar - Search Engine inclusion implications

Mark Diggory Fri, 19 Jun 2015 10:17:35 -0700

List,

I would also recommend re-evaluating the coverpage contribution from a
Repository Maintainer's perspective. There are several concerns with cover
pages that I recommend discussing. I identify these below:

a.) *Preservation Concern: *Many slightly different copies create a
situation where the original is difficult to identify by preservation
systems. Systems such as LOCKSS rely on crawling and checking last
modified, checksums and file sizes to determine if the file has changed and
thus may require updating in a preservation system. Systems that rely on
OAI-PMH Harvesting and transfer of the bitstream would also struggle with
differences in size and checksum if they are monitoring differences between
METS/ORE technical metadata and the retrieved file.
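
As a rough illustration of the kind of comparison such a system makes, here
is a minimal Python sketch. It is not LOCKSS or DSpace code: the function
name, the `recorded` dictionary (standing in for size and checksum taken
from METS technical metadata) and the choice of SHA-256 are all assumptions
made for the example.

```python
import hashlib


def fixity_matches(content: bytes, recorded: dict) -> bool:
    """Return True if the retrieved bytes match the recorded size and checksum."""
    actual_size = len(content)
    actual_checksum = hashlib.sha256(content).hexdigest()
    return (actual_size == recorded["size"]
            and actual_checksum == recorded["checksum"])


# Hypothetical original bitstream and the technical metadata recorded for it.
original = b"%PDF-1.4 original article bytes"
recorded = {
    "size": len(original),
    "checksum": hashlib.sha256(original).hexdigest(),
}

# Hypothetical bytes delivered once a cover page is prepended on download.
with_cover = b"Cover page, downloaded 2015-06-19T10:17:35\n" + original

original_ok = fixity_matches(original, recorded)
covered_ok = fixity_matches(with_cover, recorded)
```

Here the unmodified file passes the check, while the copy carrying a cover
page differs in both size and checksum and so looks like a changed file.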

b.) *Discovery and Ranking Concern: *Disseminated copies of a PDF are
dynamically different in checksum and size across all downloads. It becomes
difficult for consumers of content to tell they have the "same content" as
was originally downloaded, impacting ranking and inclusion of content into
search services. If a system like Google identified identical files on the
web based on checksum/size, it would not be able to recognize files with
coverpages containing different download timestamps as the same file. In
this case it's not the coverpage that is the problem, it is the
"dynamically changing" coverpage that is the concern.
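
A small Python sketch of that distinction, under stated assumptions: the
fingerprint of (size, SHA-256) and all names below are hypothetical, and
this is not a description of how any search engine actually works. Copies
with a fixed cover page group together; copies whose cover page carries a
download timestamp do not.

```python
import hashlib
from collections import defaultdict


def group_identical(copies: dict) -> list:
    """Group copy names whose bytes share the same (size, SHA-256) fingerprint."""
    groups = defaultdict(list)
    for name, data in copies.items():
        fingerprint = (len(data), hashlib.sha256(data).hexdigest())
        groups[fingerprint].append(name)
    # Sort names within groups, and the groups themselves, for stable output.
    return sorted(sorted(names) for names in groups.values())


article = b"%PDF-1.4 article body"

# A cover page that never changes: every download yields the same bytes.
fixed_cover = b"Cover page\n" + article


def stamped(timestamp: str) -> bytes:
    """Hypothetical cover page that embeds the download timestamp."""
    return b"Downloaded at " + timestamp.encode("ascii") + b"\n" + article


copies = {
    "static-1": fixed_cover,
    "static-2": fixed_cover,
    "dynamic-1": stamped("2015-06-19T10:17:35"),
    "dynamic-2": stamped("2015-06-19T10:17:36"),
}

groups = group_identical(copies)
```

The two static copies end up in one group, while each dynamic copy sits in
a group of its own even though the article inside is byte-for-byte the same.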

c.) *Performance Concern: *Dynamically and repeatedly parsing a PDF file
into an in-memory model to attach a coverpage, and then serializing the
model back to disk before finally streaming it to the browser, has a
significant impact on CPU, disk I/O and RAM usage in comparison to
returning the original bitstream. Performance degrades with the size of
the PDF until files that are larger than available RAM cripple the system
with memory errors. Couple this with increased crawler load due to checksum
and size changes, and the hardware requirements to maintain the system will
be significantly larger and unbounded.
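
For contrast, returning the original bitstream can be done by copying
fixed-size chunks, so memory use stays bounded no matter how large the file
is. The Python sketch below illustrates that idea only; it is not the DSpace
implementation, and the function name and chunk size are assumptions.

```python
import io

CHUNK_SIZE = 64 * 1024  # assumed value: most bytes held in memory at one time


def stream_bitstream(source, sink, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy source to sink one chunk at a time; return the bytes sent."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means end of stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# In-memory stand-ins for a stored file and an HTTP response body.
payload = b"x" * 200_000
source = io.BytesIO(payload)
sink = io.BytesIO()

sent = stream_bitstream(source, sink)
```

Only one chunk is held at a time, whereas attaching a cover page on the fly
requires the whole document to be parsed and rewritten for every request.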

d.) *Architectural Concern: *DSpace was originally designed under the
assumption that Bitstreams were to be guaranteed immutable. This can be
extended to mean that the same bitstream URL
(/bitstream/handle/[handle]/[sequence-id]/file.ext) should always return
the same content and report the same size and checksum. The current
coverpage contribution does not maintain this guarantee: the size and
checksum of the content vary each time the PDF is retrieved from the
URL.

----

I prefer, and have recommended to our clients, an approach that resolves
the issues above. The approach stores a persistent copy of the generated
PDF with the coverpage in the Item. This requires that the original PDF be
moved to a "PRESERVATION" bundle and that the new copy with the coverpage be
added as a replacement in the "CONTENT" bundle.

This assures a number of benefits:

1.) *Provenance: *A record of the changes is kept in DSpace Item
metadata (or version history in some cases), and alterations to the Item and
its Bitstreams are trackable in the system (either through provenance
metadata, version history, event auditing, system logs or statistics).

2.) *Immutability: *Bitstreams no longer "dynamically" change on download;
timestamps, checksums and file sizes are again constant. Bitstream URLs
again return immutable content that matches the checksum and size
communicated in technical metadata.

3.) *Performance: *RAM, I/O and CPU usage are close to constant for all
bitstream downloads, regardless of use of the coverpage.
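
The approach above can be sketched in a few lines of Python. This is a toy
model, not the DSpace API: an Item is represented as a plain dictionary of
bundles, and the function name and the wording of the provenance note are
hypothetical.

```python
def apply_cover_page(item: dict, name: str, covered: bytes) -> dict:
    """Move the original to PRESERVATION and store the covered copy in CONTENT."""
    # Take the original bitstream out of the CONTENT bundle.
    original = item["CONTENT"].pop(name)
    # Keep it, unchanged, in the PRESERVATION bundle.
    item.setdefault("PRESERVATION", {})[name] = original
    # The stored copy with the cover page replaces it in CONTENT.
    item["CONTENT"][name] = covered
    # Record the change so it remains trackable.
    item.setdefault("provenance", []).append(
        f"Cover page added to {name}; original moved to PRESERVATION bundle"
    )
    return item


item = {"CONTENT": {"paper.pdf": b"original bytes"}}
apply_cover_page(item, "paper.pdf", b"cover page + original bytes")
```

Because the covered copy is generated once and stored, every later download
returns the same bytes, and the untouched original stays available for
preservation systems.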

I hope these points will be used to improve the existing cover page
features. It is not my intent to paint the current implementation in a bad
light, but rather to identify, without bias, weaknesses of the current
approach that the community should be made aware of.

Kind Regards,
Mark

p.s. I note, there is a similar approach to my recommendation present in
the Coverpage contribution, via the CitationPage curation task
<https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/curate/CitationPage.java#L55>.
However, I'm not currently recommending it, as it is incomplete: it does
not make the new PDFs, with their cover pages, available in the "CONTENT"
bundle, nor does it have any mechanism to return the covered bitstreams in
the UI from the separate bundle it uses.

On Fri, Jun 19, 2015 at 8:45 AM, Matveyeva, Susan <
susan.matvey...@wichita.edu> wrote:

> Sorry for another intrusion from the trenches.  Our full text processing
> guidelines include entering basic bibliographic data (author, title, date,
> subject or keyword; no abstract) to the .PDF internal metadata.  My opinion
> is that this metadata is part of the repository staff workflow (even if the
> author provided some metadata, it should be checked and edited for
> consistency).
>
> Susan
> __________________________
> Susan Matveyeva, PhD, MLIS, B.Mus
> Associate Professor, Catalog &
> Institutional Repository Librarian
> Wichita State University Libraries
> 1845 Fairmount, Wichita, KS 67260-0068
>
> Office: (316) 978-5139
> Fax: (316) 978-3496
> susan.matvey...@wichita.edu
> http://soar.wichita.edu
>
>
>
> -----Original Message-----
> From: Mark H. Wood [mailto:mw...@iupui.edu]
> Sent: 19 June 2015 10:30
> To: dspace-general@lists.sourceforge.net
> Subject: Re: [Dspace-general] PDF Cover Pages & Google Scholar - Search
> Engine inclusion implications
>
> On Fri, Jun 19, 2015 at 07:52:40AM -0700, Mark Diggory wrote:
> > Putting all this "tail wagging the dog" aside. I think it would be
> > very good to get the appropriate "metadata" added to the PDF.
> >
> > I wanted to contribute that we recently had a "non-coverpage" case
> > where the title of a paper was correct in the first page of the pdf
> > and in the DSpace metadata, but the PDF had the incorrect title in its
> > internal metadata. This caused Google Scholar to show the incorrect
> > title in its search results, which caused much confusion for the owner
> of that document.
> > Changing the metadata resulted in the GS record changing. From this
> > point, it is clear that GS is leaning heavily on PDF internal metadata
> > as its primary source for its records.
> >
> > I think that if the appropriate metadata were populated in the pdf
> > process, that it would take precedence over the cover page in GS.
>
> Hear, hear.  Having correct, complete machine-readable metadata in the
> document itself is a Good Thing.
>
> Researcher:  if you do this yourself, it's in your interest to ensure that
> you do it well.  If you have an assistant to take care of such things, it's
> in your interest to ensure that your assistant knows how to do it well.  If
> you depend on Google Scholar or something like it, you (all) get out of it
> what you (all) put into it.
>
> The notion of a repository doing this automatically, whether
> machine-readably or by generated cover pages, leads to some interesting
> corner cases.  If the title page, repo. metadata, and document metadata
> disagree, which one is correct?  If the document contains poor-quality
> metadata, but it does contain them, then should the repo. *replace* them
> with corrected values?  On the other end of the ingestion process, what if
> we *extract* metadata from the document and then have to correct them? do
> we fix the document?  And regardless of how much we trust our own process,
> will search engines trust our repo., the document metadata, or their own
> heuristic fishing in the first page?
>
> To gather some ideas, we might want to see what commercial publishers do
> about these issues.  (Oh, boy: what if an academic repo. and a publisher
> make *different* adjustments to document metadata?  Can we get repo.s,
> publishers, and researchers to agree on priorities and a process for
> polishing and harmonizing document metadata?)
>
> I think that, in the end, all parties want "the best we can reasonably
> do."  But how do we get there?
>
> --
> Mark H. Wood
> Lead Technology Analyst
>
> University Library
> Indiana University - Purdue University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> www.ulib.iupui.edu
>
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Dspace-general mailing list
> Dspace-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-general
>

-- 
*Mark Diggory*
*2888 Loker Avenue East, Suite 315, Carlsbad, CA. 92010*
*Esperantolaan 4, Heverlee 3001, Belgium*
http://www.atmire.com
