[CODE4LIB] Best practices for improving metadata extractability from journal articles?

Jason Best Tue, 26 Mar 2019 09:53:58 -0700

Hello,
I’m working with our journal to improve the quality of the metadata that can be 
extracted from PDFs of individual journal articles by reference management 
software like Zotero, Mendeley, EndNote, etc. The only description I’ve found 
of the metadata extraction process is from Zotero 
(https://www.zotero.org/support/retrieve_pdf_metadata) which "sends the first 
few pages of a PDF to the web service, which uses a variety of extraction 
algorithms and known metadata from CrossRef, paired with DOI and ISBN lookups, 
to build a parent item for the PDF". What I haven’t found yet is a description
of how to format the text of a PDF to ensure that the article metadata can be 
reliably extracted by reference managers. Most of these journal articles were 
published before we were issuing DOIs (or even before DOIs existed), so I’ll be
adding a cover page to all the PDFs with title, authors, issue, pages, DOI
(issued retroactively), ISSN, etc. I’d like to format these pages in a way that
ensures optimal extraction of metadata, emphasizing of course the DOI and ISSN.
In my experience, Mendeley can sometimes extract the article metadata fairly 
well even without a DOI lookup, so I’d like to aim for a format that is easily
parsable in this way and does not rely 100% on a DOI lookup. Does anyone have any
experience or suggestions on how to craft such a page to work well across 
different reference managers?
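
To make the question concrete, here is a minimal sketch (Python, standard
library only) of one way to check that the DOI and ISSN on a draft cover page
can be picked out of the extracted text by simple pattern matching. The
regular expressions, the function names, and the sample DOI are assumptions
for illustration only; I don't know what Zotero, Mendeley, or EndNote actually
do internally. The ISSN check-digit rule (weights 8 down to 2, modulo 11) is
the standard one.

```python
import re

# Assumed pattern for DOIs: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is a common heuristic, not the rule any particular reference manager uses.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

# ISSN: four digits, a hyphen, three digits, and a check character (digit or X).
ISSN_RE = re.compile(r'\b\d{4}-\d{3}[\dXx]\b')


def issn_is_valid(issn):
    """Return True if the ISSN check digit matches the standard mod-11 rule."""
    chars = issn.replace("-", "").upper()
    if len(chars) != 8 or not chars[:7].isdigit():
        return False
    # Weight the first seven digits 8, 7, ..., 2 and sum the products.
    total = sum(int(c) * w for c, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[7] == expected


def find_identifiers(text):
    """Collect DOI-like and valid ISSN-like strings from cover page text."""
    # Trim trailing punctuation that often follows a DOI at the end of a sentence.
    dois = [m.rstrip(".,;)") for m in DOI_RE.findall(text)]
    issns = [m for m in ISSN_RE.findall(text) if issn_is_valid(m)]
    return {"dois": dois, "issns": issns}


if __name__ == "__main__":
    # Hypothetical cover page text; the DOI below is made up for the example.
    sample = (
        "Title: An Example Article\n"
        "DOI: https://doi.org/10.1234/example.2019.001.\n"
        "ISSN: 0378-5955\n"
    )
    print(find_identifiers(sample))
```

Running the text extracted from a draft cover page through something like this
would at least show whether line breaks, ligatures, or layout choices in the
PDF split the identifiers in a way that a simple scan would miss.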


Regards,
Jason

Jason Best
Director of Biodiversity Informatics
Botanical Research Institute of Texas
1700 University Drive
Fort Worth, Texas 76107

817-332-4441 ext. 230
http://www.brit.org
