On Feb 10, 2019, at 9:38 PM, Conal Tuohy <[email protected]> wrote:

> You may have tried this already, but it seems that Hathi also offer PDF-
> and EBM-formatted data at the volume level. Do those formats include the
> OCR text? I have seen this done in PDF before (and I've done it myself):
> the files contain bitmap page images but the OCR text is also there, in a
> layer beneath the images.

Alas, programmatically downloading a PDF file with embedded OCR is not an 
option. The documentation [1] says a format called ebm is possible, and pdf or 
epub are for the future. After programmatically authenticating and submitting 
the following RESTful URLs, I get either error 400 ("invalid or missing format 
parameter value format=pdf") or 403 ("insufficient privilege"):

  https://babel.hathitrust.org/cgi/htd/volume/uva.x000274833?format=pdf
  https://babel.hathitrust.org/cgi/htd/volume/uva.x000274833?format=ebm

Apparently, PDF is not supported (yet), and ebm is restricted to the Espressnet 
Project.
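
For anyone who wants to repeat the experiment, here is a minimal Python sketch 
of how the two URLs above are put together and how the two errors might be 
labeled. It deliberately leaves out the authentication step, the function names 
are made up for illustration, and percent-encoding the identifier is an 
assumption on my part.

```python
from urllib.parse import quote, urlencode

BASE = "https://babel.hathitrust.org/cgi/htd"

# Meanings taken from the error messages quoted above.
ERRORS = {
    400: "invalid or missing format parameter value",
    403: "insufficient privilege",
}


def volume_url(htid, fmt):
    """Build a volume URL of the same shape as the two shown above."""
    # Assumption: identifiers with unusual characters need percent-encoding.
    query = urlencode({"format": fmt})
    return f"{BASE}/volume/{quote(htid)}?{query}"


def explain(status):
    """Translate an HTTP status code into the message seen above."""
    return ERRORS.get(status, "unexpected status")


for fmt in ("pdf", "ebm"):
    print(volume_url("uva.x000274833", fmt))
```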


On Feb 11, 2019, at 9:52 AM, Angelina Zaytsev <[email protected]> wrote:

> As a University of Notre Dame librarian, you should be able to log into 
> HathiTrust by clicking the login button, selecting University of Notre Dame 
> from the dropdown menu, and then logging in with your Notre Dame username and 
> password. Once you do so, you'll be able to download the full pdf through the 
> user interface (the full pdf download option is not available to non-logged 
> in, non-member users) at https://hdl.handle.net/2027/uva.x000274833 . That 
> option should be easier for retrieving the full pdf than using the Data API. 
> 
> If you just need the plain text OCR for a few books, you may want to download 
> the pdfs from the user interface and use the "Export to" feature in Adobe to 
> save the OCR that's embedded within the pdf as a txt file. 
> 
> If you need the OCR for a larger number of volumes, then you may want to 
> consider requesting a dataset (see https://www.hathitrust.org/datasets ) or 
> using the HathiTrust Research Center services (see 
> https://analytics.hathitrust.org/ ). The datasets are more appropriate for 
> thousands of volumes.

The process outlined above is feasible for the analysis of a few documents, no 
more than a dozen and probably fewer. The process -- while functional -- is not 
feasible for the person who wants to study the complete works of X, all the 
things written in English during a particular decade, or any number of other 
subsets of the 'Trust. Nor is the English major, historian, or even typical 
librarian going to VPN into a virtual machine, work from the command line, 
invoke the secure environment, and then write Python scripts to do their good 
work. The learning curve is too high.


On Feb 11, 2019, at 1:04 PM, Kyle Banerjee <[email protected]> wrote:

> Haven't used the Hathi API before. Is multithreading possible or do
> tech/policy constraints make that approach a nonstarter or otherwise not
> worth pursuing?
> 
> Peeking the documentation, I noticed a htd:numpages element. If that is
> usable, it would prevent the need to rely on errors to detect the document
> end.

Multithreading? Yes, I've given that some thought. Nowadays our computers 
have multiple cores. My computer at work is nothing very special, and it has 8 
cores. My computer at home has 4. When I bought them I had no idea they had more 
than one.  8-)  In the recent past I learned more about parallel processing, 
and yes, multithreading is an option; use each core to get a page, and when all 
pages are gotten, assemble the whole. Both cool & "kewl".
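
A minimal sketch of the idea in Python, assuming the number of pages is known 
ahead of time. The fetch_page function is a hypothetical stand-in for an 
authenticated request for one page of OCR; it is not part of any HathiTrust 
library. Since the work is mostly waiting on the network, the number of threads 
need not match the number of cores.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(htid, seq):
    """Hypothetical stand-in: pretend to download the OCR of one page."""
    return f"[{htid} page {seq}]"


def fetch_volume(htid, num_pages, workers=8):
    """Get every page in parallel, then assemble the whole in page order."""
    sequences = range(1, num_pages + 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands results back in input order, even when
        # later pages happen to finish downloading first.
        pages = list(pool.map(lambda seq: fetch_page(htid, seq), sequences))
    return "\n".join(pages)


print(fetch_volume("uva.x000274833", 3))
```

Whether the 'Trust's rate limits or policies allow many simultaneous requests 
is a separate question, and this sketch does not answer it.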

htd:numpages? I'd like to know more about this. I didn't see that element in 
the documentation, nor in the various metadata files I've downloaded.


The HathiTrust is such a rich resource, but it is not easy to use at the medium 
scale. Reading & analyzing a few documents is easy. It is entirely possible to 
generate PDF files, download them, print them (gasp!), extract their underlying 
plain (OCR) text, and use both traditional as well as non-traditional methods 
(text mining) to read their content. At the other end of the scale I might be 
able to count & tabulate all of the adjectives used in the 19th century or see 
when the phrase "ice cream" first appeared in the lexicon.

On the other hand, I believe more realistic use cases exist: analyzing the 
complete works of Author X, comparing & contrasting Author X with Author Y, 
learning how the expression or perception of gender may have changed over time, 
determining whether or not there are themes associated with specific places, 
etc.

I imagine the following workflow:

  1. create HathiTrust collection
  2. download collection as CSV file
  3. use something like Excel, a database program, or OpenRefine
     to create subsets of the collection
  4. programmatically download items' content & metadata
  5. update CSV file with downloaded & gleaned information
  6. do analysis against the result
  7. share results & analysis

Creating the collection (#1) is easy. Search the 'Trust, mark items of 
interest, repeat until done (or tired). 

Downloading (#2) is trivial. Mash the button.

Creating subsets (#3) is easier than one might expect. Yes, there are MANY 
duplicates in a collection, but OpenRefine is GREAT at normalizing 
("clustering") data, and once it is normalized, duplicates can be removed 
confidently. In the end, a "refined" set of HathiTrust identifiers can be 
output. 
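
For those who prefer a script, the same idea can be sketched in Python. This is 
a much cruder stand-in for OpenRefine's clustering: titles are lowercased and 
stripped of punctuation, and only the first item with a given normalized title 
is kept. The column names (htid and title) and the sample rows are invented for 
illustration; a real collection file may be laid out differently.

```python
import csv
import io
import re


def normalize(title):
    """Crude key: lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", title.lower()).split())


def refine(rows, id_col="htid", title_col="title"):
    """Return the identifiers of the first item seen for each normalized title."""
    seen = set()
    keep = []
    for row in rows:
        key = normalize(row[title_col])
        if key not in seen:
            seen.add(key)
            keep.append(row[id_col])
    return keep


# Invented sample data; the identifiers are not real HathiTrust identifiers.
sample = io.StringIO(
    'htid,title\n'
    'demo.001,"Walden; or, Life in the woods."\n'
    'demo.002,"WALDEN, or life in the woods"\n'
    'demo.003,Cape Cod\n'
)
ids = refine(csv.DictReader(sample))
print(ids)
```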

Given a set of identifiers, it ought to be easy to programmatically download 
(#4) the many flavors of 'Trust items: PDF, OCRed plain text, bibliographic 
metadata, and the cool JSON files with embedded part-of-speech analysis. This 
is the part which is giving me the most difficulty. It is slow, with download 
speeds of 1000 bytes/minute. [2] It requires access control & authentication, 
which I sincerely understand & appreciate. And it involves multiple data 
structures. For example, the 
bibliographic metadata is presented as a stream of JSON, and embedded in it is 
an escaped XML file, which, in turn, is the manifestation of a MARC 
bibliographic record. Yikes!
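
Peeling those layers apart can be sketched in a few lines of Python. The key 
name ("marc-xml") and the sample record are assumptions made for illustration; 
the real JSON is likely nested more deeply. The MARCXML namespace and the use of 
field 245, subfield a, for the title are standard.

```python
import json
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def title_from_payload(payload, xml_key="marc-xml"):
    """Unwrap JSON, then the XML escaped inside it, then the MARC title."""
    record = json.loads(payload)           # layer 1: JSON
    root = ET.fromstring(record[xml_key])  # layer 2: XML stored as a JSON string
    # Layer 3: MARC. Field 245, subfield a, holds the title proper.
    node = root.find(
        ".//marc:datafield[@tag='245']/marc:subfield[@code='a']", MARC_NS
    )
    return node.text if node is not None else None


# Invented sample record, shaped like the structure described above.
marc = (
    '<record xmlns="http://www.loc.gov/MARC21/slim">'
    '<datafield tag="245" ind1="1" ind2="0">'
    '<subfield code="a">An invented title</subfield>'
    '</datafield></record>'
)
payload = json.dumps({"marc-xml": marc})
print(title_from_payload(payload))
```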

After the many flavors are downloaded, more interesting information can be 
gleaned: sentences, parts-of-speech, named entities, readability scores, 
sentiment measures, log-likelihood ratios, "topics" & other types of clusters, 
definitive characteristics of similarly classified documents, etc. In the end 
the researcher would have created a rich & thorough dataset (#5). This is the 
sort of work I do on a day-to-day basis. 

Through traditional reading as well as through statistics, the researcher can 
then do #6 against the printed PDF files and dataset. This is where I provide 
assistance, but I don't do the "real" work; this is primarily the work of 
discipline-specific researchers. 

Again, the HathiTrust is really cool, but getting content out of it is not 
easy. But maybe I'm trying to use it when my use case is secondary to the 
'Trust's primary purpose. After all, isn't the 'Trust primarily about 
preservation? "An elephant never forgets."


[1] documentation - 
https://www.hathitrust.org/documents/hathitrust-data-api-v2_20150526.pdf
[2] At a rate of 1000 bytes/minute, it would take you approximately 60 seconds 
to download this email message.

-- 
Eric Lease Morgan
Digital Initiatives Librarian, Navari Family Center for Digital Scholarship
Hesburgh Libraries

University of Notre Dame
250E Hesburgh Library
Notre Dame, IN 46556
o: 574-631-8604
e: [email protected]
w: cds.library.nd.edu
