[CODE4LIB] index of just about any content

Eric Lease Morgan Fri, 25 Jun 2021 10:48:46 -0700

If you could create an index of just about any content you desired, then what 
content might you index?



Background

I have been working with a number of colleagues to index sets of content and 
provide enhanced services against search results. To date we have indexed more 
than 500,000 scholarly scientific journal articles on the topic of COVID-19. We 
have also indexed about 30,000 books from the venerable Project Gutenberg. 
Behind the scenes we have done very similar things to about half of a 
collection called Early English Books Online. We have also developed tools to 
enhance search results applied against Internet Archive Scholar.

This work is currently sponsored by two distinct organizations. The first is an 
organization called XSEDE, hosted at the Pittsburgh Supercomputing Center. 
The second is Microsoft AI for Health. Using the resources of these two 
sponsors, we have more or less accomplished our project goals. Yes, there are 
many ways we can enhance our existing implementations, but those enhancements 
do not require high performance computing systems.

That said, we desire to use these computer systems to the maximum. We literally 
have spare cycles that we can spend creating additional indexes and enhanced 
services against search results.


The Question

Considering your particular library and clientele, if you could index just 
about anything, then what would it be? Examples might include:

  * all items manifested as format Y
  * all items on subject X
  * all items written between dates D and E
  * all items written by author Z
  * all open access materials collected in... I forget the name but it does
    exist
  * anything and everything you own that is also in the HathiTrust
  * the sum of theses, dissertations, books, reports, or papers written at your
    institution

We might not be able to index the content of your library, but your answers to 
the question might apply to libraries in general, and that would be helpful.

Alternatively, can you think of a set of content which is freely available and 
applicable to a wide audience? The open access material alluded to above is a 
good candidate. So would something like the whole of arXiv.

There are some limitations regarding the types of content. For example, the 
content has to be full text in nature; a large set of metadata-only records 
won't really work. Content which is already widely used is better than content 
that is not. Content that is already digitized is a must; there is no time to 
digitize content. Ironically, the content does not have to be thoroughly 
associated with metadata; to some degree the project's system generates or 
extracts the necessary metadata. Finally, content that does not have to be 
scraped from the 'Net is better than content that does; you would be surprised 
how difficult it is to download all the articles of a single issue of an open 
access journal, let alone the whole run of a journal.

Our project has spare cycles, and it behooves us to use them to the fullest 
extent. We are looking for additional content. What might you suggest? If you 
can identify something, then collecting it, pre-processing it, indexing it, and 
providing access to the sum of all those things can be a real and tangible 
output. Do you have any ideas?

"When you have a hammer, everything begins to look like a nail."

--
Eric Lease Morgan
Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame

574/485-6870
