[CODE4LIB] wanted: solr and/or relational database experts

Eric Lease Morgan Tue, 24 Mar 2020 10:18:28 -0700


On Mar 20, 2020, at 11:40 AM, Eric Lease Morgan <[email protected]> wrote:


> https://carrels.distantreader.org/library/covid-19/
> https://carrels.distantreader.org/covid.cgi


Wanted: experts in the creation and deployment of Solr indexes and experts in 
relational database technology

The sponsors of the Distant Reader, the good folks at XSEDE [1], are very strongly 
encouraging me to submit a fast-track proposal to: 1) increase the size of the 
Distant Reader's underlying high performance computing system, and 2) apply the 
technology to the growing Coronavirus literature. To that end, I have outlined 
a set of tasks/functions, attached.

But the work requires deeper knowledge of Solr and relational databases than I 
possess. For example, we will want to create a Solr instance to index tens of 
thousands of scholarly journal articles, and possibly just as many sets of 
non-scholarly materials. Similarly, we will want to analyze the literature in 
terms of its ngrams, parts-of-speech, named entities, and grammars. These 
things can/will be saved both in tabular form and in a relational database. 
Much of this work has already been done, but it needs to scale up a few 
notches. We need your help. 
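To make the "tabular forms as well as a relational database" idea concrete, here is a minimal sketch in Python. It is not the Reader's actual code; the table name, column names, and document identifier are assumptions made up for illustration, and SQLite merely stands in for whatever database is ultimately used.

```python
import re
import sqlite3
from collections import Counter


def ngrams(text, n=2):
    """Return a list of word n-grams (as tuples) from lowercased text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def save_ngrams(connection, doc_id, text, n=2):
    """Count the n-grams in one document and store them as rows."""
    # Hypothetical schema: one row per (document, n-gram) pair.
    connection.execute(
        "CREATE TABLE IF NOT EXISTS ngrams "
        "(doc_id TEXT, ngram TEXT, frequency INTEGER)"
    )
    counts = Counter(" ".join(gram) for gram in ngrams(text, n))
    connection.executemany(
        "INSERT INTO ngrams VALUES (?, ?, ?)",
        [(doc_id, gram, freq) for gram, freq in counts.items()],
    )
    connection.commit()


# An in-memory database keeps the sketch self-contained.
connection = sqlite3.connect(":memory:")
save_ngrams(connection, "doc-001", "The virus spreads. The virus mutates.")
rows = connection.execute(
    "SELECT ngram, frequency FROM ngrams "
    "ORDER BY frequency DESC, ngram LIMIT 3"
).fetchall()
print(rows)
```

The same rows could just as easily be written to a tab-delimited file; the database simply makes it easier to ask questions across many documents at once.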

If you would like to apply your skills in a high performance computing 
environment, then you might want to participate in a Zoom meeting scheduled for 
Friday at 9 AM (EDT). Drop me a line to ask questions and/or "raise your hand" 
if you would like to participate in the meeting. 

Be safe.

[1] XSEDE - https://www.xsede.org

--
Eric Lease Morgan
University of Notre Dame


The Distant Reader Meets Covid-19

Below is a list of tasks/functions which we can implement as a part of a 
proposal. Many of these tasks/functions may be implemented concurrently, and 
they are barely prioritized:

 1. Dedicate a single Reader node to serving a full text index of
    virus-related materials, and these materials will initially focus
    on the CORD-19 journal literature dataset. This machine would do
    very nicely at only two or four cores.

 2. Dedicate a few nodes to harvesting and indexing the data for
    Item #1. Indexing is a bit computing heavy.

 3. If Items #1 and #2 are successful, then index additional, but
    more difficult to acquire, journal literature, like content from
    JSTOR or Zotero libraries

 4. Identify the likes of Team JAMS (the good students who won the
    PEARC hack-a-thon), and have these people use the Reader's output
    (ngrams, parts-of-speech, grammars, named-entities, etc.) as the
    input for things like discovering relationships between drugs,
    "correlations of language" between articles, or visualizations of
    the underlying data such as timelines, geographic maps, or network
    diagrams.

 5. Modify the Reader's code to use a biomedical language model
    instead of the existing English language model.

 6. Modify the Reader's code so the feature-extraction tasks are
    more distinct from the report generation tasks, thus we can
    divide and conquer when it comes to report generation.

 7. Work on a subset of the CORD dataset, and get the subset
    working

 8. If Items #5, #6, and #7 are successful, then increase the
    carrel's content to include all the CORD content.

 9. If Item #8 is successful, then include content from Item #3

10. Implement a better, more interactive topic modeling
    interface. Just as everybody likes to search, topic modeling is
    very popular in text studies.

11. Integrate the full text indexing and the topic modeling
    interface into the Reader's study carrel thus creating a coherent
    whole. For example: search index, create subset, and topic model
    it. For example: peruse study carrel, identify thing of interest,
    link it to full text index or topic model, and return thing of
    interest in the context of the original article. Etc.
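As a rough illustration of the indexing in Items #1 and #2, the sketch below builds (but does not send) a request to Solr's JSON update handler. The host, the core name "covid", and the document fields are all assumptions made up for this example; a real index would need a schema designed by someone who really knows Solr.

```python
import json
import urllib.request


def build_solr_update(solr_url, core, documents):
    """Build (but do not send) a POST request for Solr's JSON update handler."""
    # commit=true asks Solr to make the documents searchable right away.
    url = f"{solr_url}/solr/{core}/update?commit=true"
    body = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents; the field names are illustrative only.
docs = [
    {"id": "cord-0001", "title": "An example article", "fulltext": "..."},
    {"id": "cord-0002", "title": "Another example", "fulltext": "..."},
]

# Hypothetical host and core name.
request = build_solr_update("http://localhost:8983", "covid", docs)
print(request.get_method(), request.full_url)

# To actually index the documents against a running Solr instance:
# urllib.request.urlopen(request)
```

Sending documents in batches like this, from a few harvesting nodes, is one way the computing-heavy part of indexing could be spread around.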


What might be needed to do this work? Some of it might include:

 1. A re-allocation of existing cores, thus some systems
    administration

 2. A re-examination of the shared file system because my
    anecdotal observations suggest a lot of time is spent on disk I/O

 3. Hacker(s) who can read delimited files over the 'Net (or a
    relational database file), parse it, ask questions of it, and
    visualize the results.

 4. Content experts who can evaluate the output of everything above.

 5. Time.
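For a sense of what reading a delimited file, parsing it, and asking questions of it might look like, here is a small sketch. The sample data and the column names (doc_id, entity, type) are made up for illustration and are not the Reader's actual output format.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for a tab-delimited file of named entities.
SAMPLE = (
    "doc_id\tentity\ttype\n"
    "doc-001\tWuhan\tGPE\n"
    "doc-001\tWHO\tORG\n"
    "doc-002\tWuhan\tGPE\n"
    "doc-002\tItaly\tGPE\n"
    "doc-003\tWuhan\tGPE\n"
)


def most_common_entities(handle, entity_type, limit=3):
    """Return the most frequent entities of one type from a TSV file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(
        row["entity"] for row in reader if row["type"] == entity_type
    )
    return counts.most_common(limit)


# Question: which places are mentioned most often?
top = most_common_entities(io.StringIO(SAMPLE), "GPE")
print(top)
```

To read a file over the 'Net instead, the handle could come from `io.TextIOWrapper(urllib.request.urlopen(url), encoding="utf-8")`, where `url` points at a real delimited file. The resulting counts are the sort of thing that could then be fed to a charting library.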


Here is a list of nice-to-have items -- people:

 1. Someone who really knows Solr -- the full text indexer of
    choice

 2. Someone who really knows relational databases -- because a
    whole lot of the data is ultimately stored in one

 3. Someone who can write interactive Web-pages to... interact
    with the underlying data in real time.

 4. People to create additional study carrels of related content,
    and then we can meld the resulting carrels together.


The above is only a set of suggestions. I hope they give you ideas, and I hope 
you are available to chat on Friday at 9 o'clock. 

In any event, additional suggestions and reactions are welcome.

--
Eric Morgan and Team Reader
March 23, 2020
