A very elegant solution! I was thinking it might be possible to do something 
similar in OpenRefine with a Text Facet, if the data is in a format OpenRefine 
can understand and if it is not too large.


Marijane White, M.S.L.I.S. 
Data and Research Engagement Librarian, Assistant Professor 
Oregon Health & Science University Library 

Email: whi...@ohsu.edu 
ORCiD: https://orcid.org/0000-0001-5059-4132 



On 1/4/24, 9:11 AM, "Code for Libraries on behalf of Eric Lease Morgan" 
<CODE4LIB@LISTS.CLIR.ORG on behalf of 
00000107b9c961ae-dmarc-requ...@lists.clir.org> wrote:


On Jan 4, 2024, at 11:26 AM, Alison Clemens <alison.clem...@gmail.com> wrote:


> Has anyone here done text analysis-type work on MARC data, particularly on
> topical subject headings? I work closely with my library's digital
> collections, and I am interested in seeing what kinds of topics (as
> indicated in our descriptive data) are represented in our
> digital collections. So, I have the corresponding MARCXML for the
> materials and have extracted the 650s as a string (e.g., 650 $a World War,
> 1914-1918 $x Territorial questions $v Maps), but I'm a little stuck on how
> to meaningfully analyze the data. I tried feeding the data into Voyant, but
> I think it's too large of a corpus to run properly there, and regardless,
> the MARC data is (of course) delimited in a specific way.
> 
> Any / all perspectives or experience would be welcome -- please do get in
> touch directly (at alison.clem...@gmail.com), if you'd like.
> 
> --
> Alison Clemens
> Beinecke Rare Book and Manuscript Library, Yale University




The amount of available content is kinda small, given the brevity of the 
values in 6xx; the number of records may be large, but the number of resulting 
words is small. That said, I can think of a number of ways such analysis can be 
done. The process can be boiled down to four very broad steps:


1) articulating more thoroughly what questions you want to ask of the MARC
2) distilling the MARC into one or more formats amenable to a given 
modeling/analysis process
3) modeling/analyzing the data
4) evaluating the results


For example, suppose you simply wanted to know the frequency of each FAST 
subject heading. I would loop through each 6xx field in each MARC record, 
extract the given subjects, parse the values into FAST headings, and output the 
result to a file. You will then have a file looking something like this:


United States
World War, 1914-1918
Directories
Science, Ancient
Maps
Librarians
Origami
Science, Ancient
Origami
Maps
Philosophy
Dickens, Charles
World War, 1914-1918
Territorial questions
Maps
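

Here is a minimal sketch of that extraction step in Python, assuming pymarc is 
installed and the MARCXML lives in a file named records.xml; the filename and 
the choice of tags and subfields are illustrative assumptions, not gospel:


#!/usr/bin/env python

# extract.py - output each 6xx subfield value on a line of its own

from pymarc import parse_xml_to_array

# read the whole MARCXML file into memory; fine for modest-sized sets
records = parse_xml_to_array('records.xml')

# loop through each record, each 6xx field, and each subfield of
# interest; to limit the output to FAST headings only, one would also
# check that a field's second indicator is 7 and its $2 value is "fast"
for record in records:
    for field in record.get_fields('600', '610', '650', '651'):
        for value in field.get_subfields('a', 'x', 'v'):
            print(value)


Run the script with something like "python extract.py > headings.txt" to 
produce the file used below.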


Suppose the file is named headings.txt. You can then sort the list and use the 
Linux uniq command to count and tabulate each heading. Pipe the result back 
through sort, and you will end up with a groovy frequency list. The command 
will look something like this:


cat headings.txt | sort | uniq -c | sort -rn


Here is the result:


3 Maps
2 Science, Ancient
2 Origami
1 World War, 1914-1918
1 Territorial questions
1 Philosophy
1 Librarians
1 Directories
1 Dickens, Charles


Such a process will give you one view of your data. Relatively quick and easy.


Suppose you wanted to extract latent themes from the content of MARC 6xx. This 
is sometimes called "topic modeling", and MALLET is the granddaddy of topic 
modeling tools. Loop through each 6xx field of your MARC records, extract the 
headings, and for each MARC record, create a plain text file containing the 
data. In the end you will have thousands of tiny plain text files. You can then 
run MALLET against the files, and the result will be a set of weighted themes 
-- "topics". For extra credit, consider adding the values of 245, 1xx, and 5xx 
to your output. If each plain text file is associated with a metadata value 
(such as date, collection, format, etc.), then the resulting topic model can be 
pivoted, and you will be able to observe how the topics compare to the metadata 
values. For example, you could answer questions like "For items in these 
formats, what are the most frequent topics?" or "How have our subjects ebbed & 
flowed over time?"
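

Assuming MALLET is installed with its bin directory on your PATH, and assuming 
the tiny files have been saved in a (hypothetical) directory named corpus, the 
modeling process looks something like the two commands below; the number of 
topics is a parameter you will want to experiment with:


# transform the directory of files into MALLET's binary format
mallet import-dir --input ./corpus --output ./corpus.mallet --keep-sequence --remove-stopwords

# model a dozen topics and save the results
mallet train-topics --input ./corpus.mallet --num-topics 12 --output-topic-keys ./keys.txt --output-doc-topics ./topics.txt


The resulting keys.txt file enumerates the topics and their most heavily 
weighted words, and topics.txt lists the proportion of each topic in each 
document; the latter is the file you would join against your metadata in order 
to pivot the model.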


I do this sort of work all the time; what you are describing is a very large 
part of my job. Here in our scholarship center people bring me lots o' content, 
and I use processes very much like the ones outlined above to help them use & 
understand it.



Fun!


--
Eric Morgan <emor...@nd.edu>
Navari Family Center for Digital Scholarship
University of Notre Dame


574/631-8604


