On Jan 16, 2020, at 9:43 AM, Mike Monaco <[email protected]> wrote:
> A colleague and I are planning a workshop on using regular expressions and
> expect an audience of primarily public services librarians. I was hoping
> other users here could suggest some applications of regex that would be
> useful for librarians who are *not* working in technical services or IT
> (where the applications are much more obvious to me). For example, pointing
> out that some apps and programs, like Google Docs, can use regex for
> find/replace, web sites or databases that support regex in searches, and so
> on. Thanks in advance.
Hmmm... Rudimentary searching:
* amass a set of plain text files, say Project Gutenberg texts
* articulate an idea ("word") of interest
* use grep to search for the idea in the set
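A minimal sketch of the searching steps above. The two sample files and the pattern are made up for illustration, so point grep at a real directory of texts instead. The -w option (match whole words only) is an extension, but both GNU and BSD grep appear to support it.

```shell
# Build a tiny stand-in corpus (hypothetical; use real Gutenberg texts instead)
corpus=$(mktemp -d)
printf 'The librarian opened the library.\n' > "$corpus/a.txt"
printf 'No books here.\n' > "$corpus/b.txt"

# -r recurse into the directory, -i ignore case, -w whole words only,
# -l list only the names of matching files, -E extended regular expressions
matches=$(grep -r -i -w -l -E 'librar(y|ies|ian|ians)' "$corpus")
echo "$matches"
```

Drop the -l to see the matching lines themselves, or swap it for -c to get a per-file count.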
Clean/normalize a corpus:
* amass a set of plain text files, say Project Gutenberg texts
* download & install BBEdit
* use BBEdit's "Multi-File Search..." function and regular expressions to do
things like remove digits (\d+) from the set or remove two-letter "words" from
the set (\b\w\w\b)
Such cleanup works wonders against ugly OCR, and as a bonus, the result will be
much more amenable to topic modeling. I suspect Notepad++ includes similar
functionality.
From a Linux or Mac OS X command line, count & tabulate all the words & numbers
in a file (functional, not perfect, and ugly):
$ cat file.txt | tr -d '\r' | tr '\n' ' ' | \
tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | \
tr ' ' '\n' | sort | uniq -c | sort -rn | less
Great for beginning to learn the "aboutness" of a file. For extra credit,
remove stop words.
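One possible take on the extra credit, as a sketch only: the sample text and the four-word stop list are invented, and a real stop list would be much longer. The approach is to put one word per line and then have grep discard every line that exactly matches an entry in the list.

```shell
# Invented sample text and a tiny, hypothetical stop word list (one per line)
work=$(mktemp -d)
printf 'The whale and the ship. The whale swam; the ship sank, and the whale won.\n' > "$work/file.txt"
printf '%s\n' the and a of > "$work/stop.txt"

# Lowercase, strip punctuation, put one word per line, then drop stop words:
# -v invert the match, -x match whole lines only, -F fixed strings,
# -f read the patterns from a file
counts=$(tr '[:upper:]' '[:lower:]' < "$work/file.txt" | tr -d '[:punct:]' | \
  tr -s '[:space:]' '\n' | grep -v -x -F -f "$work/stop.txt" | \
  sort | uniq -c | sort -rn)
echo "$counts"
```

The stop list must not contain blank lines, because an empty pattern matches every line and grep -v would then throw everything away.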
Emphasize how the use of regular expressions is about the syntax ("shape") of
words, not their semantics.
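A small illustration of shape versus semantics, using an invented sentence. The pattern below means "four digits in a row". It finds the year, but it also finds the page count, because the two have the same shape.

```shell
# Invented sample sentence containing a year and a page count
line='Printed in 1851, 1024 pages.'

# -o prints only the matched text, one match per line
hits=$(printf '%s\n' "$line" | grep -o -E '[0-9]{4}')
echo "$hits"
```

Telling the year from the page count takes context, which a regular expression does not have.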
--
Eric Morgan