Re: [CODE4LIB] RegEx for public services/public-facing librarians

Eric Lease Morgan Thu, 16 Jan 2020 08:14:20 -0800

On Jan 16, 2020, at 9:43 AM, Mike Monaco <[email protected]> wrote:

> A colleague and I are planning a workshop on using regular expressions and 
> expect an audience of primarily public services librarians. I was hoping 
> other users here could suggest some applications of regex that would be 
> useful for librarians who are *not* working in technical services or IT 
> (where the applications are much more obvious to me). For example, pointing 
> out that some apps and programs, like Google Docs, can use regex for 
> find/replace, web sites or databases that support regex in searches, and so 
> on. Thanks in advance.



Hmmm... Rudimentary searching:

  * amass a set of plain text files, say Project Gutenberg texts
  * articulate an idea ("word") of interest
  * use grep to search for the idea in the set
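
For anyone without grep at hand, the three steps above might be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the grep() helper, the two tiny stand-in files, and the "whale" pattern are all made up for the example and are not part of any library.

```python
import re
import tempfile
from pathlib import Path


def grep(pattern, folder):
    """Return (file name, line number, line) for every line matching pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(folder).glob("*.txt")):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((path.name, number, line.rstrip("\n")))
    return hits


# Hypothetical two-file corpus standing in for a folder of Gutenberg texts.
corpus = tempfile.mkdtemp()
Path(corpus, "a.txt").write_text(
    "Call me Ishmael.\nThe whale surfaced.\n", encoding="utf-8")
Path(corpus, "b.txt").write_text(
    "Whaling is hard work.\nNothing here.\n", encoding="utf-8")

# \b marks a word boundary, so "whale", "whales", and "whaling" all match.
for name, number, line in grep(r"\bwhal(e|es|ing)\b", corpus):
    print(f"{name}:{number}:{line}")
# prints:
# a.txt:2:The whale surfaced.
# b.txt:1:Whaling is hard work.
```

The alternation (e|es|ing) is where the regular expression earns its keep: one pattern covers several forms of the same idea.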

Clean/normalize a corpus:

  * amass a set of plain text files, say Project Gutenberg texts
  * download & install BBEdit
  * use BBEdit's "Multi-File Search..." function and regular expressions to
    do things like remove digits (\d+) from the set or remove two-letter
    "words" from the set (\b\w\w\b)
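
The same two patterns can be tried outside of BBEdit. Here is a minimal sketch in Python, assuming its "re" module treats \d+ and \b\w\w\b the way BBEdit does for plain English text; the clean() function and the sample sentence are invented for illustration.

```python
import re


def clean(text):
    """Strip digit runs and two-character 'words', then tidy the spacing."""
    text = re.sub(r"\d+", "", text)       # remove runs of digits
    text = re.sub(r"\b\w\w\b", "", text)  # remove two-character "words"
    text = re.sub(r"[ \t]+", " ", text)   # collapse the gaps left behind
    return text.strip()


sample = "Page 42 of the 1851 edition is in ye olde style"
print(clean(sample))
# prints: Page the edition olde style
```

Note that \w also matches digits and underscores, so the second pattern removes any two-character token, not just two-letter ones.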

This works wonders on messy OCR output, and as a bonus, the result will be 
much more amenable to topic modeling. I suspect Notepad++ includes similar 
functionality.

From a Linux or Mac OS X command line, count & tabulate all the words & numbers 
in a file (functional, not perfect, and ugly):

  $ cat file.txt | tr -d '\r' | tr '\n' ' ' | \
    tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | \
    tr -s ' ' '\n' | sort | uniq -c | sort -rn | less

Great for beginning to learn the "aboutness" of a file. For extra credit, 
remove stop words.
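
One possible take on the extra credit, sketched in Python rather than as a longer pipeline. It only roughly mirrors the tr/sort/uniq steps above, and the tabulate() function, the sample sentence, and the very short stop list are assumptions made for the example; a real stop list runs to hundreds of words.

```python
import re
from collections import Counter

# A tiny, hypothetical stop list, just enough for the sample below.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def tabulate(text, stop_words=STOP_WORDS):
    """Lowercase, drop punctuation, split on whitespace, count what is left."""
    stripped = re.sub(r"[^\w\s]", "", text.lower())
    words = [word for word in stripped.split() if word not in stop_words]
    return Counter(words).most_common()


sample = "The whale and the sea. The sea, the whale, and the ship of the sea."
for word, count in tabulate(sample):
    print(f"{count:4d} {word}")
# prints:
#    3 sea
#    2 whale
#    1 ship
```

With the stop words gone, the words that remain say far more about the "aboutness" of the text.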

Emphasize how the use of regular expressions is about the syntax ("shape") of 
words, not their semantics. 
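
A small demonstration of that point, sketched in Python with a made-up sentence: the pattern \b\d{4}\b asks only for "four digits standing alone", so it finds years, part of a phone number, and a page number alike.

```python
import re

# Matches the shape "exactly four digits on their own", nothing more.
four_digits = re.compile(r"\b\d{4}\b")

text = "Published in 1851, reprinted in 1922; call 555 1234 or see page 1024."
print(four_digits.findall(text))
# prints: ['1851', '1922', '1234', '1024']
```

The expression has no way of knowing which of those are years; deciding that is left to the person reading the results.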

--
Eric Morgan
