In the end I did pretty much as suggested, using wget and re-using session IDs. I created a bash script that gets a session ID, reads the list of ISBNs, and then tries to retrieve the info for each one. If a retrieval comes back with a "session expired" page, the script gets a new session ID and retries, giving up after ten failed attempts. It also does a decent job of writing the retrieved records out in CSV format for easy import into a database or XML.
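One thing worth noting about the CSV step: titles and author names can contain commas or double quotes, which can confuse an importer unless every field is quoted and embedded quotes are doubled. Here is a minimal sketch of that quoting, separate from my script. The function names csv_field and csv_row are made up for illustration, and the sample values are placeholders rather than real LOC records.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only; not part of getLOCinfo.sh.

# Quote one CSV field: double any embedded double quotes, then wrap the
# whole value in double quotes.
csv_field () {
  local value=$1
  value=${value//\"/\"\"}
  printf '"%s"' "$value"
}

# Print all arguments as one comma-separated, fully quoted CSV row.
csv_row () {
  local first=1 field
  for field in "$@"
  do
    if [ $first -eq 0 ]; then printf ','; fi
    csv_field "$field"
    first=0
  done
  printf '\n'
}

# Placeholder values, chosen to show the comma and quote handling:
csv_row "0000000000" 'A "Quoted" Title' "Author, Example"
```

With the placeholder row above this should print "0000000000","A ""Quoted"" Title","Author, Example" on one line, which most CSV importers will read back as three fields.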
The script and my list of 25 test ISBNs are included below. Interestingly, about five of them (20%) come back with no record found. If I try to do anything fancier, I will learn how to query the MARC system directly; the LOC site has a lot of information available. I appreciate all of the help and suggestions I received.

#!/bin/bash

#*******************************************#
#               getLOCinfo.sh               #
#                                           #
#  A script to read a list of ISBN numbers  #
#  from an input file, and to retrieve the  #
#  LOC info for that item from the LOC web  #
#  search form.                             #
#                                           #
#  The input file is expected to contain    #
#  a single line of ISBN numbers separated  #
#  by whitespace.  Alternatively, the file  #
#  can contain one ISBN per line as long as #
#  all but the final line ends with white-  #
#  space followed by a backslash (actually  #
#  I think all lines can end that way).     #
#*******************************************#

# Script Constants:
BASE_URL="http://www.loc.gov/cgi-bin/zgate"
E_BAD_ARGS=65
E_BAD_FILE=66
E_NO_SESSION_ID=67
NUM_ARGS=2
NUM_EXPIRED=10
SUCCESS=0

# Script variables:
expired_count=0
result="Your session has expired"
result_url=$BASE_URL
session_url=$BASE_URL

# A function to get a new sessionid:
GetSessionID ()
{
  session_url=$BASE_URL"?ACTION=INIT&FORM_HOST_PORT=/prod/www/data/z3950/"
  session_url=$session_url"locils2.html,z3950.loc.gov,7090"
  sessionid=$(wget "$session_url" -o /dev/null -O - | \
              grep SESSION_ID | \
              cut -d "\"" -f4)
  if [ -z "$sessionid" ]
  then
    echo "Unable to get session ID. Exiting"
    exit $E_NO_SESSION_ID
  fi
}

# A function to "build" the request URL:
#   $1 is the session ID, $2 is the ISBN to search for.
BuildURL ()
{
  url=$BASE_URL"?ACTION=SEARCH&DBNAME=VOYAGER&ESNAME=B&MAXRECORDS=20&"
  url=$url"RECSYNTAX=1.2.840.10003.5.10&REINIT=/cgi-bin/zgate?ACTION=INIT&"
  url=$url"FORM_HOST_PORT=/prod/www/data/z3950/locils2.html,z3950.loc.gov,"
  url=$url"7090&srchtype=1,1016,2,102,3,3,4,2,5,100,6,1&SESSION_ID=$1&"
  url=$url"TERM_1=$2"
}

# Make sure file names were supplied when the script was called:
if [ $# -ne $NUM_ARGS ]
then
  echo "ERROR: Incorrect number of parameters supplied. Exiting..."
  exit $E_BAD_ARGS
fi

# Make sure the input file exists and is not empty:
if [ ! -f "$1" ] || [ ! -s "$1" ]
then
  echo "ERROR: $1 not found or is an empty file. Exiting..."
  exit $E_BAD_FILE
fi

# Truncate the output file if necessary:
if [ -s "$2" ]
then
  echo -n "Warning: $2 exists and is not empty. Continue [y/N]? "
  read input
  # Quoted so that an empty answer (just Enter) counts as "no":
  if [ "$(echo "$input" | tr A-Z a-z)" != "y" ]
  then
    echo "Please provide a valid output file name"
    exit $E_BAD_FILE
  fi
  cat /dev/null > "$2"
fi

# Get a session ID:
GetSessionID

# Read the file contents:
read isbn_list < "$1"

for isbn in $isbn_list
do
  BuildURL "$sessionid" "$isbn"
  result=$(wget "$url" -o /dev/null -O - | tr "\n" " ")

  # If the session has expired, get a new one and retry:
  while [ -n "$(echo $result | sed -n -e '/Your session has expired/Ip')" ] &&
        [ $expired_count -lt $NUM_EXPIRED ]
  do
    let "expired_count+=1"
    GetSessionID
    BuildURL "$sessionid" "$isbn"
    result=$(wget "$url" -o /dev/null -O - | tr "\n" " ")
  done

  if [ $expired_count -eq $NUM_EXPIRED ]
  then
    echo "Unable to get session ID. Exiting"
    exit $E_NO_SESSION_ID
  else
    expired_count=0
  fi

  if [ -n "$(echo $result | sed -n -e '/No records matched your query/Ip')" ]
  then
    # Print the not found message to stderr:
    echo "$isbn: No record found" >&2
  else
    # Pull the record out of the <pre> block and turn it into CSV fields:
    echo -n "\"$isbn\"," >> "$2"
    echo $result | sed -n -e 's/.*<pre>\(.*\)<\/pre>.*/\1/Ip' | \
      sed -e 's/ \+/ /g' | \
      sed -e 's/^Author: /"/' | \
      sed -e 's/\., [0-9]\{4\}-[0-9]\{0,4\} \(Title: \)/. \1/' | \
      sed -e 's/\. Title: /","/' | \
      sed -e 's/\. Published: /","/' | \
      sed -e 's/, c\([0-9]\{4\}\)\. LC Call No.: /","\1","/' | \
      sed -e 's/ *$/"/' \
      >> "$2"
  fi
done

exit $SUCCESS

##### ISBN List: ###############################################################

0805375651 \
0314027157 \
0201087987 \
9780980232714 \
0131774115 \
0789731274 \
1874416656 \
1886411484 \
9780425238981 \
0070726922 \
0495011622 \
1565927699 \
0673524841 \
0721659659 \
9781847991683 \
0596100795 \
0596001584 \
9780980455205 \
0835930513 \
9780954452971 \
0619121475 \
9780321553577 \
0130424110 \
0201612445 \
9780123705488
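A note on the input-file format described in the script header: the one-ISBN-per-line layout works because a plain read (without -r) treats a backslash followed by a newline as a line continuation, so the whole file arrives as a single logical line. The sketch below is a stand-alone demonstration of that behaviour. The function name read_isbn_list and the dummy ten-digit values are made up for the demo, and I am assuming bash here.

```shell
#!/bin/bash
# Stand-alone demo; read_isbn_list is a hypothetical helper, not part of
# getLOCinfo.sh.

# Print the first logical line of file $1, joining backslash-continued lines.
read_isbn_list () {
  local list
  # read returns non-zero when it reaches end-of-file before an unescaped
  # newline (for example when the last line also ends in a backslash),
  # but it still fills $list, so the status is ignored here.
  read list < "$1" || true
  printf '%s\n' "$list"
}

tmpfile=$(mktemp)

# Three dummy values, every line ending in whitespace plus a backslash:
printf '%s \\\n' 1111111111 2222222222 3333333333 > "$tmpfile"

isbns=( $(read_isbn_list "$tmpfile") )
echo "${#isbns[@]} ISBNs read"

rm -f "$tmpfile"
```

This should report 3 ISBNs read, which suggests the final line may end in a backslash as well. With read -r the backslashes would be kept literally and only the first line would be read, so the script depends on the default behaviour.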

