Module updates and HTML parsing question

Bryan Baldus Wed, 07 Jun 2006 05:42:07 -0700

I have recently posted updates to several modules, as described below (and
at <http://home.inwave.com/eija/>).


As mentioned below, I'm trying to write something to parse the lists of
updated name authority records with closed dates (posted regularly on OCLC's
website <http://www.oclc.org/rss/feeds/authorityrecords/default.htm>). I
don't have much experience working with HTML/XML, so I welcome any
suggestions you may have on the best way to parse these files into a plain
text, non-Unicode, tab-separated file of "old_heading \t new_heading" pairs.
I am not able to install any modules that require compiling, and would like
the solution to work on Mac (Classic) and Windows platforms without having
to be concerned much about character encodings. My plan is to bring each
.htm file up in my Web browser (IE), and then save as a Web page, HTML only,
with the default Unicode (UTF-8) encoding. After saving the files into a
directory, the parsing program will look at each .htm file, pull out the
changed names, and put them into the single plain text file described above.
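
The workflow described above could be sketched in pure Perl roughly as follows. This is only a starting point under a loud assumption: the real markup of the OCLC lists is not shown in this message, so the sketch pretends each change is one table row with the old heading in the first cell and the new heading in the second. The names extract_pairs and parse_directory are hypothetical. Only core modules are used (File::Spec, so that paths also work on Mac Classic).

```perl
use strict;
use warnings;
use File::Spec;

# ASSUMPTION: one <tr> per changed heading, old heading in the first <td>,
# new heading in the second. Adjust the patterns to the real page layout.
sub extract_pairs {
    my ($html) = @_;
    my @pairs;
    while ($html =~ m{<tr\b[^>]*>(.*?)</tr>}gis) {
        my $row   = $1;
        my @cells = $row =~ m{<td\b[^>]*>(.*?)</td>}gis;
        next unless @cells >= 2;
        for (@cells) {
            s/<[^>]+>//g;    # drop any nested tags
            s/\s+/ /g;       # collapse runs of whitespace
            s/^ //;
            s/ $//;
        }
        push @pairs, [ @cells[0, 1] ];
    }
    return @pairs;
}

# Read every .htm/.html file in $dir and write "old \t new" lines to $outfile.
sub parse_directory {
    my ($dir, $outfile) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @files = sort grep { /\.html?$/i } readdir($dh);
    closedir($dh);

    open(my $out, '>', $outfile) or die "Cannot write $outfile: $!";
    my $count = 0;
    for my $file (@files) {
        my $path = File::Spec->catfile($dir, $file);
        open(my $in, '<', $path) or die "Cannot read $path: $!";
        my $html = do { local $/; <$in> };    # slurp the whole file
        close($in);
        for my $pair (extract_pairs($html)) {
            print $out join("\t", @$pair), "\n";
            $count++;
        }
    }
    close($out);
    return $count;
}

# Run as: perl script.pl <directory> <output file>
parse_directory(@ARGV) if @ARGV == 2;
```

Regular expressions are fragile against arbitrary HTML, but they avoid HTML::Parser, which needs a compiler to install. If the saved pages turn out to be regular, this may be enough.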

Thank you for any assistance you may be able to provide,

Bryan Baldus
[EMAIL PROTECTED]
http://home.inwave.com/eija

#########

Module updates:


MARC::Lint.pm:
(in CVS on SourceForge; not yet updated on CPAN)

  -DATA section updated recently to MARC Update no. 6 (Oct. 2005).

#########

MARC::Errorchecks.pm:
(posted to CPAN and on my site)

Version 1.11: Updated June 5, 2006. Released June 6, 2006.

  -Implemented check_006($record) to validate 006 (currently does only a 
length check).

  -Revised validate008($field008, $mattype, $biblvl) to use internal 
sub for material specific bytes (18-34)

  -Revised validate008($field008, $mattype, $biblvl) language code 
(008/35-37) to report new 'zxx' code availability when '   ' (three blanks) 
is the code in the record.

  -Added 'mgmt.' to %abbexceptions for check_nonpunctendingfields($record).
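
As an illustration of the 008/35-37 language-code check described above, here is a minimal sketch. It is not the code in MARC::Errorchecks; the sub name check_008_language and the message wording are hypothetical, and it assumes the 008 is passed in as a plain string.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a message when 008/35-37 is three blanks,
# and nothing otherwise. Positions are zero-based, as in MARC documentation.
sub check_008_language {
    my ($field008) = @_;
    return "008: too short to check language code"
        if length($field008) < 38;
    my $langcode = substr($field008, 35, 3);
    return "008: bytes 35-37 are blank; code 'zxx' is now available"
        if $langcode eq '   ';
    return;
}
```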

#########

MARC::Lint::CodeData.pm:

(Most current version is available through CVS on SourceForge with 
MARC::Lint. Also included in MARC::Errorchecks)

  -Versions 1.05-1.08 were updated with additions of codes from 
technical notices.

#########

Lintadditions.pm:
(available on my site only, since I'm still trying to merge most of 
these checks into MARC::Lint, once I find time to write tests for 
each.)

Version 1.10: Updated Oct. 17, 2005-May 18, 2006. Released June 6, 2006.

  -Added check_024() for UPC and EAN validation. Uses 
Business::Barcode::EAN13 and Business::UPC for these checks.

  -check_042() updated with valid source codes from MARC list for sources.

  -check_050() updated to report cutters not preceded by period.

  -Misc. bug fixes, including turning off uninitialized warnings for 
short 007 bytes.
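
For readers who cannot install Business::Barcode::EAN13 or Business::UPC, the check-digit arithmetic behind UPC and EAN validation can be sketched in plain Perl. This is not what check_024() does internally (it uses the modules named above); the sub name ean13_is_valid is hypothetical. It relies on the fact that a 12-digit UPC-A is an EAN-13 with a leading zero.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $code is a valid EAN-13 or UPC-A number.
sub ean13_is_valid {
    my ($code) = @_;
    $code =~ s/[\s-]//g;                        # ignore spaces and hyphens
    $code = "0$code" if $code =~ /^\d{12}$/;    # UPC-A -> EAN-13
    return 0 unless $code =~ /^\d{13}$/;

    my @digits = split //, $code;
    my $check  = pop @digits;
    my $sum    = 0;
    for my $i (0 .. $#digits) {
        # weights alternate 1, 3, 1, 3, ... from the left
        $sum += $digits[$i] * ($i % 2 ? 3 : 1);
    }
    return ((10 - $sum % 10) % 10 == $check) ? 1 : 0;
}
```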

#########

MARC::Global_Replace.pm:
(available only on my site, still in pre-alpha stage, so in 
/inprocess/ rather than /bryanmodules/)

Version 0.05--Updated May 1, 2006. Released June 6, 2006.

  -Revised identify_changed_hdgs($field, \%heading_data, 
\%changed_hdgs_sub_a) attempting to resolve problem of closed dates 
vs. open.

Version 0.04--Updated Feb. 13, 2006. Unreleased

  -Modified identify_changed_hdgs($field, \%heading_data, 
\%changed_hdgs_sub_a) to not report headings where new and old are 
identical.

  -Need to strip ending periods for match to work!!
  -Testing needed for Sears heading changes--currently appears to fail to
match
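
The two comparison notes above (skip identical headings; strip ending periods before matching) might look something like this. It is a sketch only, not the code in MARC::Global_Replace; normalize_heading and heading_changed are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: trim trailing whitespace and one ending period.
sub normalize_heading {
    my ($heading) = @_;
    $heading =~ s/\s+$//;
    $heading =~ s/\.$//;
    return $heading;
}

# True only when the headings differ after normalization.
sub heading_changed {
    my ($old, $new) = @_;
    return normalize_heading($old) ne normalize_heading($new) ? 1 : 0;
}
```

Note that this also removes the period from headings ending in an abbreviation (e.g. "Inc."). That is harmless as long as both sides of the comparison are normalized the same way.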

#########

Script updates:
(available only on my site, still in pre-alpha stage, so in /inprocess/)

LCSHchangesparserpl107.txt

Version 1.07: Updated May 8, 2006

  -Revised changed heading regex to include "\&" (e.g. AT&T)

Version 1.06: Updated Oct. 5, 2005

  -Added 682 parsing

  -New_tag is set to 682 when headings are extracted from that field

  -Global_Replace will need to take these into account during parsing 
and comparison, since there is a chance that the parsing done by this 
script will produce unexpected/unreliable results.

  -682 parsing is incomplete and will likely fail on headings with
qualifiers.

Version 1.05: Updated Aug. 25, 2005

  -Revised parsing to account for some lines previously counted as bad.

#########

parsedeathdateslists.pl.txt
(available only on my site, in pre-pre-alpha stage, so in /inprocess/)

No version. Very preliminary test code.

  -Help needed in stripping entities other than subfield delimiter.

  -Help needed in selecting best HTML/XML parser for OCLC's closed dates
lists.

  -Requires pure Perl solution (I have no ability to use a compiler or to 
install extra, non-Perl programs, so I can use only modules that came with 
Perl 5.6 or 5.8.0, or that are plain .pm files that can be dropped into the 
site/lib directory).

  -Must be cross-platform and non-Unicode: capable of stripping non-ASCII 
characters without worrying about Mac (Classic) vs. Windows character 
sets.
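
One possible pure-Perl approach to the entity and non-ASCII requirements above is sketched here. It decodes character entities in a single pass, keeps only those that resolve to printable ASCII, and then deletes every remaining byte outside the ASCII range. The names to_ascii and decode_entity are hypothetical. The optional second argument is an assumption of this sketch: a hash mapping entity names (without "&" and ";") to replacement text, which is where the subfield delimiter entity, whatever the lists use for it, could be given special treatment.

```perl
use strict;
use warnings;

# The few named entities that map to ASCII; all others are dropped.
my %named = (amp => '&', lt => '<', gt => '>', quot => '"',
             apos => "'", nbsp => ' ');

# Hypothetical helper: turn one entity body (e.g. "amp", "#65") into text.
sub decode_entity {
    my ($ent, $extra) = @_;
    return $extra->{$ent} if $extra && exists $extra->{$ent};
    return $named{$ent}   if exists $named{$ent};
    my $cp = $ent =~ /^#x([0-9A-Fa-f]+)$/ ? hex($1)
           : $ent =~ /^#(\d+)$/           ? $1
           :                                -1;
    # keep printable ASCII only; everything else becomes an empty string
    return ($cp >= 32 && $cp < 127) ? chr($cp) : '';
}

sub to_ascii {
    my ($text, $extra) = @_;
    # single pass, so "&amp;lt;" becomes "&lt;" rather than "<"
    $text =~ s/&(#x[0-9A-Fa-f]+|#\d+|[A-Za-z][A-Za-z0-9]*);/decode_entity($1, $extra)/ge;
    # delete every byte that is not printable ASCII, tab, or newline
    $text =~ tr/\x20-\x7E\t\n//cd;
    return $text;
}
```

Because the file is treated as bytes, it does not matter whether it was saved as UTF-8 or in a Mac or Windows character set: anything above 0x7E is removed. The cost is that accented letters disappear rather than being reduced to their base letters, so a name containing e-diaeresis loses that letter entirely.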

#########
#########
