Lintadditions, Errorchecks updates

2004-12-06 Thread Bryan Baldus
I have updated my modules and Web site once again. Changes are listed below,
including a new module, MARC::Lint::CodeData. MARC::Errorchecks' validate008
subroutine has been revised extensively and now reports errors in a way that
is more consistent with the other checking subroutines.

MARC::Lint::CodeData:

Version 1.00 (original version): First release, Dec. 5, 2004.
  -Included in MARC::Errorchecks distribution on CPAN.
  -Used by MARC::Lintadditions.

MARC::Errorchecks:

Version 1.04: Updated Nov. 4-Dec. 4, 2004. Released Dec. 5, 2004.

  -Updated validate008() to use MARC::Lint::CodeData.
  -Removed DATA section, since this is now in MARC::Lint::CodeData.
  -Updated check_008() to use the new validate008().
  -Revised bib. refs. check to require 'reference' to be followed by
optional 's', optional period, and word boundary (to catch things like
'referenced').
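
For readers who want to see how such a pattern behaves, here is a minimal
sketch in Python. It is not the code from MARC::Errorchecks. The pattern is
one reading of the description above ('reference', optional 's', optional
period, word boundary), and the function name and the case-insensitive flag
are assumptions made for illustration.

```python
import re

# Assumed pattern: 'reference', optional 's', optional period, then a word
# boundary. The IGNORECASE flag is an assumption, not stated in the release notes.
BIB_REF = re.compile(r"references?\.?\b", re.IGNORECASE)


def mentions_bib_refs(text: str) -> bool:
    """Return True if text contains 'reference' or 'references' as a whole word."""
    return BIB_REF.search(text) is not None
```

Under this reading, "Includes bibliographical references." matches, while
"referenced" does not, because the 'd' after 'reference' leaves no word
boundary.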

MARC::Lintadditions:

Version 1.06: Updated Nov. 21-24, 2004. Released Dec. 5, 2004.

  -Removed readcodedata(), replaced with a separate data package,
MARC::Lint::CodeData.
  -Updated check_040, check_041 and check_043 to use MARC::Lint::CodeData.
  -Deleted the DATA section based on the above changes.
  -Misc. bug fixes.
  -Reports 13 digit ISBNs as errors pending updating of Business::ISBN to
account for 13 digit ISBNs.

MARC::BBMARC:

Version 1.08: Updated Oct 31, 2004. Released Dec. 5, 2004.

  -New method, as_array, an add-on to MARC::Field which breaks down a
MARC::Field object into a flat array, returns a ref to that array.
  -Misc. cleanup.
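
To illustrate what flattening a field into a flat array could look like,
here is a small Python sketch. It is not the as_array code itself. The
function name is hypothetical, and the layout (tag, two indicators, then
alternating subfield code and value) is an assumption modeled on the
argument order of MARC::Field->new; the real method may differ.

```python
from typing import List, Sequence, Tuple


def field_as_flat_list(
    tag: str, ind1: str, ind2: str, subfields: Sequence[Tuple[str, str]]
) -> List[str]:
    """Flatten a field into [tag, ind1, ind2, code, value, code, value, ...].

    The layout is assumed for illustration only.
    """
    flat = [tag, ind1, ind2]
    for code, value in subfields:
        # Each subfield contributes two consecutive entries: code, then data.
        flat.extend([code, value])
    return flat
```

For example, a 245 field with indicators 1 and 0 and subfields a and b
becomes a seven-element list: tag, two indicators, and two code/value pairs.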


===

Thank you,

Bryan Baldus
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://home.inwave.com/eija


RE: Character sets - kind of solved?

2004-12-06 Thread Doran, Michael D
> One (perhaps large) caveat: as of now all USMARC records are assumed
> to be MARC-8 encoded, and the data within is always run through
> to_utf8/to_marc8 during XML export/import.

The MARC-21 standard allows for either MARC-8 or UCS/Unicode.  Position
09 in the record leader indicates the character encoding: a "blank" for
MARC-8, and an "a" for UCS/Unicode.  Perhaps your patch could test for
this and then only apply the transformation when required.  Note: I
believe the leader itself is limited to characters in the ASCII range,
so you wouldn't have to know the encoding of the record prior to parsing
the leader.
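
As a rough sketch of that test (in Python rather than Perl, and not taken
from the patch), the check only needs to look at one character of the
24-character leader. The function name is hypothetical.

```python
def leader_encoding(leader: str) -> str:
    """Report the character coding scheme declared in leader position 09."""
    if len(leader) != 24:
        raise ValueError("a MARC leader is exactly 24 characters long")
    code = leader[9]  # position 09, counting from 00
    if code == " ":
        return "MARC-8"
    if code == "a":
        return "UCS/Unicode"
    raise ValueError(f"unexpected value in leader position 09: {code!r}")
```

A caller could then run the MARC-8 to UTF-8 transformation only when this
returns "MARC-8" and pass the data through unchanged otherwise.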

> What that means is that
> the records from the problem below (containing UTF8 directly in the
> data, without an encoding marker) would probably break during export
> to XML.

The original record from John Hammer did not contain UTF-8, it contained
MARC-8.  I believe that the fact that the combining MARC-8 characters
were replaced by a generic replacement character only indicates that the
app he was using to view the data (post processing by MARC::Record) was
using a character set in which hex E5 and F2, encoded as single octets,
were not valid characters in that app's character set.  That app's
character set was apparently Unicode (UTF-8) and so E5 and F2 were
replaced by U+FFFD.  That's the long way of saying that the patch should
work fine in his case.  :-)
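
The effect can be reproduced in a few lines. This is only a Python
illustration of the explanation above, not a trace of what John's viewer
actually did: it decodes the MARC-8 octets reported in the thread as if
they were UTF-8, with invalid sequences replaced. The function name is
hypothetical.

```python
# The octets John reported, with E5 and F2 as single MARC-8 combining bytes.
MARC8_NAME = b"Bis\xe5a\xf2t\xe5i, Mu\xf2hammad."


def view_as_utf8(raw: bytes) -> str:
    """Decode raw octets as UTF-8, substituting U+FFFD for invalid sequences."""
    return raw.decode("utf-8", errors="replace")


shown = view_as_utf8(MARC8_NAME)
# Saving the displayed text as UTF-8 writes each U+FFFD as the octets EF BF BD.
saved = shown.encode("utf-8")
```

E5 and F2 are lead bytes in UTF-8, but here each is followed by an ASCII
letter instead of a continuation byte, so each becomes one U+FFFD, and
`saved` contains "ef bf bd" in the four places where the combining
characters used to be.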

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 

> -----Original Message-----
> From: Mike Rylander [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, December 04, 2004 1:31 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Character sets - kind of solved?
> 
> I've run into some record encoding issues myself, though not the
> problem from below.  In any case, this got me thinking about the
> current state of MARC::File::XML, specifically that it could not
> handle MARC8 encoded records.
> 
> I submitted a patch a while back to hack around this, but that just
> lets us get the MARC records into well formed XML.  Basically, it just
> lets you set the encoding on the XML to something that has embedded
> 8-bit characters, like ISO-8859-1, aka LATIN1.
> 
> But that is far from optimal, since the data is being misinterpreted. 
> So I took a look at using MARC::Charset inside MARC::File::XML, and
> I've got a working patch that correctly transcodes records from
> USMARC(MARC-8) to MARC21slim(UTF8) and back again.
> 
> It's attached below, if anyone would be so kind as to test it.  If all
> goes well we should be able to actually use MARC::File::XML in
> production.  If you do decide to test it, it requires MARC::Charset.
> 
> One (perhaps large) caveat: as of now all USMARC records are assumed
> to be MARC-8 encoded, and the data within is always run through
> to_utf8/to_marc8 during XML export/import.  What that means is that
> the records from the problem below (containing UTF8 directly in the
> data, without an encoding marker) would probably break during export
> to XML.
> 
> The attached tarball contains a patched XML.pm and SAX.pm.  Replace
> your current MARC/File/XML.pm and MARC/File/SAX.pm with those and you
> should be good to go.  I've also included the scripts I used to test
> and one of my old MARC8 encoded records.  http://redlightgreen.com
> confirms that the illustrator's name is properly transcoded.
> 
> On Fri, 3 Dec 2004 17:53:32 -0600, Doran, Michael D 
> <[EMAIL PROTECTED]> wrote:
> > First off, Ashley's suggestion that the original encoding was likely
> > MARC-8 is correct.  The author's Arabic name, transliterated into the
> > Latin alphabet, should be "Bis{latin small letter a with macron}{latin
> > small letter t with dot below}{latin small letter i with macron},
> > Mu{latin small letter h with dot below}ammad."  I am basing this on
> > MARC-21 records that can be seen in UCLA's online catalog [1].  So, if
> > the above name is encoded in MARC-8 then the underlying code would match
> > John's original code points [2]:
> >  > >> Looking at the name with a hex editor, it gives, with hex values
> >  > >> in curly brackets,
> >  > >> "Bis{e5}a{f2}t{e5}i, Mu{f2}hammad."
> > 
> > Then the question becomes: "What happened?"
> > 
> >  > >> the name now appears as
> >  > >> "Bis{ef bf bd}a{ef bf bd}t{ef bf bd}i, Mu{ef bf bd}hammad."
> > 
> > The fact that one byte turned into three bytes, suggests UTF-8 encoding.
> > And the fact that *both* MARC-8 combining characters (i.e. "e5" and
> > "f2") now appear as the *same* combination of characters (i.e. "ef bf
> > bd") suggests that it was not an encoding translation from one coded
> > character set to the equivalent codepoint in another character set.  If
> > we assume UTF-8 and convert UTF-8 "ef bf bd" to its Unicode code point,
> > we get U+FFFD [3].  If we look up U+FFFD we see that it is the
> > "REPLACEMENT CHARACTER" [4].
> > 
> > Since MARC::

Updating MARC::File::XML (was Re: Character sets - kind of solved?)

2004-12-06 Thread Mike Rylander
On Mon, 6 Dec 2004 08:54:21 -0600, Doran, Michael D <[EMAIL PROTECTED]> wrote:
> > One (perhaps large) caveat: as of now all USMARC records are assumed
> > to be MARC-8 encoded, and the data within is always run through
> > to_utf8/to_marc8 during XML export/import.
> 
> The MARC-21 standard allows for either MARC-8 or UCS/Unicode.  Position
> 09 in the record leader indicates the character encoding: a "blank" for
> MARC-8, and an "a" for UCS/Unicode.  Perhaps your patch could test for
> this and then only apply the transformation when required.  Note: I
> believe the leader itself is limited to characters in the ASCII range,
> so you wouldn't have to know the encoding of the record prior to parsing
> the leader.

Yeah.  I've got a new version that takes this into account.  The
problem is that MARC::Record on modern Perls (post 5.6) doesn't seem
to work properly with Unicode encoded records, at least not without
some Encode.pm work.  It seems to truncate fields containing combining
octets in cases where there is a valid LATIN1 (well, current system
encoding/locale, actually) version of the character, such as LATIN1
char 0xF8.  This is due to modern Perls "helping" you with string
encoding.  Because of that, I am now "downgrading" all XML Unicode
records to MARC8, though there shouldn't be any loss of data.  I am
now using the Encode module inside ...::XML.pm and ...::SAX.pm to
handle this, but until I get everything fully tested I'll continue
to reencode records to MARC8.  Older Perls (pre 5.6) should not
actually need Encode's help, but it should not hurt in those cases.

> 
> > What that means is that
> > the records from the problem below (containing UTF8 directly in the
> > data, without an encoding marker) would probably break during export
> > to XML.
> 
> The original record from John Hammer did not contain UTF-8, it contained
> MARC-8.  I believe that the fact that the combining MARC-8 characters
> were replaced by a generic replacement character only indicates that the
> app he was using to view the data (post processing by MARC::Record) was
> using a character set in which hex E5 and F2, encoded as single octets,
> were not valid characters in that app's character set.  That app's
> character set was apparently Unicode (UTF-8) and so E5 and F2 were
> replaced by U+FFFD.  That's the long way of saying that the patch should
> work fine in his case.  :-)
> 

I understand.  It wasn't that I was trying to solve that particular
problem, it just got me thinking about MARC::File::XML.  Sorry for any
confusion there.

I'm using File::XML regularly now, and I'm trying to fix it up.  I am
glad that the patch should work with those records, though!

One last note.  I'm rather new to encoding issues as they pertain to
MARC8, since they cannot be implicitly handled by Perl, as other
encodings can be in some cases.  This will be evolving, and I will do
my best not to break anything and to follow the MARC standard, but
IANAL(ibrarian), so be gentle. ;)

Thanks for the pointers, and I'll send more updates here unless
everyone would rather I not. :)

-- 
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org


Re: Character sets - kind of solved?

2004-12-06 Thread John Hammer
On Mon, 6 Dec 2004 08:54:21 -0600
"Doran, Michael D" <[EMAIL PROTECTED]> wrote:

> The original record from John Hammer did not contain UTF-8, it contained
> MARC-8.  I believe that the fact that the combining MARC-8 characters
> were replaced by a generic replacement character only indicates that the
> app he was using to view the data (post processing by MARC::Record) was
> using a character set in which hex E5 and F2, encoded as single octets,
> were not valid characters in that app's character set.  That app's
> character set was apparently Unicode (UTF-8) and so E5 and F2 were
> replaced by U+FFFD.  That's the long way of saying that the patch should
> work fine in his case.  :-)
> 
You are correct in assuming the locale environment is set up for UTF-8 on my 
computer. However, that wouldn't explain why the record is different 
pre-processing vs. post-processing with MARC::Record. Viewing the two records 
with the same app (in this case vi) gives different results, both incorrect.

I tried changing the locale to ISO-8859-1 but that was no help. Does this mean 
I am unable to programmatically modify records that come to me in MARC-8?

An interesting discussion. Thanks to all for your input.


-- 

John C. Hammer, MMus, MLIS
Automation Librarian
Library and Media Services
San Antonio College
1001 Howard St.
San Antonio, TX  78212
(210)733-2669 (v)  (210)733-2597 (f)
  [EMAIL PROTECTED]