I use MarcEdit to view the records, checking whether the mnemonic form of a
diacritic (e.g. {eacute}) appears and what the LDR/09 value is. That's the best
way I've come up with so far. MarcEdit is pretty good at guessing the
character encoding without relying on the LDR/09 value. I think there are
some Perl modules that can "guess" the encoding of the data, but I've never
used them. I'd be interested in hearing about other methods (preferably
automated) for detecting wrong or mixed character encodings in a MARC
record.
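
One automated check that might be worth trying, as a rough sketch only: compare
what LDR/09 claims with whether the raw bytes of the record decode strictly as
UTF-8. It is in Python rather than Perl and uses only the standard library, not
a MARC parser. The function names and result labels are made up for
illustration. It assumes that MARC-8 diacritics (high bytes placed before a
plain base letter) will almost never form valid UTF-8 sequences by accident.

```python
RECORD_TERMINATOR = b"\x1d"  # MARC record terminator byte


def split_records(data: bytes) -> list:
    """Split a raw MARC file into records (hypothetical helper)."""
    return [chunk + RECORD_TERMINATOR
            for chunk in data.split(RECORD_TERMINATOR) if chunk]


def check_ldr09(record: bytes) -> str:
    """Compare LDR/09 with what the record's bytes look like.

    The return labels are illustrative, not any standard vocabulary.
    """
    declared_utf8 = record[9:10] == b"a"  # LDR/09 "a" means UCS/Unicode

    # Pure ASCII is valid in both MARC-8 and UTF-8, so nothing can be
    # concluded from it either way.
    if all(byte < 0x80 for byte in record):
        return "ascii-only"

    try:
        record.decode("utf-8", errors="strict")
        looks_utf8 = True
    except UnicodeDecodeError:
        # At least one byte sequence is not well-formed UTF-8. This also
        # catches records that mix UTF-8 with some other encoding.
        looks_utf8 = False

    if declared_utf8:
        return "utf8-ok" if looks_utf8 else "declared-utf8-but-invalid"
    # LDR/09 is blank (MARC-8). Assumption: MARC-8 data with high bytes
    # very rarely passes a strict UTF-8 decode by chance.
    return "declared-marc8-but-looks-utf8" if looks_utf8 else "marc8-plausible"
```

To scan a file, read it in binary mode and call check_ldr09 on each item
returned by split_records. Note the limits: this cannot tell MARC-8 apart from
Latin-1 or other 8-bit encodings, and a double-encoded record is still
well-formed UTF-8, so it would come back as "utf8-ok".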
Shelley
----- Original Message -----
> From: "Eric Lease Morgan" <[email protected]>
> To: [email protected]
> Sent: Wednesday, March 27, 2013 2:11:26 PM
> Subject: Re: reading and writing of utf-8 with marc::batch [double encoding]
>
>
> On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan <[email protected]>
> wrote:
>
> > When it calls as_usmarc, I think MARC::Batch tries to honor the
> > value set in position #9 of the leader. In other words, if the
> > leader is empty, then it tries to output records as MARC-8, and
> > when the leader is a value of "a", it tries to encode the data as
> > UTF-8.
>
> How can I figure out whether or not a MARC record contains ONLY
> characters from the UTF-8 character set?
>
> Put another way, how can I determine whether or not position #9 of a
> given MARC leader is accurate? If position #9 is an "a", then how
> can I read the balance of the record to determine whether or not all
> the characters really and truly are UTF-8 encoded?
>
> --
> Eric "This Is Almost Too Much For Me" Morgan
>
>