Hi Brian,
Thanks for your response.
> I'd suggest you first make sure your XML is really UTF-8
I believe it is. I used a hex editor to look at the XML source file and the
character in question (the "Registered Sign") is encoded as hex "c2 ae" which
is the proper UTF-8 encoding for that character [1]. Other XML files
processed with the same script also contained non-ASCII characters (in the 520
field, where we are putting the thesis abstracts). I verified that those were
UTF-8 encoded too, and they did not seem to cause any errors. The 520 field
isn't processed any differently in my script (I'm double-checking, natch), so
that's partly why I am confused.
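
For anyone who wants to repeat this check without a hex editor, here is a
minimal Python sketch. The function names are hypothetical, and the sample
bytes are typed in by hand from the 245 title of the test record, not read
from the actual XML file, so treat it as an illustration only.

```python
import unicodedata

REGISTERED_SIGN = "\u00ae"  # U+00AE, the character in question


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Hand-typed sample bytes (an assumption, not the real file contents).
good = b"Situational Leadership\xc2\xae Model"  # Registered Sign as UTF-8
bad = b"Situational Leadership\xae Model"       # same text as Latin-1

print(unicodedata.name(REGISTERED_SIGN), utf8_hex(REGISTERED_SIGN))
# REGISTERED SIGN c2 ae
print(first_invalid_utf8_offset(good))  # None
print(first_invalid_utf8_offset(bad))   # 22
```

To check a real file, the same function could be given the result of
open("myFile.xml", "rb").read(), where myFile.xml is a placeholder name.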
> ...using JHOVE
I was not familiar with JHOVE, but looked it up and it sounds like a very
useful tool [2]. I have downloaded it, and will be trying it out.
-- Michael
[1] FileFormat.Info > Unicode Character 'REGISTERED SIGN' (U+00AE)
http://www.fileformat.info/info/unicode/char/00ae/index.htm
[2] JHOVE - JSTOR/Harvard Object Validation Environment
http://hul.harvard.edu/jhove/
# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/
> -----Original Message-----
> From: Brian Sheppard [mailto:[EMAIL PROTECTED]
> Sent: Thursday, February 21, 2008 1:00 PM
> To: Doran, Michael D
> Cc: [email protected]
> Subject: Re: Help for utf-8 output
>
> I'd suggest you first make sure your XML is really UTF-8, using JHOVE:
>
> /path/to/jhove/jhove -c /path/to/jhove/conf/jhove.conf -m utf8-hul myFile.xml
>
> If it fails you could convert to utf8, on the (perhaps
> unwarranted) assumption it's windows latin1:
>
> iconv -c -f windows-1252 -t UTF-8 myFile.xml > myFile.utf8.xml
>
> Then, of course, test myFile.utf8.xml with jhove to see if it's valid.
>
> -Brian
>
>
> On February 21, at 11:48 AM, Doran, Michael D wrote:
>
> > Hi Jackie,
> >
> > I'm working on a very similar problem... converting theses/
> > dissertations records (in XML) to MARC records. I'm still in the
> > testing stage, but have had similar problems with records with
> > diacritics in the 100 or 245 fields (however diacritics in a 520a
> > field don't seem to cause any problems). Since our records are not
> > "diacritic rich" it's hard to determine the exact extent of the
> > problem.
> >
> > I am using these versions:
> > Perl v5.8.8
> > MARC::Charset 0.98
> > MARC::Lint 1.43
> > MARC::Record 2.0
> > XML::LibXML 1.66
> >
> > Here's an example "bad" record (which I have minimized to just the
> > 245 field):
> >
> > marcdump test.mrc
> > test.mrc
> > LDR 00127cam a2200037 4500
> > 245 13 _aAn Empirical Test Of The Situational Leadership® Model In Japan /
> >        _cRiho Yoshioka.
> >
> > Recs Errs Filename
> > ----- ----- --------
> > 1 1 test.mrc
> >
> > When I run test.mrc through MARC::Lint, I get this message:
> >
> > Invalid record length in record 1: Leader says 00127 bytes but it's actually 125
> > Invalid length in directory for tag 245 in record 1
> > field does not end in end of field character in tag 245 in record 1
> >
> > When examined in vi, the character in question, a Registered Sign,
> > appears to be correctly UTF-8 encoded (C2AE), and the bib Leader
> > (position 09=a) indicates that the record is Unicode encoded. I've
> > attached the MARC record.
> >
> > I noticed that when I run your record (ck245.dat) through MARC::Lint,
> > I get the same invalid record length message:
> >
> > Invalid record length in record 3: Leader says 00567 bytes but it's actually 569
> > field does not end in end of field character in tag 100 in record 3
> > field does not end in end of field character in tag 245 in record 3
> > Invalid indicators ".10" forced to blanks in record 3 for tag 245
> > field does not end in end of field character in tag 260 in record 3
> > Invalid indicators ". " forced to blanks in record 3 for tag 260
> > field does not end in end of field character in tag 300 in record 3
> > Invalid indicators ". " forced to blanks in record 3 for tag 300
> > field does not end in end of field character in tag 502 in record 3
> > Invalid indicators ". " forced to blanks in record 3 for tag 502
> > field does not end in end of field character in tag 504 in record 3
> > Invalid indicators ". " forced to blanks in record 3 for tag 504
> > field does not end in end of field character in tag 690 in record 3
> > Invalid indicators ". 4" forced to blanks in record 3 for tag 690
> >
> > Anybody have any ideas?
> >
> > -- Michael
> >
> > # Michael Doran, Systems Librarian
> > # University of Texas at Arlington
> > # 817-272-5326 office
> > # 817-688-1926 mobile
> > # [EMAIL PROTECTED]
> > # http://rocky.uta.edu/doran/
> >
> >
> >> -----Original Message-----
> >> From: Shieh, Jackie [mailto:[EMAIL PROTECTED]
> >> Sent: Tuesday, February 19, 2008 10:50 AM
> >> To: [email protected]
> >> Subject: Help for utf-8 output
> >>
> >> I was wondering whether anyone has had a similar experience and has
> >> come up with a good solution to the challenge below.
> >>
> >> What I have is an Excel spreadsheet of dissertations, which I have
> >> saved as a tab-delimited file (examining the file in TextPad, the
> >> diacritics appear to be fine). I then read the file in and output it
> >> as a UTF-8 MARC file. I <print> the title field to confirm that, when
> >> the author field contains diacritics, the title shows the proper
> >> indicator values.
> >>
> >> But when I looked at the MARC itself, the fields that follow the field
> >> containing diacritics are all shifted from their original positions.
> >> See the attached zip file. Examples below (the first two have
> >> diacritics in a 100 field; in the last one the diacritic is in 245
> >> subfield b):
> >>
> >> 001 diss 34001
> >> 100 1 _aP<E9>rez, Nancy L.
> >> 245 _aSynchronic and Diachronic Matlatzinkan Phonology.
> >>
> >> 001 diss 34042
> >> 100 1 _aValent<ED>n-M<E1>rquez, Wilfredo
> >> 245 _aDoing being boricua :
> >>
> >> 001 diss 33892
> >> 100 1 _aDavis, Jennifer M.
> >> 245 14 _aThe Functional Complexities of Inherited Cardiac Troponin I Mutations :
> >>        _bIdentification of Ca<B2>+ Independent Contractile Dysfunction.
> >>
> >> I would greatly appreciate any suggestions for solving this.
> >> Thank you most kindly.
> >>
> >> Regards,
> >>
> >> --Jackie
> >>
> >> |Jackie Shieh
> >> |Data Loads & Development
> >> |Harlan Hatcher Graduate Library
> >> |University of Michigan
> >> |920 North University
> >> |Ann Arbor, MI 48109-1205
> >> |Phone: 734.763.6070 FAX: 734.615.9788
> >> |E-mail: JShieh [AT] umich [DOT] edu
> >>
> >> <test.mrc>
>
> --------------------------------------------------
> Brian Sheppard
> University of Wisconsin Digital Collections Center
> [EMAIL PROTECTED] (608) 262-3349
>