Doran, Michael D wrote:
> Hi Henri,
>
>   
>> Is there a reason why MARC::File::XML considers only a very
>> strict subset of UTF-8 as valid?
>>     
>
> I would guess that it has to do with adhering to the MARC-21 repertoire of 
> characters, so as to facilitate the round-trip conversion between the MARC-8 
> and Unicode character sets [1,2].  At some point in the future the MARC-21 
> repertoire will be decoupled from what was defined for MARC-8.
>   
>> For instance no linebreak...
>>     
> Control characters such as line breaks are a bit of a different issue.  The 
> MARC-21 standard currently allows for only a handful of control characters, 
> not including (as you have discovered) the line break [3].
>   
>> This could be really BIG trouble for kanji or Indic languages, imho.
>>     
> The MARC-21 repertoire of characters includes East Asian Ideographs (Han), 
> Japanese Hiragana and Katakana, and Korean Hangul [4,5].  I don't believe that 
> Indic scripts in the vernacular would be valid MARC-21 characters. 
>
> Are you finding any cases where the MARC::File::XML parser is dropping valid 
> MARC-21 characters?
>   
Hi Michael,
Thanks for your answer, and for all the links you pointed to.

But this puzzles me.
Imho, and Paul agrees with me, I would rather keep all the
characters used by a customer than modify or drop data.

The problem is that French libraries, or any non-US-MARC libraries, do not
HAVE to use MARC-21 characters. They use ISO 6937 or ISO 5426, or Latin-1,
or even UTF-8 directly.
So having valid MARC-21 characters is not their goal. They want
to keep their data safe.

Would it not be possible to add a feature to M::F::X that would allow
people to accept UTF-8 data as is, without checking it against MARC-21?

-- 
Henri Damien LAURENT et Paul POULAIN
Independent consultants
in free software and library science (http://www.koha-fr.org)

