Doran, Michael D wrote:
> Hi Henri,
>
>> Is there a reason why MARC::File::XML considers only a very
>> strict subset of utf-8 as valid ?
>
> I would guess that it has to do with adhering to the MARC-21 repertoire of
> characters, so as to facilitate the round-trip conversion between the MARC-8
> and Unicode character sets [1,2]. At some point in the future the MARC-21
> repertoire will be decoupled from what was defined for MARC-8.
>
>> For instance no linebreak...
>
> Control characters such as line breaks are a bit of a different issue. The
> MARC-21 standard currently allows for only a handful of control characters,
> not including (as you have discovered) the line break [3].
>
>> This could be a really BIG trouble for kanjis or hindu languages imho.
>
> The MARC-21 repertoire of characters includes East Asian Ideographs (Han),
> Japanese Hiragana and Katakana, and Korean Hangul [4,5]. I don't believe that
> Indic scripts in the vernacular would be valid MARC-21 characters.
>
> Are you finding any cases where the MARC::File::XML parser is dropping valid
> MARC-21 characters?

Hi Michael, and thanks for your answer and for all the links you pointed to.
But this puzzles me. Indeed, IMHO, and Paul agrees with me, I would rather keep all the characters used by a customer than modify or drop data. The problem is that a French library, or any non-US-MARC library, does not HAVE to use MARC-21 characters. They use ISO 6937 or ISO 5426, or Latin-1, or even UTF-8 directly. So for them, having valid MARC-21 characters is not the goal. They want to keep their data safe.

Would it not be possible to add a feature to M::F::X that would allow people to collect UTF-8 data as such, without checking it against MARC-21?

--
Henri Damien LAURENT and Paul POULAIN
Independent consultants in free software and library science
(http://www.koha-fr.org)