UTF-8, thank you.
2015-02-10 16:54 GMT+02:00 Kool,Wouter <wouter.k...@oclc.org>:

> What encoding is your data in? UTF-8? A single-byte encoding? MARC-8? That
> information matters a lot in determining whether your idea would work. If it
> is in a single-byte encoding, there is often no way to determine the script
> a character belongs to.
>
> Wouter Kool
> Metadata Specialist · OCLC B.V.
> Schipholweg 99 · P.O. Box 876 · 2300 AW Leiden · The Netherlands
> t +31-(0)71-524 6500
> wouter.k...@oclc.org · www.oclc.org
>
> From: George Milten [mailto:george.mil...@gmail.com]
> Sent: Tuesday, 10 February 2015 13:27
> To: perl4lib@perl.org
> Subject: UNICODE character identification
>
> Hello friendly folks,
>
> Here is what I am trying to do, and I am looking for your help in finding
> the cleverest way to achieve it.
>
> We have records that include typos like this: a word, say Plato, where the
> last o was typed with the keyboard set to Greek. So we need something that
> would parse all metadata character by character, determine the script that
> the majority of the characters in each word belong to, and return the odd
> characters, the script they belong to, and the identifier of the record
> they were found in, so that we can correct them.
>
> Thank you in advance
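
Since the data is UTF-8, the per-word check described in the quoted message could be sketched roughly as follows. This is only an illustration, not a finished tool, and it is written in Python rather than Perl. The function names `script_of` and `odd_characters` and the record identifier `rec-001` are made up for the example. The script is approximated by taking the first word of each character's Unicode name (for example `GREEK` or `LATIN`), which is a heuristic and not the real Unicode Script property. Words are split on whitespace only, and a tie between scripts is resolved in favour of the script seen first in the word.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name.

    Heuristic only: returns None for non-letters (digits, punctuation,
    combining marks) and for characters without a Unicode name.
    """
    if not ch.isalpha():
        return None
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return None


def odd_characters(record_id, text):
    """Report letters whose script differs from the majority script of their word.

    Each finding is a tuple:
    (record_id, word, position_in_word, character, its_script, majority_script)
    """
    findings = []
    for word in text.split():
        scripts = [(ch, script_of(ch)) for ch in word]
        counts = Counter(s for _, s in scripts if s is not None)
        if len(counts) < 2:
            continue  # word uses a single script (or contains no letters)
        majority = counts.most_common(1)[0][0]
        for pos, (ch, s) in enumerate(scripts):
            if s is not None and s != majority:
                findings.append((record_id, word, pos, ch, s, majority))
    return findings


# "Plato" with a Greek omicron (U+03BF, UTF-8 bytes CE BF) as the last letter.
raw = b"Plat\xce\xbf wrote dialogues"
text = raw.decode("utf-8")  # decode first, then classify characters
print(odd_characters("rec-001", text))
# [('rec-001', 'Platο', 4, 'ο', 'GREEK', 'LATIN')]
```

For real records, the first-word heuristic would probably be replaced by a proper lookup of the Unicode Script property. Perl regular expressions expose that property directly (for example `\p{Script=Greek}`), so the same majority-per-word logic should carry over.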