Yes, this is probably where I was heading as well, but I thought there might be a cleverer way. Also, is there a good Perl normaliser? I have not had any experience with:
http://search.cpan.org/~sadahiro/Unicode-Normalize-1.18/Normalize.pm

For starters, if I could spot only the odd letters using the Latin and Greek regex character classes, I would be more than happy.

2015-02-10 17:04 GMT+02:00 Kool,Wouter <wouter.k...@oclc.org>:

> Apologies, I missed the subject line...
>
> Then you might use the regex character classes. For instance,
> $text =~ m/\p{Hiragana}/; matches any Japanese Hiragana character. I have
> not tested it, but I suppose /[^\p{Latin}]/ would match any non-Latin
> character. So you find the character class that most characters match and
> you look for the exceptions. Would that help?
>
> *From:* George Milten [mailto:george.mil...@gmail.com]
> *Sent:* Tuesday, 10 February 2015 15:56
> *To:* Kool,Wouter
> *Cc:* perl4lib@perl.org
> *Subject:* Re: UNICODE character identification
>
> utf-8,
>
> thank you
>
> 2015-02-10 16:54 GMT+02:00 Kool,Wouter <wouter.k...@oclc.org>:
>
> What encoding is your data in? utf8? Single-byte encoding? Marc8? That
> information matters a lot in determining whether your idea would work. If
> it is in a single-byte encoding, there is often no way to determine the
> script a character belongs to.
>
> *Wouter Kool*
> Metadata Specialist · OCLC B.V.
> Schipholweg 99 · P.O. Box 876 · 2300 AW Leiden · The Netherlands
> t +31-(0)71-524 6500
> wouter.k...@oclc.org · www.oclc.org
>
> *From:* George Milten [mailto:george.mil...@gmail.com]
> *Sent:* Tuesday, 10 February 2015 13:27
> *To:* perl4lib@perl.org
> *Subject:* UNICODE character identification
>
> Hello friendly folks,
>
> Here is what I am trying to do, and I am looking for your help in finding
> the cleverest way to achieve it:
>
> We have records that include typos like this: take a word, say Plato,
> where the last o was typed with the keyboard set to Greek. So we need
> something that would parse all metadata character by character, determine
> which script the majority of the characters in each word belong to, and
> return the odd characters, the script they belong to, and the identifier
> of the record they were found in, so that we can correct them.
>
> thank you in advance
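
For reference, the majority-script idea discussed in this thread could be sketched roughly as follows. This is only a minimal illustration, not a tested solution for real catalogue data: the names script_of and odd_characters, the record identifier rec-001, and the short list of scripts are all assumptions made up for the example. It also assumes the text has already been decoded from UTF-8 into Perl character strings, since the \p{...} classes only work reliably on decoded text.

```perl
use strict;
use warnings;

# Scripts to test, in order. Illustrative assumption: extend as needed.
my @SCRIPTS = (
    [ Latin    => qr/\p{Latin}/    ],
    [ Greek    => qr/\p{Greek}/    ],
    [ Cyrillic => qr/\p{Cyrillic}/ ],
);

# Hypothetical helper: name of the first listed script matching $char,
# or 'Other' if none of them match.
sub script_of {
    my ($char) = @_;
    for my $entry (@SCRIPTS) {
        my ($name, $re) = @$entry;
        return $name if $char =~ $re;
    }
    return 'Other';
}

# Hypothetical helper: for one word (a decoded character string), return
# a list of [char, script, offset] for letters outside the majority script.
sub odd_characters {
    my ($word) = @_;
    my (%count, @letters);
    my $pos = 0;
    for my $char (split //, $word) {
        if ($char =~ /\p{L}/) {          # only letters take part in the vote
            my $script = script_of($char);
            $count{$script}++;
            push @letters, [ $char, $script, $pos ];
        }
        $pos++;
    }
    return () unless @letters;
    # Most frequent script wins; ties are broken alphabetically.
    my ($majority) =
        sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    return grep { $_->[1] ne $majority } @letters;
}

# Made-up record: identifier => text. The last letter of "Plato" is
# U+03BF GREEK SMALL LETTER OMICRON, written as an escape so it is visible.
my %records = ( 'rec-001' => "Dialogues of Plat\x{3BF}" );

binmode STDOUT, ':encoding(UTF-8)';
for my $id (sort keys %records) {
    for my $word (split ' ', $records{$id}) {
        for my $odd (odd_characters($word)) {
            my ($char, $script, $offset) = @$odd;
            printf "%s\t%s\toffset %d\tU+%04X\t%s\n",
                $id, $word, $offset, ord $char, $script;
        }
    }
}
```

Run as it stands, this should report one odd character for rec-001: the Greek omicron (U+03BF) at offset 4 of the last word. Words with an even split between two scripts, or words of a single character, would need a policy of their own; the alphabetical tie-break here is just a placeholder.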