Re: UNICODE character identification

George Milten Tue, 10 Feb 2015 07:10:56 -0800

Yes, this is probably where I was heading too, but I thought there might be a
cleverer way. Also, is there a good Perl normaliser? I have no experience
with:


http://search.cpan.org/~sadahiro/Unicode-Normalize-1.18/Normalize.pm

For starters, if I could spot just the odd letters using the Latin and Greek
regex character classes, I would be more than happy.
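
A minimal sketch of that idea in Perl, using the script properties mentioned in the quoted reply below. The names script_of and odd_characters, the list of scripts, and the tie-breaking rule are assumptions made for illustration. The input is assumed to be already decoded from UTF-8 into Perl characters (for example with Encode::decode), and already split into words.

```perl
use strict;
use warnings;

# Scripts to tell apart, checked in this order. The list is an assumption
# about the data; extend it as needed.
my @SCRIPTS = (
    [ Latin    => qr/\p{Latin}/    ],
    [ Greek    => qr/\p{Greek}/    ],
    [ Cyrillic => qr/\p{Cyrillic}/ ],
);

# Name of the first listed script that a single character matches.
sub script_of {
    my ($ch) = @_;
    for my $pair (@SCRIPTS) {
        my ($name, $re) = @$pair;
        return $name if $ch =~ $re;
    }
    return 'Other';
}

# For one word, return [character, script, majority script] for every
# letter whose script differs from the most common script in that word.
# Ties are broken alphabetically by script name (an arbitrary choice).
sub odd_characters {
    my ($word) = @_;
    my (%count, @letters);
    for my $ch (split //, $word) {
        next unless $ch =~ /\p{L}/;    # letters only: skip digits, punctuation
        my $script = script_of($ch);
        $count{$script}++;
        push @letters, [ $ch, $script ];
    }
    return () unless @letters;
    my ($majority) =
        sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    return map { [ @$_, $majority ] } grep { $_->[1] ne $majority } @letters;
}

# "Plato" typed with a Greek omicron (U+03BF) as the last letter.
binmode STDOUT, ':encoding(UTF-8)';
for my $odd (odd_characters("Plat\x{03BF}")) {
    my ($ch, $script, $majority) = @$odd;
    printf "U+%04X %s is %s in a mostly %s word\n",
        ord($ch), $ch, $script, $majority;
}
```

To report the record identifier as well, the same function could be called for each word of each field while looping over the records. Very short words and genuinely mixed-script words would probably need a manual check, since a majority vote says little for them.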

2015-02-10 17:04 GMT+02:00 Kool,Wouter <[email protected]>:

>  Apologies, I missed the subject line...
>
> Then you might use the regex character classes. For instance, $text =~
> m/\p{Hiragana}/; matches any Japanese Hiragana character. I have not tested
> it, but I suppose /[^\p{Latin}]/ would match any non-Latin character. So you
> find the character class that most characters match and you look for the
> exceptions. Would that help?
>
>
>
>
>
>
>
> *From:* George Milten [mailto:[email protected]]
> *Sent:* Tuesday 10 February 2015 15:56
> *To:* Kool,Wouter
> *Cc:* [email protected]
> *Subject:* Re: UNICODE character identification
>
>
>
> utf-8,
>
>
>
> thank you
>
>
>
> 2015-02-10 16:54 GMT+02:00 Kool,Wouter <[email protected]>:
>
> What encoding is your data in? utf8? Single-byte encoding? Marc8? That
> information matters a lot to determine whether your idea would work. If it
> is in a single-byte encoding there is often no way to determine the script
> the character belongs to.
>
>
>
>
>
> *Wouter Kool*
> Metadata Specialist *·* OCLC B.V.
> Schipholweg 99 *·* P.O. Box 876 *·* 2300 AW Leiden *·* The Netherlands
> t +31-(0)71-524 6500
>
> [email protected] *·* www.oclc.org
>
>
>
>
>
>
>
>
>
>
>
> *From:* George Milten [mailto:[email protected]]
> *Sent:* Tuesday 10 February 2015 13:27
> *To:* [email protected]
> *Subject:* UNICODE character identification
>
>
>
> Hello friendly folks,
>
>
>
> Here is what I am trying to do, and I am looking for your help to find
> the cleverest way to achieve it:
>
>
>
> We have records that include typos like this: take a word, say Plato,
> where the last o was typed with the keyboard set to Greek. We need
> something that would parse all the metadata character by character, work
> out which script the majority of characters in each word belong to, and
> return the odd characters, the script they belong to, and the identifier
> of the record they were found in, so that we can correct them.
>
>
>
> thank you in advance
>
>
>
