Re: Perl Unidecode modules - which to use (if not Text::Unidecode)?

L A Walsh Mon, 05 Apr 2021 02:26:49 -0700

On 2021/04/04 14:26, Joel Rees via Cygwin wrote:

1. What perl Unicode modules should I consider, if not Text::Unidecode?
The present need is to be able to convert those few "foreign" characters
(like ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ) that are basically ASCII with accent marks
to their closest ASCII equivalents, but I'd like to do more with Unicode in
the future, without going down any dead-ends as far as being able to run
under cygwin is concerned.


"Stripping those few foreign accent characters" is probably not really what
you want to do.

----
   Why not?  You don't know his use case and you are misinterpreting his
example as random garbage.

Those aren't a random foreign encoding -- those are C's and G's, then E, I,
and O, with accent variations that he may want to collapse for purposes of
storing in a text storage and retrieval (search) application.  They are all
well-formed, well-coded UTF-8 characters -- they are not some 8-bit encoding
that was mangled by being displayed, without recoding, in a UTF-8 context.
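
To make "collapsing accent variations" concrete, here is a minimal sketch.
It is not taken from this thread and does not use Text::Unidecode; it
assumes the core Unicode::Normalize module instead. The helper name
fold_accents is hypothetical. The idea is to decompose each character into
base letter plus combining marks (NFD) and then drop the marks:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                           # this source file contains UTF-8 literals
use Unicode::Normalize qw(NFD);     # ships with core Perl

# Hypothetical helper: fold accented Latin letters to their base letters.
sub fold_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);    # e.g. "Ç" becomes "C" followed by U+0327
    $decomposed =~ s/\p{Mn}+//g;    # remove the nonspacing combining marks
    return $decomposed;
}

binmode(STDOUT, ':encoding(UTF-8)');   # avoid "Wide character" warnings
my $sample = "ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ";
printf "%s -> %s\n", $sample, fold_accents($sample);
```

For the sample string from the original question this should print the
input followed by "CCCCcccGGGGggggEIIIIOOOO". Note that this only handles
letters that have a canonical decomposition; something like "ø" passes
through unchanged, and non-Latin text is not transliterated at all.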

I didn't know about Text::Unidecode -- but it exists specifically to create
Latinized alternatives to foreign characters.  That was another hint
that it wasn't a random mistake.  The manpage for it says:

      It often happens that you have non-Roman text data in Unicode, but
      you can't display it -- usually because you're trying to show it to a
      user via an application that doesn't support Unicode, or because the
      fonts you need aren't accessible. You could represent the Unicode
      characters as "???????" or "\15BA\15A0\1610...", but that's nearly
      useless to the user who actually wants to read what the text says.

An example was like:

tperl
use utf8;
use Text::Unidecode;
my $name="\x{5317}\x{4EB0}";

printf "name, %s == %s\n", $name, unidecode($name);
'
name, 北亰 == Bei Jing

It's not just about removing accents, but about getting an English-like
transliteration of the foreign text.

All of the characters he used as examples were well-coded UTF-8
characters --

Those "accent characters" are misinterpreted foreign encoding (likely not
to be Unicode) characters. Simply "stripping" the "accent characters" will
basically convert them to truly meaningless junk. I suppose the meaningless
junk can then be interpreted by the reader as "used to be a foreign
word here", but why bother contributing further to information entropy?

2. I see some talk of Internationalization in Chapter 2 of "Setting up
Cygwin", but cannot see anything relating to perl modules, and I don't see
any easy way to search many months of the mailing list for a keyword... is
there any information I should know about?



Have you read the perldoc on internationalization?
--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple

