date:20210405

Re: Perl Unidecode modules - which to use (if not Text::Unidecode)?

2021-04-05 Thread L A Walsh


On 2021/04/04 14:26, Joel Rees via Cygwin wrote:



>> 1. What perl Unicode modules should I consider, if not Text::Unidecode?
>> The present need is to be able to convert those few "foreign" characters
>> (like ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ) that are basically ASCII with accent
>> marks to their closest ASCII equivalents, but I'd like to do more with
>> Unicode in the future, without going down any dead-ends as far as being
>> able to run under cygwin is concerned.
>
> "Stripping those few foreign accent characters" is probably not really what
> you want to do.
  


Why not?  You don't know his use case and you are misinterpreting his
example as random garbage.

Those aren't a random foreign encoding -- those are C's, G's, then E, I, O
with accent variations that he may want to collapse for purposes of storing
in a text storage and retrieval (search) application.  They are all
well-formed, well-coded UTF-8 characters -- they are not some 8-bit encoding
that was remangled during a no-recoding display of them in a UTF-8
context.

I didn't know about Text::Unidecode -- but it exists specifically to create
Latinized alternatives to foreign characters.  That was another hint
that it wasn't a random mistake.  The manpage for it says:

  It often happens that you have non-Roman text data in Unicode, but you
  can't display it -- usually because you're trying to show it to a user
  via an application that doesn't support Unicode, or because the fonts
  you need aren't accessible.  You could represent the Unicode characters
  as "???" or "\15BA\15A0\1610...", but that's nearly useless to the
  user who actually wants to read what the text says.

An example was like:

tperl
use utf8;
use Text::Unidecode;
my $name="\x{5317}\x{4EB0}";

printf "name, %s == %s\n", $name, unidecode($name);
'
name, 北亰 == Bei Jing

It's not just about removing accents but about getting an English-like
transliteration of the foreign text.
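
For the narrower accent-stripping case, a minimal sketch using only
Unicode::Normalize (which ships with core perl) might look like the
following. The helper name strip_marks is made up for this illustration,
and the technique (decompose, then drop combining marks) is not the table
lookup that Text::Unidecode performs.

```perl
use strict;
use warnings;
use utf8;                          # this source file contains UTF-8 literals
use Unicode::Normalize qw(NFD);    # core module

# Hypothetical helper: decompose each character, then drop combining marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);   # e.g. U+00C7 becomes "C" + U+0327
    $decomposed =~ s/\p{Mn}+//g;   # remove the nonspacing (combining) marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_marks("ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ"), "\n";
# prints CCCCcccGGGGggggEIIIIOOOO
```

Letters with no canonical decomposition, such as ø or Ł, pass through
unchanged, and CJK text is left alone, which is where a transliteration
table like the one in Text::Unidecode goes further.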






All of the characters he used as examples were well-formed UTF-8
characters.




> Those "accent characters" are misinterpreted foreign encoding (likely not
> to be Unicode) characters. Simply "stripping" the "accent characters" will
> basically convert them to truly meaningless junk. I suppose the meaningless
> junk can then be interpreted by the reader as "used to be a foreign
> word here", but why bother contributing further to information entropy?
>
>> 2. I see some talk of Internationalization in Chapter 2 of "Setting up
>> Cygwin", but cannot see anything relating to perl modules, and I don't
>> see any easy way to search many months of the mailing list for a
>> keyword... is there any information I should know about?
>
> Have you read the perldoc on internationalization?
--
Problem reports:  https://cygwin.com/problems.html
FAQ:  https://cygwin.com/faq/
Documentation:https://cygwin.com/docs.html
Unsubscribe info: https://cygwin.com/ml/#unsubscribe-simple

  



Re: Perl Unidecode modules - which to use (if not Text::Unidecode)?

2021-04-05 Thread Joel Rees via Cygwin

On Mon, Apr 5, 2021 at 6:26 PM L A Walsh  wrote:
>
> On 2021/04/04 14:26, Joel Rees via Cygwin wrote:
> >
> >> 1. What perl Unicode modules should I consider, if not Text::Unidecode?
> >> The present need
> >> is to be able to convert those few "foreign" characters (like
> >> ÇĆĈĊçĉċĜĞĠĢĝģğġËÌÍÎÏÒÓÔÕ)
> >> that are basically ASCII with accent marks to their closest ASCII
> >> equivalents, but I'd
> >> like to do more with Unicode in the future, without going down any
> >> dead-ends as far as
> >> being able to run under cygwin is concerned.
> >>
> >>
> >
> > "Stripping those few foreign accent characters" is probably not really what
> > you want to do.
> >
> 
> Why not?  You don't know his use case and you are misinterpreting his
> example as random garbage.

Actually, I was specifically _not_ interpreting them as random garbage. If they
were random garbage, it wouldn't matter what he does with them.

> Those aren't a random foreign encoding -- those are C's G's then E, I O
> with accent variations that he may want to collapse for purposes of storing
> in a text storage and retrieval (search) application.

In this world many things are possible, and those may actually be intentional
strings of characters with assorted diacriticals, some sort of example of
diacriticals, and he may have some reason to force the characters to their
base form instead of regenerating the text. Or maybe I'm misinterpreting
his intent. Maybe he doesn't want to strip the diacriticals so much as convert
the combinations to something like punycode.

> They are all well
> formed/well-coded UTF-8 characters -- they are not some 8-bit encoding
> that was remangled during a no-recoding display of them in a UTF-8
> context.

I've seen lots of strings like that that are the result of e-mail software
mangling. In Japan, we call it 文字化け (mojibake). And, yes, the e-mail
software "helpfully" converts the misinterpreted bytes to well-formed
but entirely irrelevant UTF-8 in many cases.

I will acknowledge that I don't see it as often as I used to, but it
still happens.
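
The mangling described above can be reproduced with the core Encode
module. This is only a sketch under one assumption: that the receiving
software misreads UTF-8 bytes as ISO-8859-1. Real mail software may
assume some other legacy encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);   # core module

# One accented character as a Perl character string: U+00C7 (C with cedilla).
my $original = "\x{C7}";

# A sender writes it out as UTF-8: two bytes, 0xC3 0x87.
my $bytes = encode('UTF-8', $original);

# A receiver wrongly assumes the bytes are ISO-8859-1 ...
my $misread = decode('ISO-8859-1', $bytes);

# ... and "helpfully" re-encodes what it thinks it saw as UTF-8.
my $mangled = encode('UTF-8', $misread);

printf "original code points: %s\n",
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $original;
printf "misread code points:  %s\n",
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $misread;
printf "bytes sent: %d, bytes after mangling: %d\n",
    length($bytes), length($mangled);
```

The result is valid UTF-8, but it now encodes two unrelated characters
(U+00C3 and U+0087) instead of the one that was sent.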

> I didn't know about Text::Unidecode -- but it specifically to create
> Latinized alternatives to foreign characters.  That was another hint
> that it wasn't a random mistake.  The manpage for it says:
>
>   It often happens that you have non-Roman text data in Unicode, but you
>   can't display it -- usually because you're trying to show it to a user
>   via an application that doesn't support Unicode, or because the fonts
>   you need aren't accessible.  You could represent the Unicode characters
>   as "???" or "\15BA\15A0\1610...", but that's nearly useless to the
>   user who actually wants to read what the text says.
>
> An example was like:
>
> tperl
> use utf8;
> use Text::Unidecode;
> my $name="\x{5317}\x{4EB0}";
>
> printf "name, %s == %s\n", $name, unidecode($name);
> '
> name, 北亰 == Bei Jing

I would not call that "stripping" accent marks. It's a process of
recognizing the characters, looking them up in a dictionary, and finding
a reasonable Latinized equivalent, which is a fairly involved process
requiring a bit of heuristics, since there is often a many-to-many
mapping involved.

> It's not just about removing accents but getting an English
> like translation based on the foreign text.

And that's actually what I was trying to point him to?

Okay, maybe my suggestions were too elliptical. Maybe I should have told
myself I was too busy and ignored his question like everybody else.

[snip]

-- 
Joel Rees

http://reiisi.blogspot.jp/p/novels-i-am-writing.html

[ANNOUNCEMENT] Test: schroedinger-coordgenlibs-2.0.2-1

2021-04-05 Thread Lemures Lemniscati via Cygwin-announce via Cygwin

The following test packages have been uploaded in the Cygwin
distribution:

* libcoordgen-devel-2.0.2-1
* libcoordgen2-2.0.2-1

* schroedinger-coordgenlibs-2.0.2-1-src

--
Schrödinger: coordgenlibs 2.0.2

A library for 2D coordinate generation for chemical compounds

Source: https://github.com/schrodinger/coordgenlibs
News: https://github.com/schrodinger/coordgenlibs/releases/tag/v2.0.2
License: BSD-3-Clause License
  https://github.com/schrodinger/coordgenlibs/blob/v2.0.2/LICENSE

Cygwin Package Summary:
  https://www.cygwin.com/packages/summary/schroedinger-coordgenlibs-src.html
Cygport Source:
  https://cygwin.com/git/?p=git/cygwin-packages/schroedinger-coordgenlibs.git

--
Lemures Lemniscati

ifconfig

2021-04-05 Thread Daniel L Newhouse via Cygwin

Cygwin no longer responds to ifconfig.  What is the story?


Re: ifconfig

2021-04-05 Thread Marco Atzeri via Cygwin


On 05.04.2021 14:57, Daniel L Newhouse via Cygwin wrote:

> Cygwin no longer responds to ifconfig.  What is the story?



I do not remember ifconfig ever being in any cygwin package.

The nearest Windows equivalent is "ipconfig".



Re: EXTERNAL: Re: ifconfig

2021-04-05 Thread Wells, Roger K. via Cygwin

On 4/5/21 11:17 AM, Marco Atzeri via Cygwin wrote:
> On 05.04.2021 14:57, Daniel L Newhouse via Cygwin wrote:
>> Cygwin no longer responds to ifconfig.  What is the story?
>>
>
> I do not remember ifconfig ever been in any cygwin package.
>
nor do I
> the Windows nearest is "ipconfig"
>
>


-- 
Roger Wells, P.E.
leidos
221 Third St
Newport, RI 02840
401-847-4210 (voice)
401-849-1585 (fax)
roger.k.we...@leidos.com


Re: libgccjit

2021-04-05 Thread Achim Gratz

Achim Gratz writes:
> Other than that, I'll probably provide test packages for the 10.3.0-RC1
> since I have to get that going anyway.

Now that I have also built the 32-bit version, I see that the JIT tests
fail there because the JIT executables don't seem to invoke the linker
correctly, and ld in turn doesn't find the necessary DLL and object files
to link to.  That may only be a problem for the test environment and go
away when the compiler is properly installed, or it may be something
deeper in the build / configuration.

Regards,
Achim.
-- 
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Samples for the Waldorf Blofeld:
http://Synth.Stromeko.net/Downloads.html#BlofeldSamplesExtra

Re: Perl Unidecode modules - which to use (if not Text::Unidecode)?

2021-04-05 Thread Mark Aitchison

A little more detail... I realise that stripping accents off is often not a 
good thing to do, but at the moment that basically is what I'm after, or to be 
more specific: I want to know if the character is a consonant or vowel... I 
basically throw away vowels and punctuation in this odd application. Later I 
will want to do all sorts of things with input text that might be utf8 or utf16 
or some encoding that (hopefully) I can guess and translate to the same 
standard and ultimately spit out on a web page.
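
A rough sketch of that filter, using only the core Unicode::Normalize
module, could look like this. The helper name keep_consonants and its
rule for what counts as a vowel (a Latin letter whose base form is one of
a, e, i, o, u) are assumptions made for this illustration. The rule
treats "y" as a consonant and says nothing useful about non-Latin
scripts.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);   # core module

# Hypothetical helper: reduce each character to its base letter, then
# drop punctuation and (Latin) vowels and keep everything else.
sub keep_consonants {
    my ($text) = @_;
    my $out = '';
    for my $char (split //, $text) {
        # First code point of the decomposition that is not a combining mark.
        my ($base) = NFD($char) =~ /^(\P{Mn})/;
        next unless defined $base;          # a lone combining mark
        next if $base =~ /\p{P}/;           # Unicode punctuation
        next if $base =~ /^[aeiou]$/i;      # assumed vowel set, see above
        $out .= $base;                      # append the unaccented base letter
    }
    return $out;
}

# Input given as code points: U+00C7, U+011E and U+00FC are accented C, G, u.
print keep_consonants("\x{C7}a va, \x{11E}\x{FC}l?"), "\n";   # prints "C v Gl"
```

Whitespace, digits and symbols are kept in this sketch; only characters
in the Unicode punctuation category and the assumed vowels are removed.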

There seem to be many perl modules that do similar things... I want to be able 
to distribute my code and not require people to download things from cpan. I'd 
like to stick with modules that are as stock standard as standard can be, i.e. 
are in a standard cygwin distribution, and are normally found in other perl 
environments. In a sense, searching cpan gives me too many options because that 
includes modules that might require a customer to do more than I should ask 
them to have to do, if it could have been avoided by me choosing a more 
standard way of achieving the goal in the first place.

What I probably should have asked is...
1. What perl module, that comes with cygwin, is good for telling whether a 
letter is a consonant?
2. Later on I will also need something that makes a reasonable guess as to what 
kind of encoding is used in some text (that might not have a helpful header 
telling me the answer), with the view to converting it to whatever encoding I 
want? I can find software to do this, but I would like to restrict options to 
just those a cygwin user can install with the setup program... if I'm not being 
too unrealistic about that requirement.
Thanks, Mark
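
For the second question, one candidate that ships with core perl is
Encode::Guess. The following is a sketch only: decode_guessed is a
made-up helper name, and with no suspects passed the module only
considers ASCII, UTF-8, and BOM-marked UTF-16/32.

```perl
use strict;
use warnings;
use Encode::Guess;   # core module; exports guess_encoding()

# Hypothetical helper: return (encoding name, decoded text), or an empty
# list when the guess fails or is ambiguous.
sub decode_guessed {
    my ($bytes, @suspects) = @_;
    my $guess = guess_encoding($bytes, @suspects);
    return unless ref $guess;          # on failure $guess is an error string
    return ($guess->name, $guess->decode($bytes));
}

my $utf16 = "\xFF\xFE" . "h\0i\0";     # BOM followed by "hi" in UTF-16LE
my $utf8  = "\xC3\x87a";               # U+00C7 followed by "a", in UTF-8

for my $sample ($utf16, $utf8) {
    my ($name, $text) = decode_guessed($sample);
    print defined $name ? "guessed $name\n" : "could not guess\n";
}
```

Extra suspects can be passed as further arguments, for example
decode_guessed($bytes, 'shiftjis', 'euc-jp'). Single-byte legacy
encodings generally cannot be told apart this way, so a failed or
ambiguous guess should be expected and handled.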


Re: Perl Unidecode modules - which to use (if not Text::Unidecode)?

2021-04-05 Thread Joel Rees via Cygwin

Well, in the following, are your plans cognizant of the fact that many
major languages do not incorporate a partition between vowels and
consonants?

Do you plan to target only those languages which do?

On Tue, Apr 6, 2021 at 6:50, Mark Aitchison wrote:

>
> A little more detail... I realise that stripping accents off is often not
> a good thing to do, but at the moment that basically is what I'm after, or
> to be more specific: I want to know if the character is a consonant or
> vowel... I basically throw away vowels and punctuation in this odd
> application. Later I will want to do all sorts of things with input text
> that might be utf8 or utf16 or some encoding that (hopefully) I can guess
> and translate to the same standard and ultimately spit out on a web page.
>
> There seem to be many perl modules that do similar things... I want to be
> able to distribute my code and not require people to download things from
> cpan. I'd like to stick with modules that are as stock standard as standard
> can be, i.e. are in a standard cygwin distribution, and are normally found
> in other perl environments. In a sense, searching cpan gives me too many
> options because that includes modules that might require a customer to do
> more than I should ask them to have to do, if it could have been avoided by
> me choosing a more standard way of achieving the goal in the first place.
>
> What I probably should have asked is...
> 1. What perl module, that comes with cygwin, is good for telling whether a
> letter is a consonant?
> 2. Later on I will also need something that makes a reasonable guess as to
> what kind of encoding is used in some text (that might not have a helpful
> header telling me the answer), with the view to converting it to whatever
> encoding I want? I can find software to do this, but I would like to
> restrict options to just those a cygwin user can install with the setup
> program... if I'm not being too unrealistic about that requirement.
> Thanks, Mark
>
>
