On 9/18/10, Corinna Vinschen wrote:
> On Sep 18 11:21, Corinna Vinschen wrote:
>> On Sep 17 22:30, Lee wrote:
>> > On 9/16/10, Corinna Vinschen wrote:
>> > > On Sep 15 18:30, Lee wrote:
>> > >> I don't know if this is just a problem with the cygwin version of
>> > >> awk,
>> > >> me misunderstanding something or what, but it looks like gsub isn't
>> > >> working correctly in awk:
>> > >> $ sh /tmp/test.awk
>> > >> s= ::0:: should = ::S0::
>> > >>
>> > >> $ cat /tmp/test.awk
>> > >> awk '
>> > >> BEGIN {
>> > >> s="Serial0"
>> > >> gsub("[a-z]","",s)
>> > >> printf("s= ::%s:: should = ::S0::\n", s)
>> > >> exit
>> > >> } '
>> > >>
>> > >> I also tried it with IGNORECASE=0 and with "awk --traditional" - same
>> > >> results.
>> > > Works fine for me:
>> >
>> > Comment out the 'set LANG=" and gsub works fine:
>> > $ echo $LANG
>> > C.UTF-8
>> >
>> > $ sh /tmp/test.awk
>> > s= ::S0:: should = ::S0::
>> >
>> > $ export LANG=en_US.UTF-8
>> >
>> > $ sh /tmp/test.awk
>> > s= ::0:: should = ::S0::
>> >
>> > So awk gsub works for me again - thank you!
>> >
>> > Just out of curiosity, why would setting LANG to en_US break
>> > case-sensitivity in gsub?
>>
>> I don't know either. I just asked the upstream maintainer. At least it
>> isn't a Cygwin problem, since it also behaves the same on Linux.
>
> I got reply from the upstream maintainer. Case-sensitivity in gsub is
> not broken, rather it's really a language dependent difference.
>
> If LANG is "en_US" or "en_US.utf8", then the regular expression "[a-z]"
> does *not* correspond anymore to the ASCII codes. Rather it corresponds
> to something like "[aAbBcCdD...zZ]", independent of the actual character
> encoding ISO-8859-1 or UTF-8.

Thank you - I appreciate the follow-up. Was the reply from the upstream
maintainer posted on a mailing list? (& if so, which one?) I'd like to
understand the problem they're solving.

I get the idea of "[[:lower:]]" working regardless of the collating order
of the current char set, but how "[a-z]" gets translated to something like
"[aAbBcCdD...zZ]" boggles my mind. It seems like they had to have gone out
of their way to translate [a-z] into a case-insensitive RE.

But regardless, it still seems broken to me. From the gawk man page:

    The various command line options control how gawk interprets
    characters in regular expressions.

    --traditional
        Traditional Unix awk regular expressions are matched. The GNU
        operators are not special, interval expressions are not
        available, and neither are the POSIX character classes
        ([[:alnum:]] and so on).

The way I read it, I can change the line in my .bashrc from
    export AWK="/usr/bin/gawk.exe"
to
    export AWK="/usr/bin/gawk.exe --traditional"
and not have to change any scripts that use $AWK. If "--traditional" meant
one was no longer able to do a case-sensitive RE ("[a-z]" gets translated
into "[aAbB...zZ]" and "[[:lower:]]" isn't interpreted as a lower case
character RE) I'd expect that to be highlighted in the man page.

But like I said in my initial msg, --traditional doesn't fix the problem:

$ cat test.awk
awk --traditional '
BEGIN {
    s="Serial0"
    gsub("[a-z]","",s)
    printf("s= ::%s:: should = ::S0::\n", s)
    exit
} '

$ export LANG=en_US.UTF-8
$ sh test.awk
s= ::0:: should = ::S0::

> What you really want is this:

s/really want/have to do/

> BEGIN {
> s="Serial0"
> gsub("[[:lower:]]","",s)
> printf("s= ::%s:: should = ::S0::\n", s)
> exit
> }
>
> The "[[:lower:]]" expression always catches all valid lowercase letters,
> independent of the language, territory, and charset used.

At least for the short term, my work-around is not setting LANG.
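
A narrower variant of that work-around, for anyone who wants to keep LANG set: force the C locale for the awk invocation only. This is just a sketch, assuming a POSIX shell and some awk on the PATH; the variable name `result` is made up for the example. LC_ALL takes precedence over LANG, and in the C locale "[a-z]" is the plain ASCII range.

```shell
# Override the locale for this single awk call; LANG itself is untouched
# for the rest of the session.
result=$(LC_ALL=C awk 'BEGIN {
    s = "Serial0"
    gsub("[a-z]", "", s)   # in the C locale this strips only ASCII a..z
    print s
}')

echo "s= ::$result:: should = ::S0::"
```

The trade-off, as far as I can tell, is that awk then treats its input as single bytes, so this is probably only reasonable for scripts that deal with ASCII data.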

Thanks again,
Lee