Question about porting the upstream "dos2unix" utilities. These implementations provide capabilities to convert text files from a certain limited set of INPUT encodings (most are DOS codepages):
=====================================================
CONVERSION MODES
    Conversion modes ascii, 7bit, and iso are similar to those of
    dos2unix/unix2dos under SunOS/Solaris.

    ascii
        In mode "ascii" only line breaks are converted. This is the
        default conversion mode.

        Although the name of this mode is ASCII, which is a 7 bit
        standard, the actual mode is 8 bit. Use always this mode when
        converting Unicode UTF-8 files.

    7bit
        In this mode all 8 bit non-ASCII characters (with values from
        128 to 255) are converted to a 7 bit space.

    iso
        Characters are converted between a DOS character set (code
        page) and ISO character set ISO-8859-1 (Latin-1) on Unix. DOS
        characters without ISO-8859-1 equivalent, for which conversion
        is not possible, are converted to a dot. The same counts for
        ISO-8859-1 characters without DOS counterpart.

        When only option "-iso" is used dos2unix will try to determine
        the active code page. When this is not possible dos2unix will
        use default code page CP437, which is mainly used in the USA.
        To force a specific code page use options "-437" (US), "-850"
        (Western European), "-860" (Portuguese), "-863" (French
        Canadian), or "-865" (Nordic). Windows code page CP1252
        (Western European) is also supported with option "-1252". For
        other code pages use dos2unix in combination with iconv(1).
        Iconv can convert between a long list of character encodings.
=====================================================

So basically, if you specify -iso (or --conv iso) without any of the "input encoding specification" options like -437 etc., then dos2unix will attempt to detect the *console* encoding. If it succeeds, it will "convert" character codes from that encoding to their equivalents in ISO-8859-1 ("Latin 1"); unconvertible codes are replaced with an ASCII dot.

Note that this autodetection, if it works, assumes that the console's CP is the input file's CP. Fair enough -- and it's an overridable default anyway.
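To make the selection order concrete, here is a minimal sketch of how I read the man page: an explicit option wins, otherwise the detected console code page is used, otherwise CP437. The names choose_input_cp() and cp_is_supported() are hypothetical and are not taken from the dos2unix sources. Treating a detected but unsupported code page the same as a failed detection is my assumption, not something the man page states.

```c
#include <stddef.h>

/* Code pages the man page lists as usable with "-iso". */
static const unsigned short supported_cp[] = { 437, 850, 860, 863, 865, 1252 };

/* Returns 1 if cp is in the list above, 0 otherwise. */
static int cp_is_supported(unsigned short cp)
{
    size_t i;
    for (i = 0; i < sizeof supported_cp / sizeof supported_cp[0]; i++)
        if (supported_cp[i] == cp)
            return 1;
    return 0;
}

/*
 * Hypothetical helper, not dos2unix's own code.
 * forced_cp:   code page given on the command line (e.g. 850), or 0 if none;
 *              assumed to be already validated by option parsing.
 * detected_cp: result of console detection, or 0 if detection failed.
 */
unsigned short choose_input_cp(unsigned short forced_cp,
                               unsigned short detected_cp)
{
    if (forced_cp != 0)
        return forced_cp;      /* explicit option overrides detection */
    if (cp_is_supported(detected_cp))
        return detected_cp;    /* console CP assumed to match the file's CP */
    return 437;                /* documented fallback; the unsupported-CP
                                  case landing here is an assumption */
}
```

With this reading, the only input that depends on the platform is detected_cp, so the question below is really about what value should be fed into that second argument on cygwin.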
However, I wonder if, in cygwin-1.7, we actually can/should use the "console codepage" in ANY way. Here's the code:

querycp.c:

    #elif defined (WIN32) || defined(__CYGWIN__)
    /* Erwin Waterlander */
    #include <windows.h>
    unsigned short query_con_codepage(void)
    {
        return((unsigned short)GetConsoleOutputCP());
    }
    #else

Or should we instead, on cygwin, use some other mechanism (locale settings?) to determine the correct default "input" codepage?

Comments?

--
Chuck