https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112652

--- Comment #10 from ro at CeBiTec dot Uni-Bielefeld.DE <ro at CeBiTec dot Uni-Bielefeld.DE> ---
> --- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to r...@cebitec.uni-bielefeld.de from comment #8)
>> FWIW, the iconv conversion tables in /usr/lib/iconv can be regenerated
>> from the OpenSolaris sources, modified not to do that '?' conversion.
>> Worked for a quick check for the UTF-8 -> ASCII example, but the '?' is
>> more prevalent and would need to be eradicated upstream.
>
> If it is always '?' used instead of unknown character, we could also have some
> hack on the libcpp side for it.

It took me a bit to get back to you here since I had to both check with
Solaris engineering and dig up our old Solaris 9 sources (which, unlike
OpenSolaris, have no relevant parts missing due to copyright issues).

Both what I found in the iconv conversion tables and what's documented
in unicode_iconv(7) confirm the consistent use of '?'.  The manpage has:

       If the source character code value is not within a range defined by the
       source codeset standard, it is considered as an illegal character. If
       the source character code value is within the range(s) defined by the
       standard, it will be considered as non-identical, even if the source
       character code value maps to an undefined or a reserved location within
       the valid range. The non-identical character will map to either ? (0x3f
       in ASCII-compatible codesets) if the target codeset is a non-Unicode
       codeset or to Unicode replacement character (U+FFFD) if the target
       codeset is an Unicode codeset.

The '?' will of course be in the respective charset's encoding (0x3f
for ASCII, 0x6f for EBCDIC), but that's all I could find.  This is not
a complete guarantee (I may well have missed something), but it seems
plausible enough...
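
For illustration, a minimal test program (my own sketch, not from the
bug or libcpp) showing the behavior the manpage describes.  It assumes
the Solaris codeset name "646" for ASCII and that iconv lives in libc,
as on Solaris; with GNU libiconv you'd link -liconv and get EILSEQ
instead of a silent replacement:

#include <stdio.h>
#include <string.h>
#include <iconv.h>

int
main (void)
{
  /* "646" is the Solaris codeset name for (ISO 646) ASCII.  */
  iconv_t cd = iconv_open ("646", "UTF-8");
  if (cd == (iconv_t) -1)
    {
      perror ("iconv_open");
      return 1;
    }

  char inbuf[] = "caf\xc3\xa9";		/* UTF-8 for "café" */
  char outbuf[16];
  char *inp = inbuf, *outp = outbuf;
  size_t inleft = strlen (inbuf), outleft = sizeof outbuf;

  size_t res = iconv (cd, &inp, &inleft, &outp, &outleft);
  if (res == (size_t) -1)
    perror ("iconv");		/* GNU libiconv fails with EILSEQ here.  */
  else
    {
      *outp = '\0';
      /* Solaris iconv succeeds and prints "caf?"; per POSIX the return
	 value counts the non-identical conversions performed.  */
      printf ("\"%s\" (%zu non-identical)\n", outbuf, res);
    }
  iconv_close (cd);
  return 0;
}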

> Like (but limited to Solaris hosts) in convert_using_iconv when
> converting from SOURCE_CHARSET to some other character set don't try
> to convert the whole UTF-8 string at once, but split it into chunks
> at u'?' characters, so
> foo???bar?baz?qux
> would be iconv converted as
> foo
> ???
> bar
> ?
> baz
> ?
> qux
> chunks.  And when converting the non-? chunks, it would after the
> conversion check for the '?' character (in the destination character
> set - that is something that perhaps could be queried during
> initialization after iconv_open) and treat it as an error if it
> appeared there.  Or always convert also back to UTF-8 and check if it
> has more '?' characters than the source.

Unless we want to take the easy way out and just require GNU libiconv on
Solaris, that seems like a plausible way of handling the issue.
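
For concreteness, here's a rough sketch of what that could look like
(all names are hypothetical; none of this is actual libcpp code).  It
assumes the destination charset's encoding of '?' is determined once
after iconv_open, by converting "?" itself as you suggest.  Splitting
the input at the byte 0x3f is safe in UTF-8, since that byte can only
ever encode '?'.  One caveat: for multibyte target charsets a naive
byte search for the encoded '?' could false-positive inside another
character's encoding, so a real implementation would need to search
character-wise there, or fall back to your convert-back-to-UTF-8
variant.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <iconv.h>

/* Determine the destination charset's encoding of '?' once, right
   after iconv_open, by converting that single character.  */
static bool
query_dest_question (iconv_t cd, char *buf, size_t bufsize, size_t *qlen)
{
  char q = '?';
  char *inp = &q, *outp = buf;
  size_t inleft = 1, outleft = bufsize;
  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
    return false;
  *qlen = (size_t) (outp - buf);
  iconv (cd, NULL, NULL, NULL, NULL);	/* reset conversion state */
  return true;
}

/* Convert a '?'-free chunk and fail if the output contains the
   destination encoding of '?': since none went in, its presence can
   only mean Solaris iconv silently replaced an unconvertible
   character.  */
static bool
convert_checked_chunk (iconv_t cd, const char *chunk, size_t len,
		       char **outp, size_t *outleft,
		       const char *dq, size_t dqlen)
{
  char *inp = (char *) chunk;
  char *outstart = *outp;
  if (iconv (cd, &inp, &len, outp, outleft) == (size_t) -1)
    return false;
  for (char *p = outstart; p + dqlen <= *outp; p++)
    if (memcmp (p, dq, dqlen) == 0)
      return false;
  return true;
}

/* Split the UTF-8 input at '?' characters, as in the foo / ??? / bar
   example above: runs of literal '?'s are converted as-is (they
   legitimately produce '?'), everything else goes through the output
   check.  */
static bool
convert_split_at_question (iconv_t cd, const char *in, size_t inlen,
			   char **outp, size_t *outleft,
			   const char *dq, size_t dqlen)
{
  while (inlen > 0)
    {
      bool qrun = (in[0] == '?');
      size_t n = 0;
      while (n < inlen && (in[n] == '?') == qrun)
	n++;

      if (qrun)
	{
	  char *inp = (char *) in;
	  size_t left = n;
	  if (iconv (cd, &inp, &left, outp, outleft) == (size_t) -1)
	    return false;
	}
      else if (!convert_checked_chunk (cd, in, n, outp, outleft, dq, dqlen))
	return false;

      in += n;
      inlen -= n;
    }
  return true;
}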
