Wouter van Vliet wrote:
characters are being replaced by weird characters. EG: the ' single
quote is being replaced by a question mark
First check that you use iso-8859-1 (latin-1) as the encoding everywhere,
unless you really want to use Unicode (utf-8 encoding):
- in html pages generated from php
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
- as mysql default encoding (see mysql doc)
- as apache default encoding for html pages served (see previous post)
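For the PHP side, the same charset can also be sent as a real HTTP header, so it agrees with the meta tag. A minimal sketch, assuming PHP 7+; the helper name content_type_header is made up for illustration:

```php
<?php
// Hypothetical helper: build the Content-Type header line for a charset.
function content_type_header(string $charset = 'iso-8859-1'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() must run before any output is sent to the client.
if (!headers_sent()) {
    header(content_type_header());
}
```

If you really want Unicode, pass 'utf-8' instead and change the meta tag to match.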
The single quote (') is probably not the standard ASCII one (0x27),
but the dumb Micro$oft 'smart quote', which uses a code
defined in Windows charset 1252 but not as a printable character
in Latin-1 (iso-8859-1), i.e. in the range \x80-\x9f (128-159),
where iso-8859-1 has only control codes. See [6] -- search for "cp1252".
The problem comes from not respecting standards (latin-1 encoding),
for example when a user fills in an HTML form by cut & paste from M$ Word :(
To avoid this, all user-supplied data must be validated,
by first removing or translating invalid chars.
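As a first validation step, you can check whether a submitted string contains any byte in the \x80-\x9f range at all. A rough sketch, assuming PHP 7+ and single-byte input (cp1252 or latin-1, not utf-8, where these byte values also occur inside multi-byte sequences); the function name is hypothetical:

```php
<?php
// Returns true if $s contains at least one byte in 0x80-0x9F,
// i.e. a cp1252-only character such as a 'smart quote'.
// Assumes $s is single-byte text, not UTF-8.
function has_cp1252_specials(string $s): bool
{
    return preg_match('/[\x80-\x9F]/', $s) === 1;
}

var_dump(has_cp1252_specials("It\x92s broken")); // smart quote, byte 0x92
var_dump(has_cp1252_specials("It's fine"));      // plain ASCII quote
```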
The solution is to convert invalid chars to valid ones.
On a unix/linux/bsd box, perhaps man tr and info recode can help.
Or use the cp1252 to Unicode table [5], with the interesting bits below
(the invalid latin-1 chars). This can help you write conversion functions,
like I did for cp1252 to utf-8 html (Unicode) in [6] with strtr.
0x80 0x20AC #EURO SIGN
0x81 #UNDEFINED
0x82 0x201A #SINGLE LOW-9 QUOTATION MARK
0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK
0x84 0x201E #DOUBLE LOW-9 QUOTATION MARK
0x85 0x2026 #HORIZONTAL ELLIPSIS
0x86 0x2020 #DAGGER
0x87 0x2021 #DOUBLE DAGGER
0x88 0x02C6 #MODIFIER LETTER CIRCUMFLEX ACCENT
0x89 0x2030 #PER MILLE SIGN
0x8A 0x0160 #LATIN CAPITAL LETTER S WITH CARON
0x8B 0x2039 #SINGLE LEFT-POINTING ANGLE QUOTATION MARK
0x8C 0x0152 #LATIN CAPITAL LIGATURE OE
0x8D #UNDEFINED
0x8E 0x017D #LATIN CAPITAL LETTER Z WITH CARON
0x8F #UNDEFINED
0x90 #UNDEFINED
0x91 0x2018 #LEFT SINGLE QUOTATION MARK
0x92 0x2019 #RIGHT SINGLE QUOTATION MARK
0x93 0x201C #LEFT DOUBLE QUOTATION MARK
0x94 0x201D #RIGHT DOUBLE QUOTATION MARK
0x95 0x2022 #BULLET
0x96 0x2013 #EN DASH
0x97 0x2014 #EM DASH
0x98 0x02DC #SMALL TILDE
0x99 0x2122 #TRADE MARK SIGN
0x9A 0x0161 #LATIN SMALL LETTER S WITH CARON
0x9B 0x203A #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
0x9C 0x0153 #LATIN SMALL LIGATURE OE
0x9D #UNDEFINED
0x9E 0x017E #LATIN SMALL LETTER Z WITH CARON
0x9F 0x0178 #LATIN CAPITAL LETTER Y WITH DIAERESIS
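The table above maps directly onto a strtr() translation array. What follows is only a sketch of that idea, not the code posted in [6]: it assumes PHP 7+ and single-byte input, and replaces each cp1252-only byte with an HTML numeric character reference, so the page itself can stay iso-8859-1. The function name is made up for illustration.

```php
<?php
// Sketch: translate cp1252-only bytes (0x80-0x9F) into HTML numeric
// character references, using the code points from the table above.
// Hypothetical function name; input is assumed to be single-byte text.
function cp1252_specials_to_entities(string $s): string
{
    static $map = [
        "\x80" => '&#x20AC;', "\x82" => '&#x201A;', "\x83" => '&#x0192;',
        "\x84" => '&#x201E;', "\x85" => '&#x2026;', "\x86" => '&#x2020;',
        "\x87" => '&#x2021;', "\x88" => '&#x02C6;', "\x89" => '&#x2030;',
        "\x8A" => '&#x0160;', "\x8B" => '&#x2039;', "\x8C" => '&#x0152;',
        "\x8E" => '&#x017D;', "\x91" => '&#x2018;', "\x92" => '&#x2019;',
        "\x93" => '&#x201C;', "\x94" => '&#x201D;', "\x95" => '&#x2022;',
        "\x96" => '&#x2013;', "\x97" => '&#x2014;', "\x98" => '&#x02DC;',
        "\x99" => '&#x2122;', "\x9A" => '&#x0161;', "\x9B" => '&#x203A;',
        "\x9C" => '&#x0153;', "\x9E" => '&#x017E;', "\x9F" => '&#x0178;',
    ];
    // strtr() with an array replaces each key once and never rescans
    // its own output, so the '&' and '#' it inserts are left alone.
    return strtr($s, $map);
}

echo cp1252_specials_to_entities("It\x92s \x93quoted\x94"), "\n";
// prints: It&#x2019;s &#x201C;quoted&#x201D;
```

The five undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are not in the map and pass through unchanged; you may prefer to strip them instead.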
Some references:
Code Pages Supported by Windows
[1] http://www.microsoft.com/globaldev/reference/wincp.mspx
Microsoft Windows Codepage : 1252 (Latin I)
[2] http://www.microsoft.com/globaldev/reference/sbcs/1252.htm
Latin 1 (1252)
[3] http://www.microsoft.com/typography/unicode/1252.htm
Latin 1 (1252) Graphic representation
[4] http://www.microsoft.com/typography/unicode/1252.gif
cp1252 to Unicode table
[5] ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
strtr comments
[6] http://www.php.net/manual/en/function.strtr.php
--
PHP General Mailing List (http://www.php.net/)