Wouter van Vliet wrote:
characters are being replaced by weird characters. EG: the ' single
quote is being replaced by a question mark

First, check that you use iso-8859-1 (latin-1) as the encoding everywhere,
unless you really want to use unicode (utf-8 encoding):
- in html pages generated from php:
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
- as the mysql default encoding (see mysql doc)
- as the apache default encoding for html pages served (see previous post)

The single quote (') is probably not the standard ascii one,
but the dumb Micro$oft 'smart quote', which uses a code
defined in windows charset 1252 but not in Latin1 (iso-8859-1),
i.e. in the range \x80-\x9f (128-159), where Latin1 has only
control codes and no printable chars. See [6] and search for "cp1252".
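
To see whether a string is affected, you can look for bytes in that range. A minimal sketch follows; has_cp1252_extras() is a made-up name, and it assumes the input is single-byte text (latin-1 or cp1252). On utf-8 input it would give false alarms, because utf-8 uses the same byte values inside multi-byte sequences.

```php
<?php
// Hypothetical helper: reports whether $text contains any byte in the
// range \x80-\x9f, where cp1252 puts its smart quotes, dashes, etc.
// Assumes single-byte input (latin-1 or cp1252), not utf-8.
function has_cp1252_extras(string $text): bool
{
    // No /u modifier: the pattern must match raw bytes, not utf-8 chars.
    return preg_match('/[\x80-\x9f]/', $text) === 1;
}

var_dump(has_cp1252_extras("It's fine"));      // plain ascii apostrophe
var_dump(has_cp1252_extras("It\x92s broken")); // cp1252 right single quote
```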

The problem comes from not respecting standards (latin-1 encoding),
for example when a user fills in an html form by cut&paste from M$-word :(
To avoid this, all user-supplied data must be validated
by first removing or translating invalid chars.
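
One possible shape for that validation step, as a rough sketch only: translate the most common Word punctuation to plain ascii, then drop whatever is left in \x80-\x9f. The function name clean_for_latin1() and the choice of ascii replacements are my assumptions, so adjust them to taste.

```php
<?php
// Hypothetical helper: make a cp1252 string safe to store as iso-8859-1.
// Assumes single-byte input; the ascii replacements are only a suggestion.
function clean_for_latin1(string $text): string
{
    // Translate common "smart" punctuation to plain ascii equivalents.
    $map = array(
        "\x85" => "...", // horizontal ellipsis
        "\x91" => "'",   // left single quotation mark
        "\x92" => "'",   // right single quotation mark
        "\x93" => '"',   // left double quotation mark
        "\x94" => '"',   // right double quotation mark
        "\x96" => "-",   // en dash
        "\x97" => "-",   // em dash
    );
    $text = strtr($text, $map);

    // Remove any remaining byte in \x80-\x9f (no /u: match raw bytes).
    return (string) preg_replace('/[\x80-\x9f]/', '', $text);
}

echo clean_for_latin1("\x93Hi\x94 \x96 it\x92s\x85"), "\n";
```

Removing bytes loses information (the euro sign at \x80, for instance), so translating to entities may be the better choice when the char matters.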

The solution is to convert invalid chars to valid ones.
On a unix/linux/bsd box, perhaps man tr and info recode can help.
Or use the cp1252 to Unicode table [5], with the interesting bits below
(invalid latin1 chars). This can help you write conversion functions,
like I did for cp1252 to utf8 html (unicode) in [6] with strtr.

0x80    0x20AC  #EURO SIGN
0x81            #UNDEFINED
0x82    0x201A  #SINGLE LOW-9 QUOTATION MARK
0x83    0x0192  #LATIN SMALL LETTER F WITH HOOK
0x84    0x201E  #DOUBLE LOW-9 QUOTATION MARK
0x85    0x2026  #HORIZONTAL ELLIPSIS
0x86    0x2020  #DAGGER
0x87    0x2021  #DOUBLE DAGGER
0x88    0x02C6  #MODIFIER LETTER CIRCUMFLEX ACCENT
0x89    0x2030  #PER MILLE SIGN
0x8A    0x0160  #LATIN CAPITAL LETTER S WITH CARON
0x8B    0x2039  #SINGLE LEFT-POINTING ANGLE QUOTATION MARK
0x8C    0x0152  #LATIN CAPITAL LIGATURE OE
0x8D            #UNDEFINED
0x8E    0x017D  #LATIN CAPITAL LETTER Z WITH CARON
0x8F            #UNDEFINED
0x90            #UNDEFINED
0x91    0x2018  #LEFT SINGLE QUOTATION MARK
0x92    0x2019  #RIGHT SINGLE QUOTATION MARK
0x93    0x201C  #LEFT DOUBLE QUOTATION MARK
0x94    0x201D  #RIGHT DOUBLE QUOTATION MARK
0x95    0x2022  #BULLET
0x96    0x2013  #EN DASH
0x97    0x2014  #EM DASH
0x98    0x02DC  #SMALL TILDE
0x99    0x2122  #TRADE MARK SIGN
0x9A    0x0161  #LATIN SMALL LETTER S WITH CARON
0x9B    0x203A  #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
0x9C    0x0153  #LATIN SMALL LIGATURE OE
0x9D            #UNDEFINED
0x9E    0x017E  #LATIN SMALL LETTER Z WITH CARON
0x9F    0x0178  #LATIN CAPITAL LETTER Y WITH DIAERESIS
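
The table above can be turned into a strtr map fairly directly. The sketch below is not the code from [6]; cp1252_to_entities() is a made-up name, and it assumes single-byte input. It replaces each defined cp1252-only byte with a numeric html entity, which is valid in a latin-1 page, and leaves the undefined bytes untouched.

```php
<?php
// Hypothetical helper, not the code from [6]: replace cp1252-only bytes
// with numeric html entities built from the cp1252 to Unicode table.
function cp1252_to_entities(string $text): string
{
    // byte => Unicode code point, copied from the table above.
    // The undefined bytes (81, 8D, 8F, 90, 9D) are left out on purpose.
    $codepoints = array(
        "\x80" => 0x20AC, "\x82" => 0x201A, "\x83" => 0x0192,
        "\x84" => 0x201E, "\x85" => 0x2026, "\x86" => 0x2020,
        "\x87" => 0x2021, "\x88" => 0x02C6, "\x89" => 0x2030,
        "\x8A" => 0x0160, "\x8B" => 0x2039, "\x8C" => 0x0152,
        "\x8E" => 0x017D, "\x91" => 0x2018, "\x92" => 0x2019,
        "\x93" => 0x201C, "\x94" => 0x201D, "\x95" => 0x2022,
        "\x96" => 0x2013, "\x97" => 0x2014, "\x98" => 0x02DC,
        "\x99" => 0x2122, "\x9A" => 0x0161, "\x9B" => 0x203A,
        "\x9C" => 0x0153, "\x9E" => 0x017E, "\x9F" => 0x0178,
    );

    // Build the strtr map: one byte => "&#NNNN;" (decimal entity).
    $map = array();
    foreach ($codepoints as $byte => $cp) {
        $map[$byte] = '&#' . $cp . ';';
    }

    return strtr($text, $map);
}

echo cp1252_to_entities("It\x92s \x80 5"), "\n"; // It&#8217;s &#8364; 5
```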

Some references:

Code Pages Supported by Windows
[1] http://www.microsoft.com/globaldev/reference/wincp.mspx

Microsoft Windows Codepage : 1252 (Latin I)
[2] http://www.microsoft.com/globaldev/reference/sbcs/1252.htm

Latin 1 (1252)
[3] http://www.microsoft.com/typography/unicode/1252.htm

Latin 1 (1252) Graphic representation
[4] http://www.microsoft.com/typography/unicode/1252.gif

cp1252 to Unicode table
[5] ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

strtr comments
[6] http://www.php.net/manual/en/function.strtr.php

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


