You might look at the htmlentities() function...

On Tue, 29 Oct 2002, a.h.s. boy wrote:
> I'm working on a PHP-based CMS that allows users to post lengthy
> article texts by submitting through a form. The short version of my
> quandary is this: How can I create a conversion routine that reliably
> substitutes HTML-acceptable output for high-ASCII characters pasted
> into the form (from a variety of operating systems)?
>
> The longer version is this:
> In order to prevent scripting vulnerabilities and a variety of other
> undesirable content, I run the body of the text through a cleantext()
> function. This function first strips out illegal HTML tags and
> JavaScript. So far so good.
>
> Then it attempts to perform some character conversions to clean up
> 8-bit ASCII characters in the text, so smart quotes, en- and em-dashes,
> ellipses, etc. are converted to suitable alternatives, or to HTML
> entities. I'm using:
>
> // Reference:
> // chr(133) = ellipsis
> // chr(145) = left curly single quote
> // chr(146) = right curly single quote (apostrophe)
> // chr(147) = left curly double quote
> // chr(148) = right curly double quote
> // chr(149) = bullet
> // chr(150) = en dash
> // chr(151) = em dash
> // chr(153) = trademark
> // chr(160) = non-breaking space
> // chr(161) = inverted exclamation mark
> // chr(169) = copyright symbol
> // chr(171) = left guillemet
> // chr(173) = soft hyphen
> // chr(174) = registered trademark
> // chr(187) = right guillemet
> // chr(188) = 1/4 fraction
> // chr(189) = 1/2 fraction
> // chr(190) = 3/4 fraction
> // chr(191) = inverted question mark
> $changearr = array(" "=>" ",
>     "\r"=>"\n",
>     "\r\n"=>"\n",
>     "\n\n\n" => "\n\n",
>     chr(133)=>"...",
>     chr(145)=>"'",
>     chr(146)=>"'",
>     chr(147)=>"\"",
>     chr(148)=>"\"",
>     chr(149)=>"*",
>     chr(150)=>"-",
>     chr(151)=>"--",
>     chr(153)=>"(TM)",
>     chr(160)=>" ",
>     chr(161)=>"¡",
>     chr(169)=>"©",
>     chr(171)=>"«",
>     chr(173)=>"-",
>     chr(174)=>"(R)",
>     chr(187)=>"»",
>     chr(188)=>"1/4",
>     chr(189)=>"1/2",
>     chr(190)=>"3/4",
>     chr(191)=>"¿");
> $returnstr = strtr($returnstr,$changearr);
>
> The server's on a Linux box (RedHat 7.2, standard US installation);
> users can obviously post from any sort of operating system.
>
> This routine seems to work well on Word text pasted in from my Mac (OS
> X 10.2.1), but I see a number of articles appearing on the site with
> text like:
>
> Wouldnâ€(TM)t you say?
>
> (That's "Wouldn[a circumflex][Euro symbol](TM)t" instead of "Wouldn't".)
>
> ...which was almost definitely pasted in from a Windows-based Microsoft
> Word, and the conversion routines are failing. (And inserting even
> weirder characters...why would the single quote be replaced by _3_
> character substitutions?)
>
> I understand that Windows may well use a different character set for
> high-ASCII, but I frankly don't understand how to work that knowledge
> into this situation. And the combination of original text, Linux,
> chr(), and ord() stuff just doesn't make sense to me. For example, if I
> post text (from my Mac) containing only:
>
> “”‘’…
> (that's
> [open-double-quote][close-double-quote][open-single-quote][close-single-quote][ellipsis])
>
> and have PHP run this:
>
> for ($x = 0; $x < strlen($str); $x++) {
>     $mailstr .= $str[$x].' is '.ord($str[$x])."\n";
> }
> mail('me','Characters',$mailstr);
>
> I get mail that says (in parentheses is a description of the character):
>
> ì is 147 (accent-grave-i)
> î is 148 (circumflex-i)
> ë is 145 (umlaut-e)
> í is 146 (accent-acute-i)
> Ö is 133 (umlaut capital o)
>
> ...which means that PHP "recognizes" the correct ASCII value (147) of a
> double-quote, though my Linux box seems to think that the character is
> a lowercase "i" with a grave accent on it. With this kind of strange
> sub-conversion going on, I'm not all that surprised that things are
> getting mucked up.
>
> Is there some way of getting pasted Word text from Windows "clean" in
> this manner, as well as accommodating the already-working-right Mac
> Word text?
>
> Cheers,
> spud.
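
A guess at what is happening with the Windows posts: the three stray characters look like the UTF-8 encoding of the right single quote (U+2019, bytes 0xE2 0x80 0x99) being handled one byte at a time. Read as Windows-1252, 0xE2 is a-circumflex, 0x80 is the Euro sign and 0x99 is the trademark sign, and your table turns that last byte, chr(153), into "(TM)". The Mac mail output has a similar explanation: bytes 147, 148, 145, 146 and 133 display as ì, î, ë, í and Ö in MacRoman, so the ord() values are right and only the display in the mail client is off. Below is a minimal sketch of one way to accept both kinds of input, assuming the form delivers either UTF-8 or Windows-1252 bytes. The name clean_smart_chars() and the choice of replacements are my own illustration, not part of your cleantext():

```php
<?php
// Hypothetical helper (not from the original post): replaces "smart"
// punctuation with plain ASCII, whether it arrives as UTF-8 byte
// sequences or as single Windows-1252 bytes.
function clean_smart_chars($text)
{
    // Multi-byte UTF-8 sequences for the same characters.
    $utf8Map = array(
        "\xE2\x80\x98" => "'",    // U+2018 left single quote
        "\xE2\x80\x99" => "'",    // U+2019 right single quote
        "\xE2\x80\x9C" => '"',    // U+201C left double quote
        "\xE2\x80\x9D" => '"',    // U+201D right double quote
        "\xE2\x80\x93" => "-",    // U+2013 en dash
        "\xE2\x80\x94" => "--",   // U+2014 em dash
        "\xE2\x80\xA6" => "...",  // U+2026 ellipsis
    );

    // Single bytes as used by Windows-1252 (the chr() values in your table).
    $cp1252Map = array(
        chr(145) => "'",
        chr(146) => "'",
        chr(147) => '"',
        chr(148) => '"',
        chr(150) => "-",
        chr(151) => "--",
        chr(133) => "...",
    );

    // preg_match() with the /u modifier returns 1 only for valid UTF-8.
    // Pure ASCII counts as valid UTF-8 and is left unchanged by either map.
    $isUtf8 = preg_match('//u', $text) === 1;

    // Use exactly one map, so single-byte keys are never applied to the
    // continuation bytes of a UTF-8 sequence.
    return strtr($text, $isUtf8 ? $utf8Map : $cp1252Map);
}

echo clean_smart_chars("Wouldn\xE2\x80\x99t you say?"), "\n"; // UTF-8 input
echo clean_smart_chars("Wouldn\x92t you say?"), "\n";         // Windows-1252 input
```

Both calls should print "Wouldn't you say?". Picking the map by checking UTF-8 validity is only a heuristic. If you control the form page, declaring its charset explicitly and handling just that one encoding is probably more dependable.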
>
> -------------------------------------------------------------
> a.h.s. boy
> [EMAIL PROTECTED]
> dadaIMC support
> http://www.dadaimc.org/
> -------------------------------------------------------------
>
> --
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
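
To make the htmlentities() suggestion a little more concrete, here is a rough sketch. It assumes you know (or have decided) which character set the submitted text is in, since htmlentities() can only pick the right entity if it is told how to read the bytes. The wrapper name to_entities() and its default charset are assumptions for illustration only:

```php
<?php
// Hypothetical wrapper (illustrative only): converts special and
// high-bit characters to named HTML entities, reading the input bytes
// according to the given charset.
function to_entities($text, $charset = 'UTF-8')
{
    // ENT_QUOTES also converts both single and double ASCII quotes.
    return htmlentities($text, ENT_QUOTES, $charset);
}

// The same right single quote, submitted in two different encodings.
echo to_entities("Wouldn\xE2\x80\x99t you say?", 'UTF-8'), "\n";
echo to_entities("Wouldn\x92t you say?", 'Windows-1252'), "\n";
```

Note that htmlentities() also escapes <, > and &, so it is best applied to text that no longer contains markup you want to keep, or only to the text between your allowed tags.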