You might look at the htmlentities() function...
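
A minimal sketch of what I mean, assuming the submitted bytes really are Windows-1252 (which is what your chr() table describes). The helper name entities_from_cp1252 is made up for illustration:

```php
<?php
// Hypothetical helper: treat $text as Windows-1252 bytes and return
// named HTML entities wherever one exists.
function entities_from_cp1252($text)
{
    // The third argument tells htmlentities() how to interpret the bytes.
    return htmlentities($text, ENT_QUOTES, 'cp1252');
}

// chr(147) and chr(148) are the curly double quotes in Windows-1252.
echo entities_from_cp1252(chr(147) . 'Hi' . chr(148)), "\n";
```

Bear in mind that htmlentities() also escapes <, >, & and quotes, so any markup you deliberately left in after stripping tags would be escaped as well.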

On Tue, 29 Oct 2002, a.h.s. boy wrote:

> I'm working on a PHP-based CMS that allows users to post lengthy
> article texts by submitting through a form. The short version of my
> quandary is this: How can I create a conversion routine that reliably
> substitutes HTML-acceptable output for high-ASCII characters pasted
> into the form (from a variety of operating systems)?
>
> The longer version is this:
> In order to prevent scripting vulnerabilities and a variety of other
> undesirable content, I run the body of the text through a cleantext()
> function. This function first strips out illegal HTML tags and
> JavaScript. So far so good.
>
> Then it attempts to perform some character conversions to clean up
> 8-bit ASCII characters in the text, so smart quotes, en- and em-dashes,
> ellipses, etc. are converted to suitable alternatives, or to HTML
> entities. I'm using:
>
> // Reference:
> // chr(133) = ellipsis
> // chr(145) = left curly single quote
> // chr(146) = right curly single quote (apostrophe)
> // chr(147) = left curly double quote
> // chr(148) = right curly double quote
> // chr(149) = bullet
> // chr(150) = en dash
> // chr(151) = em dash
> // chr(153) = trademark
> // chr(160) = non-breaking space
> // chr(161) = inverted exclamation mark
> // chr(169) = copyright symbol
> // chr(171) = left guillemet
> // chr(173) = soft hyphen
> // chr(174) = registered trademark
> // chr(187) = right guillemet
> // chr(188) = 1/4 fraction
> // chr(189) = 1/2 fraction
> // chr(190) = 3/4 fraction
> // chr(191) = inverted question mark
> $changearr = array(" "=>" ",
>       "\r"=>"\n",
>       "\r\n"=>"\n",
>       "\n\n\n" => "\n\n",
>       chr(133)=>"...",
>       chr(145)=>"'",
>       chr(146)=>"'",
>       chr(147)=>"\"",
>       chr(148)=>"\"",
>       chr(149)=>"*",
>       chr(150)=>"-",
>       chr(151)=>"--",
>       chr(153)=>"(TM)",
>       chr(160)=>" ",
>       chr(161)=>"&iexcl;",
>       chr(169)=>"&copy;",
>       chr(171)=>"&laquo;",
>       chr(173)=>"-",
>       chr(174)=>"(R)",
>       chr(187)=>"&raquo;",
>       chr(188)=>"1/4",
>       chr(189)=>"1/2",
>       chr(190)=>"3/4",
>       chr(191)=>"&iquest;");
> $returnstr = strtr($returnstr,$changearr);
>
> The server's on a Linux box (RedHat 7.2, standard US installation);
> users can obviously post from any sort of operating system.
>
> This routine seems to work well on Word text pasted in from my Mac (OS
> X 10.2.1), but I see a number of articles appearing on the site with
> text like:
>
> Wouldnâ€(TM)t you say?
>
> (That's "Wouldn[a circumflex][Euro symbol](TM)t" instead of "Wouldn't".)
>
> ...which was almost definitely pasted in from a Windows-based Microsoft
> Word, and the conversion routines are failing. (And inserting even
> weirder characters...why would the single quote be replaced by _3_
> character substitutions?)
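
On the three characters: my guess (an assumption, since I can't see the raw bytes) is that the browser submitted that form as UTF-8 rather than as a single-byte character set. In UTF-8 the right curly quote, U+2019, is three bytes: E2 80 99. Your table leaves the first two alone but turns byte 0x99, i.e. chr(153), into "(TM)". Displayed as Windows-1252, E2 and 80 come out as a-circumflex and the Euro sign. A small sketch that reproduces it:

```php
<?php
// U+2019 (right single curly quote) written out as its UTF-8 bytes.
$quote = "\xE2\x80\x99";
$text  = 'Wouldn' . $quote . 't';

echo strlen($quote), "\n";   // byte count, not character count
echo bin2hex($quote), "\n";  // hex dump of those bytes

// Only the trademark entry from the original table is needed to show it.
$mangled = strtr($text, array(chr(153) => '(TM)'));
echo bin2hex($mangled), "\n";
```

None of the other entries in the table touch E2 or 80, which is why exactly one of the three bytes gets rewritten.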
>
> I understand that Windows may well use a different character set for
> high-ASCII, but I frankly don't understand how to work that knowledge
> into this situation. And the combination of original text, Linux,
> chr(), and ord() stuff just doesn't make sense to me. For example, if I
> post text (from my Mac) containing only:
>
> “”‘’…
> (that's [open-double-quote][close-double-quote][open-single-quote]
> [close-single-quote][ellipsis])
>
> and have PHP run this:
>
> $mailstr = '';  // start empty so .= has something to append to
> for ($x = 0; $x < strlen($str); $x++) {
>     $mailstr .= $str[$x].' is '.ord($str[$x])."\n";
> }
> mail('me','Characters',$mailstr);
>
> I get mail that says (in parentheses is a description of the character):
>
> ì is 147 (accent-grave-i)
> î is 148 (circumflex-i)
> ë is 145 (umlaut-e)
> í is 146 (accent-acute-i)
> Ö is 133 (umlaut capital o)
>
> ...which means that PHP "recognizes" the correct ASCII value (147) of a
> double-quote, though my Linux box seems to think that the character is
> a lowercase "i" with a grave accent on it. With this kind of strange
> sub-conversion going on, I'm not all that surprised that things are
> getting mucked up.
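
The bytes there look fine: 147, 148, 145, 146 and 133 are exactly the Windows-1252 values from your table. What is off is only the display. I suspect (assumption) the mail was rendered as MacRoman on your Mac, where those same byte values belong to accented letters, so the Linux box is probably not reinterpreting anything. A sketch with a hand-written lookup for just these five bytes (both arrays are mine, copied from the published code pages):

```php
<?php
// Byte value => what it means under each single-byte code page.
// Only the five bytes from the mail are listed.
$cp1252 = array(
    133 => 'ellipsis',
    145 => 'left single quote',
    146 => 'right single quote',
    147 => 'left double quote',
    148 => 'right double quote',
);
$macroman = array(
    133 => 'O umlaut',
    145 => 'e umlaut',
    146 => 'i acute',
    147 => 'i grave',
    148 => 'i circumflex',
);

// Same traversal as the ord() loop, over raw Windows-1252 bytes.
$str = chr(147) . chr(148) . chr(145) . chr(146) . chr(133);
$lines = array();
for ($x = 0; $x < strlen($str); $x++) {
    $byte = ord($str[$x]);
    $lines[] = $byte . ': ' . $cp1252[$byte] . ' / ' . $macroman[$byte];
}
echo implode("\n", $lines), "\n";
```

The second column matches the characters in your mail one for one, which is what makes me think the viewer, not PHP, picked the wrong character set.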
>
> Is there some way of getting pasted Word text from Windows "clean" in
> this manner, as well as accommodating the already-working-right Mac
> Word text?
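
One way to handle both cases might be to normalize everything to UTF-8 first and only then substitute. This is a rough sketch under two assumptions: input that is not valid UTF-8 is Windows-1252, and the iconv extension is available. The function name normalize_pasted_text and the (partial) table are mine, not something from your cleantext():

```php
<?php
// Hypothetical helper, not part of the original cleantext().
function normalize_pasted_text($text)
{
    // preg_match() with the /u modifier only succeeds on valid UTF-8.
    if (preg_match('//u', $text) !== 1) {
        // Assumption: anything that is not UTF-8 is Windows-1252.
        $converted = iconv('CP1252', 'UTF-8', $text);
        if ($converted !== false) {
            $text = $converted;
        }
    }
    // Substitution table keyed by UTF-8 byte sequences (partial).
    $map = array(
        "\xE2\x80\x98" => "'",   // U+2018 left single quote
        "\xE2\x80\x99" => "'",   // U+2019 right single quote
        "\xE2\x80\x9C" => '"',   // U+201C left double quote
        "\xE2\x80\x9D" => '"',   // U+201D right double quote
        "\xE2\x80\xA6" => '...', // U+2026 ellipsis
        "\xE2\x80\x93" => '-',   // U+2013 en dash
        "\xE2\x80\x94" => '--',  // U+2014 em dash
    );
    return strtr($text, $map);
}

echo normalize_pasted_text("Wouldn\xE2\x80\x99t"), "\n";     // UTF-8 input
echo normalize_pasted_text('Wouldn' . chr(146) . 't'), "\n"; // cp1252 input
```

Because the keys are whole multi-byte sequences, a stray 0x99 inside a UTF-8 character is no longer matched on its own. The detection is a heuristic, though: Windows-1252 text that happens to form valid UTF-8 would be passed through unconverted.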
>
> Cheers,
> spud.
>
> -------------------------------------------------------------
> a.h.s. boy
> [EMAIL PROTECTED]
> dadaIMC support
> http://www.dadaimc.org/
> -------------------------------------------------------------
>
> --
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
>
>

