On Sat, Jan 5, 2013 at 6:26 PM, Stas Malyshev <smalys...@sugarcrm.com> wrote:
> Hi!
>
>> It's important to escape output according to context. PHP provides
>> functions such as htmlspecialchars() to escape output when the context
>> is HTML. However, one often desires to allow some subset of HTML
>> through without escaping (e.g., <br />, <b></b>, etc.)
>
> I think what you are looking for is HtmlPurifier and such. Doing it in
> the core properly would be pretty hard.

Hi Stas,

HtmlPurifier is a fantastic tool, but it offers far more than I
typically need or want (e.g., CSS validation, attempts to remove
malicious code, fixes to illegal nesting). I'm wondering about
something that lets more through than htmlspecialchars() via some form
of whitelisting, but lets less through than strip_tags() (as noted,
attributes can be problematic).

>> https://github.com/AdamJonR/nephtali-php-ext/blob/master/nephtali.c
>>
>
> Could you describe in detail what that function actually does, with
> examples?

Sure!

Function:

// Example userland code, commented to explain the flow:
function str_escape_html($string, $allowed_html = array(), $charset = 'UTF-8')
{
    // Use htmlspecialchars() because it only touches the 5 most
    // important chars and doesn't mess them up.
    // Start out safely by ensuring everything is HTML-escaped
    // (whitelisting approach).
    $escaped_string = htmlspecialchars($string, ENT_QUOTES, $charset);

    // Check whether there are whitelisted sequences which, if present,
    // we can safely revert.
    if ($allowed_html) {
        // Cycle through the whitelisted sequences.
        foreach ($allowed_html as $sequence) {
            // Save the escaped version of the sequence so we know what
            // to revert safely. This also works fairly well for regexes
            // because <, >, &, ', and " don't have special meaning in
            // regexes, but character sets cause trouble, something I've
            // just learned to work around.
            // http://php.net/manual/en/regexp.reference.meta.php
            $escaped_sequence = htmlspecialchars($sequence, ENT_QUOTES, $charset);

            // If the sequence begins and ends with a '/', treat it as a
            // regex. The length check avoids reading past an empty string.
            if (strlen($sequence) > 1 && $sequence[0] == '/' && $sequence[strlen($sequence) - 1] == '/') {
                // Revert regex matches. ENT_QUOTES is passed so that
                // escaped single quotes (&#039;) are decoded as well.
                $escaped_string = preg_replace_callback(
                    $escaped_sequence,
                    function ($matches) use ($charset) {
                        return html_entity_decode($matches[0], ENT_QUOTES, $charset);
                    },
                    $escaped_string
                );
            } else {
                // Otherwise, treat it as a standard string sequence
                // and revert it.
                $escaped_string = str_replace($escaped_sequence, $sequence, $escaped_string);
            }
        }
    }

    return $escaped_string;
}

$input = '<div class="expected"><b onclick="alert(\'Oh no!\')">click me</b><br id="whyIdMe" /><b class="emphasize">do not click the other bolded text</b></div>';
$draconian_bold_tag_regex = '/<b( class="([a-z]+)")?>[a-zA-Z_.,! ]+<\/b>/';

echo "strip tags: " . strip_tags($input, '<b><div><br>') . "<br /><br />";
echo "htmlspecialchars: " . htmlspecialchars($input, ENT_QUOTES, 'UTF-8') . "<br /><br />";
echo "str_escape_html: " . str_escape_html($input, array($draconian_bold_tag_regex, '<br />', '<div class="expected">', '</div>'), 'UTF-8');

Problems with above implementation:

Regexing large chunks of HTML is a pain, and easy to get wrong.
Additionally, you have to enter the opening and closing tags in the
whitelist separately for literals, and character sets can break things
due to the escaping. For example, [^<>] becomes [^&lt;&gt;] once the
pattern itself is escaped, which also excludes the characters &, l, t,
g, and ;.

Solutions:

What I'm looking to build is functionality that adds to
htmlspecialchars() by implementing whitelisting through some primitive
parsing (identify the tags without regard for validating HTML, similar
to strip_tags()) and some regexing (validate attribute contents), using
a similar approach to the above function. The new function would better
break up the tag components, making the required regexes much easier to
work with. For example:

$new = str_escape_html("<a class='important' href='test'>Test</a>", array(
    'a' => [
        // strings beginning and ending with '/' are considered regexes
        'href' => '/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/',
        // other strings are just evaluated as literals
        'class' => 'important'
    ],
    'br' => [] // has no allowed attributes
), "UTF-8");

Conclusion:

Bridging the gap between strip_tags() and htmlspecialchars() seems like
a reasonable consideration for PHP's core.
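
To make the tag-and-attribute whitelist proposed above more concrete, here is a rough userland sketch. It is not the extension's code: the names str_escape_html_sketch(), tag_allowed(), and attr_value_allowed() are hypothetical, the href rule is a simplified stand-in, and two policies are assumptions of this sketch: a tag with any attribute that is not whitelisted is escaped as a whole, and accepted tags pass through verbatim.

```php
<?php
// Hypothetical helper: does $value satisfy one whitelist rule?
// Rules that begin and end with '/' are regexes; anything else is a literal.
function attr_value_allowed($rule, $value)
{
    if (strlen($rule) > 1 && $rule[0] === '/' && substr($rule, -1) === '/') {
        return preg_match($rule, $value) === 1;
    }
    return $rule === $value;
}

// Hypothetical helper: is this single tag acceptable under $allowed?
function tag_allowed($tag, array $allowed)
{
    // Break the tag into closing slash, name, and raw attribute text.
    if (!preg_match('~^<(/?)([a-zA-Z][a-zA-Z0-9]*)(.*?)\s*/?>$~s', $tag, $m)) {
        return false;
    }
    $name = strtolower($m[2]);
    if (!array_key_exists($name, $allowed)) {
        return false;
    }
    preg_match_all('~([a-zA-Z-]+)=(["\'])(.*?)\2~s', $m[3], $attrs, PREG_SET_ORDER);
    if ($m[1] === '/') {
        return count($attrs) === 0; // closing tags carry no attributes
    }
    foreach ($attrs as $a) {
        $attr = strtolower($a[1]);
        // Assumption: one unknown or invalid attribute rejects the whole tag.
        if (!isset($allowed[$name][$attr])
            || !attr_value_allowed($allowed[$name][$attr], $a[3])) {
            return false;
        }
    }
    return true;
}

// Sketch of the proposed interface: escape everything except whitelisted tags.
function str_escape_html_sketch($string, array $allowed = array(), $charset = 'UTF-8')
{
    // Primitive tag grammar: name, quoted attributes, optional self-close.
    $attr = '\s+[a-zA-Z-]+=(?:"[^"<>]*"|\'[^\'<>]*\')';
    $tag  = '</?[a-zA-Z][a-zA-Z0-9]*(?:' . $attr . ')*\s*/?>';

    // With one capture group, odd-indexed pieces are tag candidates
    // and even-indexed pieces are the text between them.
    $pieces = preg_split('~(' . $tag . ')~', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($pieces as $i => $piece) {
        if ($i % 2 === 1 && tag_allowed($piece, $allowed)) {
            $out .= $piece; // whitelisted tag passes through unchanged
        } else {
            $out .= htmlspecialchars($piece, ENT_QUOTES, $charset);
        }
    }
    return $out;
}

$rules = array(
    'a'  => array('href' => '/^https?:\/\/[a-z0-9.-]+(\/[\w.\/-]*)?$/', 'class' => 'important'),
    'b'  => array(),
    'br' => array(),
);
echo str_escape_html_sketch('<a class="important" href="http://example.com/x">ok</a><br /><b onclick="evil()">hi</b>', $rules);
// <a class="important" href="http://example.com/x">ok</a><br />&lt;b onclick=&quot;evil()&quot;&gt;hi</b>
```

Like the literal whitelist in the original function, this treats opening and closing tags independently, so a closing tag can survive even when its opening tag was escaped (the </b> in the output above). Anything that does not fit the primitive tag grammar, such as unquoted attribute values or comments, is treated as text and escaped.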

While I do use HTMLPurifier or other tools for more involved HTML
processing, I often wish the core had something just a bit more
flexible than htmlspecialchars(), but just a bit more protective than
strip_tags(), that I could use for my typical escaping needs. I'm going
to spend some time refining this approach in an extension, but I was
looking for feedback before proceeding further.

Thanks,

Adam

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php