On Sun, Jan 6, 2013 at 6:13 AM, Stas Malyshev <smalys...@sugarcrm.com> wrote:
> What is supposed to be in $allowed_html? If those are simple fixed
> strings and such, why can't you just do preg_split with
> PREG_SPLIT_DELIM_CAPTURE and encode each other element of the result, or
> PREG_SPLIT_OFFSET_CAPTURE if you need something more interesting?

I like to start out conservatively, with everything in a "safe" (i.e.,
fully escaped) state, and then revert from there. If something slips
through, it slips through in a conservative state.

> I would seriously advise though against trying to do HTML parsing with
> regexps unless they are very simple, since browsers will accept a lot of
> broken HTML and will happily run scripts in it, etc.

I agree, which is why I proposed an improved approach in which regexes
would only optionally be used to validate attributes.

> I think with level of complexity that is needed to cover anything but
> the most primitive cases, you need a full-blown HTML/XML parser there.
> Which we do have, so why not use any of them instead of reinventing
> them, if that's what you need?

I don't think the simple state machine used by strip_tags() is a
full-blown HTML/XML parser, yet I find it does provide practical value.
Merging this type of state machine with the ability to check attributes
(via literal string or regex) would be an incremental step beyond what is
present now, and would prove practically beneficial.

I gave this example as one way to implement this approach through an API:

    $new = str_escape_html("<a class='important' href='test'>Test</a>", array(
        'a' => [
            'href'  => '/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w\.-]*)*\/?$/',
            'class' => 'important'
        ],
        'br' => []
    ), "UTF-8");

The idea is that a string would first be escaped using htmlspecialchars().
Then a state machine similar to the one used by strip_tags() would scan the
text for the escaped form of tags. When whitelisted tags are encountered,
their attributes are checked against string literals or regexes. If the tag
and its attributes match the whitelisted form, the tag sequence is
unescaped.

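To make the escape-first idea concrete, here is a rough userland sketch of
how it might behave. It is only an illustration under several assumptions,
not the proposed implementation: the function name str_escape_html_sketch()
is hypothetical, it uses preg_replace_callback() as a stand-in for a real
strip_tags()-style state machine, it treats any rule beginning with "/" as a
regex and anything else as a literal, it only accepts quoted attribute
values, and the URL pattern in the usage lines is a simplified stand-in
rather than the pattern from my example above.

```php
<?php
// Hypothetical sketch: escape everything, then unescape only whitelisted tags.
function str_escape_html_sketch(string $html, array $allowed, string $encoding = 'UTF-8'): string
{
    // Step 1: start from the fully escaped ("safe") state.
    $escaped = htmlspecialchars($html, ENT_QUOTES, $encoding);

    // Step 2: look for the escaped form of tags, i.e. &lt;name ...&gt;
    // (a regex stands in here for a strip_tags()-like state machine).
    $tagPattern = '/&lt;(\/?)([a-z][a-z0-9]*)((?:(?!&lt;|&gt;).)*)&gt;/is';

    $result = preg_replace_callback($tagPattern, function (array $m) use ($allowed, $encoding): string {
        $tag = strtolower($m[2]);
        if (!array_key_exists($tag, $allowed)) {
            return $m[0]; // not whitelisted: leave escaped
        }
        if ($m[1] === '/') {
            // Closing tags may not carry anything besides the name.
            return trim($m[3]) === '' ? "</$tag>" : $m[0];
        }

        // Recover the attribute text as originally written.
        $raw  = htmlspecialchars_decode($m[3], ENT_QUOTES);
        $attr = '\s+([a-z][a-z0-9-]*)\s*=\s*(?|"([^"]*)"|\'([^\']*)\')';
        if (!preg_match('/^(?:' . $attr . ')*\s*$/i', $raw)) {
            return $m[0]; // anything unexpected: leave escaped
        }
        preg_match_all('/\G' . $attr . '/i', $raw, $pairs, PREG_SET_ORDER);

        $out = "<$tag";
        foreach ($pairs as $pair) {
            $name = strtolower($pair[1]);
            if (!array_key_exists($name, $allowed[$tag])) {
                return $m[0]; // unlisted attribute: leave the whole tag escaped
            }
            $value = html_entity_decode($pair[2], ENT_QUOTES | ENT_HTML5, $encoding);
            $rule  = $allowed[$tag][$name];
            // Assumption: rules starting with "/" are regexes, others are literals.
            $ok = ($rule !== '' && $rule[0] === '/')
                ? preg_match($rule, $value) === 1
                : $value === $rule;
            if (!$ok) {
                return $m[0];
            }
            $out .= ' ' . $name . '="' . htmlspecialchars($value, ENT_QUOTES, $encoding) . '"';
        }
        return $out . '>';
    }, $escaped);

    // If the regex engine fails, fall back to the fully escaped string.
    return $result ?? $escaped;
}

$whitelist = [
    'a'  => ['href' => '/^https?:\/\/[a-z0-9.-]+(?:\/[\w.\/-]*)?\z/i', 'class' => 'important'],
    'br' => [],
];
$in = "<a class='important' href='http://example.com/x'>Test</a><script>alert(1)</script>";
echo str_escape_html_sketch($in, $whitelist), "\n";
// prints: <a class="important" href="http://example.com/x">Test</a>&lt;script&gt;alert(1)&lt;/script&gt;
```

With this stand-in pattern, a relative value such as href='test' does not
match, so that tag simply stays escaped. That is the conservative failure
mode I am after: anything the whitelist does not positively recognize
remains in its escaped form.
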
One could also augment the strip_tags() function so that whitelist items
could name the specific attributes allowed through:

    $new = strip_tags("<a class='important' href='test'>Test</a>", "<a :class :url>");

The colon-prefixed symbols could allow predefined attributes, each
validated against a regex. Any unlisted attributes would be stripped.

Thank you for the feedback, Stas.

Adam

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php