Hello,

Given the absence of answers, I thought I might add another argument to my
point.

While I didn't mention it in my first post, please consider that the
semiColonOptional
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48>
option
is never activated by default.
The constructor
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L79>
of
the NumericEntityUnescaper class uses by default the DEFAULT_OPTION
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L57>
which
is set to the semiColonRequired
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L43>
 value.

Then, the UNESCAPE_HTML4
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/StringEscapeUtils.java#L440>
translator
in the StringEscapeUtils class, which is in turn used by the unescapeHtml4
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/StringEscapeUtils.java#L791>
method,
indeed calls the aforementioned constructor.
All of this is consistent with the fact that, by default, the Commons Text
library requires semicolons to be used after numeric HTML entities, and as
such follows strictly the HTML specification.

However, the semiColonOptional
<https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48>
option
does exist and may be used in certain ways. Therefore, I think we can
expect it to work correctly if used with decimal entities.
That's the point of my commit, nothing more than that.
I hope this little explanation might help you make your opinion.

Thanks in advance,
Richard

Le mar. 10 mai 2022 à 09:48, Richard BUNEL <rbune...@gmail.com> a écrit :

> Hi everyone,
>
> We're having a debate (in the comment section of this PR
> <https://github.com/apache/commons-text/pull/310>) on the legitimacy of
> unescaping semicolon-less numerical character entities in Commons-Text.
>
> The possibility to unescape such entities has long been part of the
> library, via the semiColonOptional
> <https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48>
>  option
> in the NumericEntityUnescaper class.
>
> While testing this option, I discovered a small bug which allows to bypass
> the unescaper.
> A string like this: *<iframe src="&#106avascript:alert(1)">* is
> ignored by the unescaper, because even though this entity is a decimal one,
> the algorithm searches for hexidecimal characters in all cases and includes
> the "a" after the "6".
> This prompted me to fix it in this commit
> <https://github.com/apache/commons-text/pull/310/commits/05280c2d474fce08bfb19cc2178949e5d384c999>
>  and
> open the PR.
>
> However, as mentioned earlier, there is a debate on the legitimacy of
> unescaping semicolon-less from the beginning.
>
> The point of garydgregory is that such entities do not form part of the
> HTML specification and as such Commons-Text should not consider it.
>
> My point and kinow's however, is that these semicolon-less entities are
> unescaped by virtually every modern browsers (tested with Chrome, Firefox,
> Edge and Safari) and that Commos-Text could reasonnably expect the library
> to support them.
>
> Also, I pointed that my fix only makes the unsecaping work correctly with
> decimal entities, so in my opinion the PR shouldn't be blocked by the
> debate.
>
> What's your opinion about it ?
>
> Thanks !
> Richard
>
> https://github.com/apache/commons-text/pull/310
>

Reply via email to