Hi everyone,

We're having a debate (in the comments of this PR <https://github.com/apache/commons-text/pull/310>) about whether Commons-Text should unescape numeric character entities that lack a trailing semicolon.
The ability to unescape such entities has long been part of the library, via the semiColonOptional <https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/NumericEntityUnescaper.java#L48> option in the NumericEntityUnescaper class.

While testing this option, I discovered a small bug that allows the unescaper to be bypassed. A string like this: *<iframe src="&#106avascript:alert(1)">* is ignored by the unescaper: even though the entity is a decimal one, the algorithm looks for hexadecimal characters in all cases, so it includes the "a" after the "6" in the entity and then fails to parse the result as a decimal number. This prompted me to fix it in this commit <https://github.com/apache/commons-text/pull/310/commits/05280c2d474fce08bfb19cc2178949e5d384c999> and open the PR.

However, as mentioned above, there is a debate about whether semicolon-less entities should be unescaped in the first place. garydgregory's point is that such entities are not part of the HTML specification, so Commons-Text should not handle them. My point, and kinow's, is that these semicolon-less entities are unescaped by virtually every modern browser (tested with Chrome, Firefox, Edge and Safari), so users of Commons-Text could reasonably expect the library to support them.

I also pointed out that my fix only makes the existing unescaping work correctly with decimal entities, so in my opinion the PR shouldn't be blocked by the debate.

What's your opinion?

Thanks!

Richard
https://github.com/apache/commons-text/pull/310
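
P.S. For anyone who wants to see the problem without checking out the branch, here is a minimal, self-contained sketch of the digit-scanning issue. It is not the Commons-Text source: the class EntityScanSketch, its methods and the decimalAware flag are hypothetical names made up for illustration, and the sketch assumes an unparsable entity is simply left as-is. With decimalAware set to false it scans hex digits for every entity (the behaviour described above); with true it only accepts 0-9 for decimal entities.

```java
// Hypothetical sketch for illustration only, not the Commons-Text implementation.
final class EntityScanSketch {

    // Returns true for 0-9, and additionally for a-f/A-F when allowHex is set.
    static boolean isDigit(char c, boolean allowHex) {
        if (c >= '0' && c <= '9') {
            return true;
        }
        return allowHex && ((c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'));
    }

    // Unescapes "&#NNN" and "&#xHHH" entities; the trailing ';' is optional.
    // decimalAware = false reproduces the "always scan hex digits" behaviour.
    static String unescape(String input, boolean decimalAware) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            if (!input.startsWith("&#", i)) {
                out.append(input.charAt(i++));
                continue;
            }
            int start = i + 2;
            boolean hex = start < input.length()
                    && (input.charAt(start) == 'x' || input.charAt(start) == 'X');
            if (hex) {
                start++;
            }
            // The buggy variant accepts hex digits even for decimal entities.
            boolean allowHex = hex || !decimalAware;
            int end = start;
            while (end < input.length() && isDigit(input.charAt(end), allowHex)) {
                end++;
            }
            int codePoint;
            try {
                codePoint = Integer.parseInt(input.substring(start, end), hex ? 16 : 10);
            } catch (NumberFormatException e) {
                codePoint = -1; // e.g. "106a" parsed as a decimal number
            }
            if (!Character.isValidCodePoint(codePoint)) {
                // Assumption of this sketch: leave the text untouched.
                out.append(input.charAt(i++));
                continue;
            }
            out.appendCodePoint(codePoint);
            i = end;
            if (i < input.length() && input.charAt(i) == ';') {
                i++; // consume the optional semicolon
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String payload = "<iframe src=\"&#106avascript:alert(1)\">";
        System.out.println(unescape(payload, false)); // entity left escaped
        System.out.println(unescape(payload, true));  // entity decoded to 'j'
    }
}
```

In the first call the scanner reads "106a", fails to parse it as a decimal number and leaves the entity alone; in the second it stops at "106" and decodes it to "j".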