On Mon, Jul 5, 2021 at 8:15 PM Craig Francis <cr...@craigfrancis.co.uk> wrote:
> Hi Internals, > > I have opened voting on https://wiki.php.net/rfc/is_literal for the > is-literal function. > > The vote closes 2021-07-19 > > The proposal is to add the function is_literal(), a simple way to identify > if a string was written by a developer, removing the risk of a variable > containing an Injection Vulnerability. > > This implementation is for literal strings ONLY (after discussion over > allowing integers) and, thanks to the amazing work of Joe Watkins, now > works fully with compiler optimisations, interned strings etc. > The new implementation, while being very predictable, also comes at a cost. The RFC doesn't really explain how this works, so let me explain it here (based on a cursory look through the patch): Strings have a new flag that indicates whether they are "literal". The problem is that PHP performs string interning, which means that certain strings, mostly those seen during compilation of PHP code, are deduplicated. If you write "foo" two times in a piece of PHP code, there will only be one interned string "foo". Now, what happens if there is both a literal string "foo" and a non-literal string "foo" and we want to intern both? I believe what the original implementation did is to just assume that all interned strings are literal strings (this is an educated guess, I don't know where to find the old implementation). This makes things technically very simple, and is a pretty sensible assumption in practice. The reason is that interned strings are almost always derived from literal strings in the program. The main exception to that are a) single character strings, where a lot of code will prefer fetching an interned string rather than allocating a new one and b) the result of optimizations in conjunction with opcache, which will render strings like $a=1; $b="Foo".$a; interned as a result of constant propagation. What the new implementation does is to actually allow two separate interned strings for "foo"(literal) and "foo"(not literal). Part of the hashtable implementation needs to be duplicated, because it now needs to distinguish strings that are nominally equal. The complexity involved here doesn't look terrible, but I'm unsure about the performance and memory usage impact. Where per-request strings are concerned, I expect duplication to be low, because most per-request interned strings are going to be literal. However, I don't think this holds for permanent interned strings, which will be predominantly (or entirely?) non-literal. At least it doesn't look to me like names of internal functions/classes/etc are considered literal. I think this may be problematic, in that now the permanent non-literal interned string in the function/class tables will be different from the per-request literal interned string used by user code to reference it. This means that all of these are going to be duplicated, and will no longer benefit from efficient identity comparison, and will have to be compared based on string contents instead. I'm not sure what the actual impact on performance would be. The RFC indicates that the impact is minor, but I believe those measurements were made with the original version of the RFC, which did not try to distinguish literal and non-literal interned strings. Overall, this is a no vote from my side. While I was not entirely convinced by the RFC, I was willing to accept the original approach based on sheer simplicity -- tracking the "probably literal" flag didn't really cost us anything in terms of either technical complexity or performance. With the new implementation this isn't the case anymore. I could be convinced that the technical implications here are unproblematic based on further analysis, but the current RFC doesn't really discuss this point at all. Regards, Nikita