On 2/15/2023 6:03 AM, Lydia de Jongh wrote:
> Hi,
> Very interesting topic! On which I have NO experience 🙈
> In some other languages every variable IS an object..... by default.
> As far as I understand, the code above is meant as internal.
> But what if any variable is a small object.
> Has this been ever considered? Or would it use too much performance?
> $oString = 'my text';
> $oString->toUpper();
> echo $oString; // 'MY TEXT'
The above represents a significant amount of scope creep but it's
certainly interesting. So let's explore it a bit and gauge the response.
The above code will currently throw an error. Significant global
adoption of such a change will take a fairly long time - probably a
decade, maybe longer.
AFAIK, there is nothing technically preventing the core Zend engine from
accepting a -> token after a string variable and calling a function that
performs an inline modification of the string.
As a brief test, I just ran the example code through PHP and got: "PHP
Fatal error: Uncaught Error: Call to a member function toUpper() on
string in test.php:4" The error message shows that the Zend engine
already recognizes toUpper() as an attempted method call on a string. It
just doesn't know what to do with it. So the logic for supporting ->
method calls on strings appears, at least from my very brief test, to
already be mostly in place. Nice!
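The test above can be reproduced in a few lines. This is a minimal
sketch, assuming PHP 7 or later, where a method call on a non-object
throws a catchable Error. The variable name and toUpper() come from
Lydia's example; nothing here is a proposed API:

```php
<?php
// Calling a method on a plain string throws an Error (PHP 7+).
$oString = 'my text';
$message = null;

try {
    $oString->toUpper();
} catch (\Error $e) {
    // The engine parsed the call fine; it fails at runtime because
    // a string has no methods.
    $message = $e->getMessage();
}

echo $message, PHP_EOL;
```

The script parses and runs up to the call, which matches the point that
the -> syntax itself is already accepted after a string variable.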
Supporting this would likely result in two distinct internal functions
that would have to be maintained: one inline string-object method
variant that can avoid copy-on-write (e.g. $var->toUpper()) and one that
only does copy-on-write (e.g. strtoupper()). Repeat that for all of the
existing string functions. Alternatively, the main body of each function
could move into its own function with a parameter that distinguishes
"function (copy)" calls from "method (possibly inline)" calls, which
would create some additional overhead for the existing
ext/standard/string functions. The average performance loss for regular
function calls would need to be benchmarked. Nobody likes seeing
performance losses even if they amount to less than a 1% reduction. C
function calls are way faster than PHP userland but they still have some
overhead. This is just a thought exploration of how it could be
implemented.
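To make the "shared body with a mode parameter" idea a bit more
concrete, here is a rough userland model. It is only a sketch: the real
change would be in C inside ext/standard, and upper_impl() and its
$inPlace flag are hypothetical names I made up for illustration.

```php
<?php
// Hypothetical userland model of one shared body serving both call styles.
// upper_impl() does not exist in PHP.
function upper_impl(string &$str, bool $inPlace): string
{
    if ($inPlace) {
        // "Method" path: overwrite the caller's buffer byte by byte.
        for ($i = 0, $n = strlen($str); $i < $n; $i++) {
            $str[$i] = strtoupper($str[$i]);
        }
        return $str;
    }

    // "Function" path: leave the input alone and return a new string.
    return strtoupper($str);
}

$a = 'my text';
$copy = upper_impl($a, false);   // $a is unchanged

$b = 'my text';
upper_impl($b, true);            // $b is modified

echo $a, ' / ', $copy, ' / ', $b, PHP_EOL;
```

The extra branch on $inPlace is the overhead mentioned above. Whether it
is measurable for the existing functions is exactly what would need
benchmarking.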
With this approach, a $var->repeat("\x00", 4096, 50) could work to start
at position 50 and write 4,096 zero bytes. That again adds a parameter
for an offset, though. Maybe $var[50...4096 + 50]->repeat("\x00", 4096)
could solve that? That's a bit awkward to look at, requires adding range
support to strings (and maybe arrays too because you know someone will
want that as well), and probably breaks a lot of things.
However, I'm not sure this idea can be used with virtual buffers that
expressly set their size. zend_string (how strings are stored) simply
doesn't have support for it. There's a length member but no size
member. Internally, the zend_string implementation assumes length + 1 =
size.
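For reference, the offset variant of repeat() can be approximated in
userland today with string offset writes. This is a sketch under my own
assumptions: repeat_into() is a hypothetical helper, not a PHP function,
and it only handles a single fill byte. The buffer never changes length,
which is the closest userland gets to a fixed-size buffer given that
zend_string only tracks a length.

```php
<?php
// Hypothetical helper; repeat_into() is not part of PHP.
// Writes $count copies of the single byte $byte into $buf, starting at
// $offset, without changing the length of $buf.
function repeat_into(string &$buf, string $byte, int $count, int $offset): void
{
    if (strlen($byte) !== 1 || $count < 0 || $offset < 0
        || $offset + $count > strlen($buf)) {
        throw new InvalidArgumentException('range must fit inside the buffer');
    }

    for ($i = $offset, $end = $offset + $count; $i < $end; $i++) {
        $buf[$i] = $byte;
    }
}

$buf = str_repeat('x', 8192);
repeat_into($buf, "\x00", 4096, 50);   // bytes 50..4145 become zero
```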
If you got this far and know how PHP, C, and CPU hardware work, you can
skip ahead to the last two paragraphs. The next few paragraphs delve
into some details to try to explain to Lydia (and others who are
following along) what's going on under the hood and why I focused on
substrings. Apologies in advance for my rambling.
Avoiding copy-on-write requires the internal reference count (refcount)
to effectively be 1. Reference counting helps reduce the number of times
a copy is made. Fewer copies generally result in faster performance. A
refcount of 1 does happen more frequently inside a loop. In real world
code, depending on what is being done, the first loop iteration might
have many references to a string while the second loop iteration that is
operating on the same data might have a refcount of just one. This
situation happens frequently enough to consider inline options.
Memory allocation is one of the slower operations in computer programs.
Ideally, a program makes as few allocation requests to the system as
possible. PHP avoids making system calls to allocate memory by pooling
reclaimed memory into multiple memory pools for reuse. Copying strings
from one buffer to another buffer is also avoided by leveraging
reference counting. However, this creates the scenario where every
modified string has its contents copied into a new buffer.
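Copy-on-write can be observed from userland through the memory manager's
accounting. This is a small sketch, assuming memory_get_usage() reflects
the Zend allocator as it does in a normal CLI run; the exact byte counts
will vary by PHP version and platform, so I am not quoting any:

```php
<?php
// Observe copy-on-write through the memory manager's accounting.
$a = str_repeat('x', 1024 * 1024);   // 1 MiB string

$before = memory_get_usage();
$b = $a;                             // shares the buffer, bumps the refcount
$afterAssign = memory_get_usage();

$b[0] = 'y';                         // the write forces a separate copy
$afterWrite = memory_get_usage();

$assignCost = $afterAssign - $before;
$writeCost  = $afterWrite - $afterAssign;

printf("assign: %d bytes, first write: %d bytes\n", $assignCost, $writeCost);
```

The assignment should cost next to nothing, while the first write to $b
should cost at least the full size of the string because the shared
buffer has to be duplicated first.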
Let's take this fairly common but simple code to see what happens in
Zend engine:
$pos = strrpos($str, "/");
$str = substr($str, 0, $pos + 1);
The above substr() results in one "logical" memory allocation, one
memory copy operation, and one logical free operation (whether it
actually makes system calls to allocate/free memory is way beyond the
scope of this paragraph). We say we want the substring of a certain
size, which allocates space for a temporary copy that can hold that
string. Then the data is copied from one buffer to the other. Then the
temporary is assigned to the original input variable. Assuming nothing
else is referencing the original value (i.e. its refcount drops to 0),
its buffer eventually re-enters the memory pool for future allocations.
All of that is done transparently to the user so the user generally
doesn't have to worry about memory allocation strategies. There's no
good way to detect this situation to optimize it, although I'm sure the
JIT does try to do so on some level when it is enabled. As a side
effect, there are also no built-in tools currently available to control
memory allocation strategies for individual allocations when the need
does arise. There are some controls for managing garbage collection but
those have global impact.
Doing that operation one time is fast enough and not really a problem.
Doing it 1,000,000 times in a loop is where we end up constantly copying
memory around when we could potentially work on the same memory buffer
the entire time. We still might end up using the same memory buffers
over and over due to recycling them through the PHP memory pool, which
means the buffers might get to sit in the L1 or L2 cache in the CPU, but
it does leave some performance on the table because repeatedly copying a
buffer or portions of it can be an unnecessary operation. Buffers that
are larger than the CPU's caches are going to suffer the most because
the CPU will constantly request data from main memory, evict cache
lines, and stall while waiting for more data to arrive. That's not
exactly optimal/ideal. A buffer that is modified inline is more likely
to stay in the L1 and L2 caches and therefore be much closer to the CPU
core, resulting in notably faster performance.
Passing a pointer around in C is much faster than copying memory. The
problem is exposing pointers to userland, especially in Internet-facing
software. Pointers are notoriously unsafe - just look at the zillion
buffer overflow vulnerabilities (CVEs) that are reported annually across
all software products. Copy-on-write, by comparison, is a much safer
operation at the cost of performance. However, pointers let us just
point at a substring or general chunk of memory instead of copying it,
which significantly reduces the overhead since pointers are simple
integer values that contain a memory address. And those values are
small enough to sit in CPU registers, which are blazing fast. CPUs only
have a handful of registers though because each register dramatically
increases the cost of the CPU die. So if we can just point at the
memory we want to "extract" instead of actually copying the data into
its own string object, we can potentially save a ton of CPU cycles,
especially when working with data inside a loop.
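The "point at it instead of copying it" idea can be modeled in userland
with an offset and a length over a shared buffer. This is purely an
illustrative sketch, assuming PHP 7.4+ for typed properties: StringView
and truncateAfterLast() are hypothetical names, and the byte-by-byte
scan is not fast in userland. The point is only that shrinking the view
touches two integers and copies no string data.

```php
<?php
// Hypothetical userland stand-in for pointing at a substring.
// StringView is not a PHP class.
final class StringView
{
    private string $buf;
    private int $offset = 0;
    private int $length;

    public function __construct(string $buf)
    {
        $this->buf = $buf;             // shares the buffer; no bytes copied
        $this->length = strlen($buf);
    }

    // Shrink the view so it ends just after the last occurrence of $byte,
    // like substr($str, 0, strrpos($str, $byte) + 1) but without a copy.
    public function truncateAfterLast(string $byte): bool
    {
        $start = $this->offset;
        for ($i = $start + $this->length - 1; $i >= $start; $i--) {
            if ($this->buf[$i] === $byte) {
                $this->length = $i - $start + 1;
                return true;
            }
        }
        return false;
    }

    public function length(): int
    {
        return $this->length;
    }

    // Materialize the view; this is the only place a copy is made.
    public function __toString(): string
    {
        return substr($this->buf, $this->offset, $this->length);
    }
}

$view = new StringView('/var/www/html/index.php');
$found = $view->truncateAfterLast('/');
echo $view, PHP_EOL;
```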
Overall, I think substrings offer the most obvious/apparent area for
performance gains and probably have, implementation details aside, the
least amount of friction. But maybe we should consider the larger
ecosystem of string functions as well? Or should this just be a
possible longer term idea that requires more thought and research and
thus the scope should be limited and we put Lydia's idea under Future
Scope in the RFC? Other thoughts/comments?
Added as Open Issue 10 to the RFC. Thank you for your input.
--
Thomas Hruska
CubicleSoft President
CubicleSoft has over 80 original open source projects and counting.
Plus a couple of commercial/retail products.
What software are you looking to build?