Re: [PHP-DEV] [RFC] Working With Substrings

Thomas Hruska Wed, 15 Feb 2023 11:35:30 -0800

On 2/15/2023 6:03 AM, Lydia de Jongh wrote:

Hi,
Very interesting topic! On which I have NO experience 🙈



In some other languages every variable IS an object..... by default.

As far as I understand, the code above is meant as internal.
But what if any variable is a small object.
Has this been ever considered? Or would it use too much performance?

$oString = 'my text';

$oString->toUpper();

echo $oString;  // 'MY TEXT'

The above represents a significant amount of scope creep but it's certainly interesting. So let's explore it a bit and gauge the response.

The above code will currently throw an error. Significant global adoption of such a change will take a fairly long time - probably a decade, maybe longer.

AFAIK, there is nothing technically preventing the core Zend engine from accepting a -> token after a string variable and calling a function that performs an inline modification of the string.

As a brief test, I just ran the example code through PHP and got: "PHP Fatal error: Uncaught Error: Call to a member function toUpper() on string in test.php:4" The error message shows that Zend engine clearly already recognizes toUpper() as an attempted function/method call on a string...it just doesn't know what to do with it. So the logic for supporting -> method calls on strings appears, at least from my very brief test, to already be mostly in place. Nice!
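For anyone who wants to reproduce that brief test without the script dying, here is a minimal sketch. It catches the Error instead of letting it go fatal. The variable name comes from Lydia's example; the try/catch scaffolding is my own addition.

```php
<?php
// Calling a method on a plain string currently throws an Error at runtime.
$oString = 'my text';
$message = null;

try {
    $oString->toUpper();
} catch (\Error $e) {
    // The engine recognized the method call but has no object to dispatch to.
    $message = $e->getMessage();
}

echo $message, "\n";
```

The string itself is left untouched by the failed call.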

Supporting this would likely result in two distinct internal functions that would have to be maintained. One inline string-object method variant that can avoid copy-on-write (e.g. $var->toUpper()) and one that only does copy-on-write (e.g. strtoupper()). Repeat that for all of the existing string functions. Alternatively, the main function body for each function could move into its own function that has a parameter for distinguishing the difference between "function (copy) vs. method (possibly inline)" calls, which would create some additional overhead for the existing ext/standard/string functions. The average performance loss for regular function calls would need to be benchmarked. Nobody likes seeing performance losses even if they end up being a less than 1% reduction. C function calls are way faster than PHP userland but they still have some overhead. This is just a thought exploration of how it could be implemented.
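To make the "shared body with a mode parameter" alternative a bit more concrete, here is a rough userland analogy. It only illustrates the calling convention: the real work would happen in C inside ext/standard, and userland strtoupper() still allocates a new string either way. All three function names are hypothetical.

```php
<?php
// Hypothetical userland analogy; none of these names exist in PHP itself.

// Shared body: $inline selects "method (possibly inline)" vs. "function (copy)".
function upper_impl(string &$s, bool $inline): string
{
    $result = strtoupper($s);
    if ($inline) {
        $s = $result;   // method-style call: the caller's variable is updated
    }
    return $result;
}

// Function-style wrapper: the argument is passed by value and stays untouched.
function upper_copy(string $s): string
{
    return upper_impl($s, false);
}

// Method-style wrapper: modifies the caller's variable.
function upper_inline(string &$s): void
{
    upper_impl($s, true);
}

$a = 'my text';
$b = upper_copy($a);    // $a keeps its value, $b holds the uppercase copy

$c = 'my text';
upper_inline($c);       // $c itself is now uppercase
```

The extra boolean check in the shared body is the kind of small overhead that would need to be benchmarked.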

With this approach, a $var->repeat("\x00", 4096, 50) could work to start at position 50 and write 4,096 zero bytes. But that again adds a parameter for an offset. But maybe $var[50...4096 + 50]->repeat("\x00", 4096) could solve that? That's a bit awkward to look at, requires adding range support to strings (and maybe arrays too because you know someone will want that as well), and probably breaks a lot of things.
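For comparison, this is roughly what that operation looks like in userland today. It is only a sketch: fill_range() is a hypothetical helper, and the 8,192-byte buffer is an arbitrary size picked for illustration. It uses str_repeat() and substr_replace(), so it builds a temporary and a new result string rather than writing in place.

```php
<?php
// Hypothetical helper; approximates $var->repeat("\x00", 4096, 50) in userland.
function fill_range(string &$buf, string $byte, int $count, int $offset): void
{
    if (strlen($byte) !== 1 || $offset < 0 || $count < 0
        || $offset + $count > strlen($buf)) {
        throw new InvalidArgumentException('range must lie inside the buffer');
    }
    // str_repeat() allocates a temporary; substr_replace() allocates the result.
    $buf = substr_replace($buf, str_repeat($byte, $count), $offset, $count);
}

$var = str_repeat('A', 8192);        // arbitrary example buffer
fill_range($var, "\x00", 4096, 50);  // bytes 50 through 4145 become zero bytes
```

The length of $var is unchanged; only the bytes inside the range are replaced.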

However, I'm not sure this idea can be used with virtual buffers that expressly set their size. zend_string (how strings are stored) simply doesn't have support for it. There's a length member but no size member. Internally, the zend_string implementation assumes length + 1 = size.

If you got this far and know how PHP, C, and CPU hardware work, you can skip ahead to the last two paragraphs. The next few paragraphs delve into some details to try to explain to Lydia (and others who are following along) what's going on under the hood with why I focused on substrings. Apologies in advance for my rambling.

Avoiding copy-on-write requires the internal reference count total (refcount) to effectively be 1. Reference counting helps reduce the number of times a copy is made. Fewer copies generally result in faster performance. A refcount of 1 does happen more frequently when inside a loop. In real world code, depending on what is being done, the first loop iteration might have many references to a string while the second loop iteration that is operating on the same data might have a refcount of just one. This situation happens frequently enough to consider inline options.
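A small sketch that makes the copy-on-write behavior visible from userland by watching memory_get_usage(). The 1 MiB size is an arbitrary choice, and the exact byte counts will vary by PHP version and allocator, so treat the printed numbers as indicative only.

```php
<?php
// Roughly 1 MiB of data, held by a single variable (refcount of 1).
$a = str_repeat('x', 1024 * 1024);

$before = memory_get_usage();
$b = $a;                            // second reference: refcount goes up, no buffer copy
$afterAssign = memory_get_usage();

$b[0] = 'y';                        // write while shared: the buffer is copied first
$afterWrite = memory_get_usage();

printf(
    "assign: %d bytes, write: %d bytes\n",
    $afterAssign - $before,
    $afterWrite - $afterAssign
);
```

The assignment should cost next to nothing, while the first write to the shared string should cost about as much as the string itself. After the write, $a still starts with 'x' and only $b was changed.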

Memory allocation is one of the slower operations in computer programs. Ideally, a program makes as few allocation requests to the system as possible. PHP avoids making system calls to allocate memory by pooling reclaimed memory into multiple memory pools for reuse. Copying strings from one buffer to another buffer is also avoided by leveraging reference counting. However, this creates the scenario where every modified string has its buffer copied from one buffer to the next. Let's take this fairly common but simple code to see what happens in Zend engine:


$pos = strrpos($str, "/");
$str = substr($str, 0, $pos + 1);

The above substr() results in one "logical" memory allocation and one logical free operation (whether it actually makes system calls to allocate/free memory is way beyond the scope of this paragraph) and one memory copy operation. We say we want the substring of a certain size, which allocates space to create a temporary copy that can hold that string. Then the data is copied from one buffer to another buffer. Then we assign the temporary copy to the original input variable. That causes the original value, assuming nothing else is referencing it (aka a refcount of 0), to eventually re-enter the memory pool for future allocations and assigns the temporary to the variable. All of that is done transparently to the user so the user generally doesn't have to worry about memory allocation strategies. There's no good way to detect this situation to optimize it, although I'm sure the JIT does try to do so on some level when it is enabled. As a side-effect, there are also no built-in tools currently available to care about memory allocation strategies for individual allocations when the need does arise. There are some controls for managing garbage collection but those have global impact.
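Here is the same two-line snippet as a runnable sketch, with the steps described above marked in comments. The function name and the example path are made up for illustration. I also added a guard for strings without a "/", since strrpos() returns false in that case and false + 1 would evaluate to 1.

```php
<?php
// Hypothetical wrapper around the two lines above.
function keep_through_last_slash(string $str): string
{
    $pos = strrpos($str, '/');
    if ($pos === false) {
        return $str;    // assumption: leave the input alone when there is no "/"
    }
    // substr() allocates a new string and copies $pos + 1 bytes into it.
    return substr($str, 0, $pos + 1);
}

$str = '/var/www/html/index.php';       // example input, not from the original post
// The old value is released here if nothing else references it.
$str = keep_through_last_slash($str);

echo $str, "\n";
```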

Doing that operation one time is fast enough and not really a problem. Doing it 1,000,000 times in a loop is where we end up constantly copying memory around when we could potentially work on the same memory buffer the entire time. We still might end up using the same memory buffers over and over due to recycling them through the PHP memory pool, which means the buffers might get to sit in the L1 or L2 cache in the CPU, but it does leave some performance on the table because copying a buffer or portions of it repeatedly can be an unnecessary operation. Buffers that are larger than the CPU's cache line sizes are going to suffer the most because there will be constant requests to main memory for the information that the CPU needs to modify and will constantly flush the cache lines and stall out while waiting for more data to arrive. That's not exactly optimal/ideal. A buffer that is modified inline is more likely to stay in the L1 and L2 cache lines and therefore be much closer to the CPU core, resulting in notably faster performance.

Pointers in C are much faster than copying memory. The problem is exposing pointers to userland, especially in Internet-facing software. Pointers are notoriously unsafe - just look at the zillion buffer overflow vulnerabilities (CVEs) that are reported annually across all software products. Copy-on-write, by comparison, is a much safer operation at the cost of performance. However, pointers let us just point at a substring or general chunk of memory instead of copying it, which significantly reduces the overhead since pointers are simple integer values that contain a memory address. And those values are small enough to sit in CPU registers, which are blazing fast. CPUs only have a handful of registers though because each register dramatically increases the cost of the CPU die. So if we can just point at the memory we want to "extract" instead of actually copying the data into its own string object, we can potentially save a ton of CPU cycles, especially when working with data inside a loop.
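The closest thing userland has to "just point at it" today is carrying an integer offset around and using the functions that accept one. The sketch below contrasts the two styles on a trivial task, counting newline-terminated lines. Both function names are hypothetical, and this is an illustration of the idea rather than a benchmark.

```php
<?php
// Copying approach: every iteration extracts the remainder into a new string.
function count_lines_copying(string $buf): int
{
    $count = 0;
    while (($pos = strpos($buf, "\n")) !== false) {
        $count++;
        $buf = substr($buf, $pos + 1);   // new allocation plus a copy of the tail
    }
    return $count;
}

// Offset approach: an integer plays the role of the pointer.
function count_lines_offset(string $buf): int
{
    $count = 0;
    $offset = 0;
    while (($pos = strpos($buf, "\n", $offset)) !== false) {
        $count++;
        $offset = $pos + 1;              // move the "pointer", copy nothing
    }
    return $count;
}

$data = "first\nsecond\nthird\n";
echo count_lines_copying($data), ' ', count_lines_offset($data), "\n";
```

Both return the same count, but the second version never copies the buffer. The catch is that only some string functions take an offset, which is part of why I focused on substrings.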

Overall, I think substrings offer the most obvious/apparent area for performance gains and probably have, implementation details aside, the least amount of friction. But maybe we should consider the larger ecosystem of string functions as well? Or should this just be a possible longer term idea that requires more thought and research and thus the scope should be limited and we put Lydia's idea under Future Scope in the RFC? Other thoughts/comments?


Added as Open Issue 10 to the RFC.  Thank you for your input.

--
Thomas Hruska
CubicleSoft President

CubicleSoft has over 80 original open source projects and counting.
Plus a couple of commercial/retail products.

What software are you looking to build?

