Hello, internals.

As Rowan Collins suggested, I've replaced the lookup table with simple macros:

#define UTF16_LE_CODE_UNIT_IS_HIGH_SURROGATE ((code_unit & 0xFC00) == 0xD800)
#define UTF16_BE_CODE_UNIT_IS_HIGH_SURROGATE ((code_unit & 0x00FC) == 0x00D8)
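For illustration, here is a minimal self-contained sketch of the same check. It is not the code from the PR: the function-like macro form, the names IS_HIGH_SURROGATE_HOST and IS_HIGH_SURROGATE_SWAPPED, and the helper utf16_count_code_points() are assumptions made for this example. Note that in C `==` binds tighter than `&`, so the masked value has to be parenthesized before the comparison.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function-like variants of the macros above.
 * Without the inner parentheses, `u & 0xFC00 == 0xD800` would parse as
 * `u & (0xFC00 == 0xD800)`, i.e. `u & 0`, which is always false. */

/* Code unit already in host byte order: high surrogates are 0xD800..0xDBFF. */
#define IS_HIGH_SURROGATE_HOST(u)    (((u) & 0xFC00u) == 0xD800u)

/* Code unit whose two bytes are swapped relative to host byte order. */
#define IS_HIGH_SURROGATE_SWAPPED(u) (((u) & 0x00FCu) == 0x00D8u)

/* Count code points in a host-order UTF-16 buffer of n_units code units.
 * A high surrogate that is followed by another unit is consumed together
 * with that unit as one pair; the low surrogate is not validated here. */
static size_t utf16_count_code_points(const uint16_t *s, size_t n_units)
{
    size_t count = 0;
    size_t i = 0;

    while (i < n_units) {
        if (IS_HIGH_SURROGATE_HOST(s[i]) && i + 1 < n_units) {
            i += 2; /* surrogate pair: two code units, one code point */
        } else {
            i += 1; /* BMP code point, or a trailing unpaired surrogate */
        }
        count++;
    }
    return count;
}
```

With the buffer { 0x0061, 0xD83D, 0xDE00, 0x0062 } ("a", U+1F600, "b"), this helper counts three code points from four code units. A bit mask like this needs no static table, at the cost of one AND and one compare per code unit.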
I reran the benchmarks. Here are the results:

String foobar was repeated 1000000 times.
Result string size is 11.4mb
mb_str_split(): string was splitted by 50 into 120000 chunks 1 in 0.400670 s
mb_str_split_utf16(): string was splitted by 50 into 120000 chunks 1 in 0.038947 s

That is roughly a 10x speedup on this input.

I have satisfied my research interest. The question is: is there any practical value? I am interested in your opinion.

PHP benchmark code:

<?php
/**
 * Benchmark helper: calls the given function the given number of times
 * and returns the total elapsed time in seconds.
 * bmark(int $rounds, string $function, mixed $arg [, mixed $... ] ): ?float
 */
function bmark(): ?float
{
    $args = func_get_args();
    $len = count($args);

    if ($len < 3) {
        trigger_error("At least 3 args expected. Only $len given.", E_USER_ERROR);
        return null;
    }

    $cnt = array_shift($args); // number of rounds
    $fun = array_shift($args); // function name; the rest are its arguments

    $start = microtime(true);
    $i = 0;
    while ($i < $cnt) {
        ++$i;
        $res = call_user_func_array($fun, $args);
    }
    $end = microtime(true) - $start;

    return $end;
}

/* Converts a size in bytes to the most suitable unit of measurement. */
function convert($size)
{
    if ($size == 0) {
        return 0;
    }
    $unit = array('b', 'kb', 'mb', 'gb', 'tb', 'pb');
    $i = (int)floor(log($size, 1024));
    return round($size / pow(1024, $i), 1) . $unit[$i];
}

$string = "foobar";
$utf16 = mb_convert_encoding($string, "UTF-16");
$k = 1e6;
$long = str_repeat($utf16, $k);
$size = convert(strlen($long));
$rounds = 1;
$split_length = 50;

echo "String $string was repeated $k times.
Result string size is $size\n";

printf("mb_str_split(): string was splitted by %d into %d chunks %d in %f s\n"
    , $split_length
    , count(mb_str_split($long, $split_length, "UTF-16"))
    , $rounds
    , bmark($rounds, "mb_str_split", $long, $split_length, "UTF-16")
);

printf("mb_str_split_utf16(): string was splitted by %d into %d chunks %d in %f s\n"
    , $split_length
    , count(mb_str_split_utf16($long, $split_length, "UTF-16"))
    , $rounds
    , bmark($rounds, "mb_str_split_utf16", $long, $split_length, "UTF-16")
);

On Mon, 11 Feb 2019 at 18:00, Dan Ackroyd <dan...@basereality.com> wrote:

> On Sun, 10 Feb 2019 at 12:29, Legale Legage <legale.leg...@gmail.com>
> wrote:
> >
> > https://github.com/php/php-src/pull/3715/commits/d868059626290b7ba773b957045e08c3efb1d603#diff-22d593ced03b2cb94450d9f9990865c8R38
> >
> > To do, or not to do: that is the question.
> > What do you think?
>
> Opening separate pull requests for separate changes is good as it
> allows them to be discussed separately. That change is bundled with
> the mb_str_split() changes, so it's quite hard to see what is
> optimisation and what is part of the approved RFC.
>
> Although memory is cheap, the change appears to increase the static
> allocation of memory by 128KB for something that >95% of PHP
> programmers will never use, which is not a good idea.
>
> > show a more than 2 times speed increase.
>
> Lies, damn lies and statistics.
>
> If it takes the time to parse a megabyte string from 0.000002 to
> 0.000001, no one cares.
> If it takes the time to parse a megabyte string from 2 seconds to 1
> second, wow that's great!
>
> i.e. Saying a two times speed increase without context doesn't give
> people enough information to evaluate it.
>
> But this would be easier to discuss as a separate PR.
>
> cheers
> Dan
>