Hello, internals.

As Rowan Collins suggested, I've replaced the lookup table with simple macros:
#define UTF16_LE_CODE_UNIT_IS_HIGH_SURROGATE ((code_unit & 0xFC00) == 0xD800)
#define UTF16_BE_CODE_UNIT_IS_HIGH_SURROGATE ((code_unit & 0x00FC) == 0x00D8)
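
One detail to watch with masks like these: in C, `==` binds tighter than `&`, so the masking step needs its own parentheses. Below is a minimal sketch of the same check written as function-like macros. The macro names and the parameter form are hypothetical (not the names used in the PR), and it assumes the 16-bit unit was loaded on a little-endian host, so a UTF-16BE unit arrives byte-swapped.

```c
#include <stdint.h>

/* Hypothetical function-like variants using the same masks as the macros above.
 * Assumption: the code unit was read as a 16-bit word on a little-endian host. */
#define IS_HIGH_SURROGATE_LE(cu) ((((uint16_t)(cu)) & 0xFC00u) == 0xD800u)

/* A UTF-16BE unit read on a little-endian host has its bytes swapped,
 * so the interesting bits sit in the low byte. */
#define IS_HIGH_SURROGATE_BE_SWAPPED(cu) ((((uint16_t)(cu)) & 0x00FCu) == 0x00D8u)

/* Example inputs: U+1F600 is encoded as the surrogate pair D83D DE00.
 *   IS_HIGH_SURROGATE_LE(0xD83D)          is true  (high surrogate)
 *   IS_HIGH_SURROGATE_LE(0xDE00)          is false (low surrogate)
 *   IS_HIGH_SURROGATE_LE(0x0066)          is false ('f', a BMP character)
 *   IS_HIGH_SURROGATE_BE_SWAPPED(0x3DD8)  is true  (bytes D8 3D read as LE word)
 */
```

Without the inner parentheses, `cu & 0xFC00 == 0xD800` is parsed as `cu & (0xFC00 == 0xD800)`, which is `cu & 0` and therefore always false.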

I reran the benchmarks. Here are the results:

String foobar was repeated 1000000 times. Result string size is 11.4mb
mb_str_split():       string was splitted by 50 into 120000 chunks 1 in 0.400670 s
mb_str_split_utf16(): string was splitted by 50 into 120000 chunks 1 in 0.038947 s
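
For readers who want to see what a UTF-16-only fast path looks like in principle, here is a rough sketch. It is an illustration, not the code from the pull request: the function name `utf16_count_chunks` is hypothetical, the input is assumed to be native-endian 16-bit units, and an unpaired surrogate is simply counted as one character.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: returns how many chunks of `split_length` characters
 * a UTF-16 string would be split into. A high surrogate immediately
 * followed by a low surrogate counts as a single character. */
static size_t utf16_count_chunks(const uint16_t *units, size_t n_units,
                                 size_t split_length)
{
    size_t chars = 0;
    size_t i = 0;

    if (split_length == 0) {
        return 0; /* assumption: treat a zero chunk length as "no chunks" */
    }

    while (i < n_units) {
        int is_high = (units[i] & 0xFC00u) == 0xD800u;
        int next_is_low = (i + 1 < n_units) &&
                          ((units[i + 1] & 0xFC00u) == 0xDC00u);

        /* A valid surrogate pair occupies two units; everything else one. */
        i += (is_high && next_is_low) ? 2 : 1;
        chars++;
    }

    /* Round up so a trailing partial chunk is counted. */
    return (chars + split_length - 1) / split_length;
}
```

With the benchmark input above (6,000,000 BMP characters and a chunk length of 50), this arithmetic gives 120000 chunks, matching the count printed by both functions.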


That satisfies my research interest. The question is whether this has any
practical value. I'd be interested in your opinions.


PHP benchmark code:

<?php
/**
 * Benchmark helper: measures a function's performance by calling it a given
 * number of times and returning the elapsed time in seconds.
 * bmark(int $rounds, string $function, mixed $arg [, mixed $... ] ): ?float
 */
function bmark(): ?float
{
    $args = func_get_args();
    $len = count($args);

    if ($len < 3) {
        trigger_error("At least 3 args expected. Only $len given.", E_USER_ERROR);
        return null;
    }

    $cnt = array_shift($args);
    $fun = array_shift($args);

    $start = microtime(true);
    $i = 0;
    while ($i < $cnt) {
        ++$i;
        $res = call_user_func_array($fun, $args);
    }
    $end = microtime(true) - $start;
    return $end;
}
/* Converts a size in bytes to the most suitable unit of measurement. */
function convert($size){
    if ($size == 0) {
        return 0;
    }
    $unit = array('b', 'kb', 'mb', 'gb', 'tb', 'pb');
    $i = (int)floor(log($size, 1024));
    return round($size / pow(1024, $i), 1) . $unit[$i];
}

$string = "foobar";
$utf16 =  mb_convert_encoding($string,"UTF-16");
$k = 1e6;
$long = str_repeat($utf16, $k);
$size = convert(strlen($long));
$rounds = 1;
$split_length = 50;

echo "String $string was repeated $k times. Result string size is $size\n";
printf("mb_str_split():       string was splitted by %d into %d chunks %d in %f s\n"
  , $split_length
  , count(mb_str_split($long, $split_length, "UTF-16"))
  , $rounds
  , bmark($rounds, "mb_str_split", $long, $split_length, "UTF-16")
);

printf("mb_str_split_utf16(): string was splitted by %d into %d chunks %d in %f s\n"
  , $split_length
  , count(mb_str_split_utf16($long, $split_length, "UTF-16"))
  , $rounds
  , bmark($rounds, "mb_str_split_utf16", $long, $split_length, "UTF-16")
);




On Mon, 11 Feb 2019 at 18:00, Dan Ackroyd <dan...@basereality.com> wrote:

> On Sun, 10 Feb 2019 at 12:29, Legale Legage <legale.leg...@gmail.com>
> wrote:
> >
> >
> >
> https://github.com/php/php-src/pull/3715/commits/d868059626290b7ba773b957045e08c3efb1d603#diff-22d593ced03b2cb94450d9f9990865c8R38
> >
> > To do, or not to do: that is the question.
> > What do you think?
>
> Opening separate pull requests for separate changes is good as it
> allows them to be discussed separately. That change is bundled with
> the mb_str_split() changes, so it's quite hard to see what is
> optimisation and what is part of the approved RFC.
>
> Although memory is cheap, the change appears to increase the static
> allocation of memory by 128KB for something that >95% of PHP
> programmers will never use, which is not a good idea.
>
> > show a more than 2 times speed increase.
>
> Lies, damn lies and statistics.
>
> If it takes the time to parse a megabyte string from 0.000002 to
> 0.000001, no one cares.
> If it takes the time to parse a megabyte string from 2 seconds to 1
> second, wow that's great!
>
> i.e. Saying a two times speed increase without context doesn't give
> people enough information to evaluate it.
>
> But this would be easier to discuss as a separate PR.
>
> cheers
> Dan
>
