Re: RFR: 8304245: Speed up CharacterData.of by avoiding bit shifting in the latin1 fast-path test [v2]

Eirik Bjorsnos Wed, 15 Mar 2023 07:32:16 -0700

On Wed, 15 Mar 2023 13:50:44 GMT, Francesco Nigro <d...@openjdk.org> wrote:


>> I created a randomized version of `Characters.isDigit` which tests with code 
>> points picked at random so that each category (Latin1, negative, different 
>> planes, unassigned) is equally probable.
>> 
>> Baseline:
>> 
>> 
>> Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
>> Characters.isDigitRandom         1632  avgt   15  5.503 ± 0.371  ns/op
>> 
>> 
>> Current PR:
>> 
>> 
>> Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
>> Characters.isDigitRandom         1632  avgt   15  5.393 ± 0.336  ns/op
>> 
>> 
>> Using StringLatin1.canEncode:
>> 
>> 
>> Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
>> Characters.isDigitRandom         1632  avgt   15  5.377 ± 0.322  ns/op
>> 
>> 
>> Seems the PR still has a small improvement for this scenario. The 
>> StringLatin1.canEncode regression disappears.
>> 
>> In the real world ASCII/Latin1 seems to dominate most data, so this scenario 
>> is perhaps not very realistic.
>> 
>> I'm running this on a Mac, so cannot try `-prof perfnorm`.
>
> Many thanks for trying; yes, I was indeed curious about the 
> "StringLatin1.canEncode regression" case.
> I would still modify the benchmark to use inputs "semi-randomly" generated 
> according to the mentioned strategies/biases. (I know that reading inputs 
> will sadly make it memory bound, but the size of such inputs can be a 
> benchmark parameter, together with the bias, e.g. "latin", "mix", 
> "non-latin".)
> It will benefit future tests on this, although it could be provided as a 
> separate PR.

> The StringLatin1.canEncode regression disappears.

I mixed things up: the StringLatin1.canEncode variant was benchmarked without 
the updated code.

Here are updated benchmark results:


Baseline:


Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
Characters.isDigitRandom         1632  avgt   15  5.437 ± 0.235  ns/op


PR:


Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
Characters.isDigitRandom         1632  avgt   15  5.319 ± 0.341  ns/op


StringLatin1.canEncode:


Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
Characters.isDigitRandom         1632  avgt   15  5.447 ± 0.304  ns/op

So it seems that using StringLatin1.canEncode might still show a regression in 
the randomized input case as well.
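
For reference, the randomized input can be sketched roughly as below. This is 
a minimal illustration, not the benchmark code from the PR: the class 
`RandomCodePoints`, its `Category` enum and the `fill` helper are hypothetical 
names, and the split into categories is an assumption based on the description 
quoted above. The categories are not disjoint (the plane ranges also contain 
some unassigned code points). `fill` pre-generates an input array of a given 
size, along the lines of the suggestion to make the input size a benchmark 
parameter.

```java
import java.util.Random;

// Hypothetical helper, not part of the JDK or of the PR.
final class RandomCodePoints {
    enum Category { LATIN1, NEGATIVE, BMP, SUPPLEMENTARY, UNASSIGNED }

    // Picks a category uniformly, then a value within that category.
    static int next(Random rnd) {
        Category[] all = Category.values();
        return next(rnd, all[rnd.nextInt(all.length)]);
    }

    static int next(Random rnd, Category category) {
        switch (category) {
            case LATIN1:
                return rnd.nextInt(0x100);                  // 0..0xFF
            case NEGATIVE:
                return -1 - rnd.nextInt(Integer.MAX_VALUE); // always < 0
            case BMP:
                return 0x100 + rnd.nextInt(0x10000 - 0x100); // BMP beyond Latin1
            case SUPPLEMENTARY:
                return 0x10000 + rnd.nextInt(0x110000 - 0x10000);
            default: { // UNASSIGNED: rejection-sample over all code points
                int cp;
                do {
                    cp = rnd.nextInt(Character.MAX_CODE_POINT + 1);
                } while (Character.getType(cp) != Character.UNASSIGNED);
                return cp;
            }
        }
    }

    // Pre-generates 'size' inputs; a fixed seed keeps runs comparable.
    static int[] fill(int size, long seed) {
        Random rnd = new Random(seed);
        int[] out = new int[size];
        for (int i = 0; i < size; i++) {
            out[i] = next(rnd);
        }
        return out;
    }
}
```

A benchmark method could then loop over the array returned by 
`RandomCodePoints.fill(size, seed)` and call `Character.isDigit` on each 
element.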

For this PR, I suggest we update StringLatin1.canEncode to be in sync with 
CharacterData.of, without one calling the other. If anyone wants to investigate 
the regression further, that can be done outside this PR.
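
To illustrate what "in sync" means here, the sketch below compares two ways of 
writing a Latin1 test. It assumes the shift-based form is `(cp >>> 8) == 0` 
and the shift-free form is a plain range comparison; neither expression is 
quoted in this thread, so treat both as assumptions rather than the actual JDK 
code. `Latin1Check` and its methods are hypothetical names.

```java
// Hypothetical illustration; names are not taken from the JDK sources.
final class Latin1Check {
    // Shift-based form: true when no bit above the low 8 bits is set.
    static boolean byShift(int cp) {
        return (cp >>> 8) == 0;
    }

    // Range form: no shift; negative values fail the first comparison.
    static boolean byRange(int cp) {
        return cp >= 0 && cp <= 0xFF;
    }

    // Returns true if both forms agree on every value in [from, to].
    static boolean agree(int from, int to) {
        // A long counter avoids overflow when 'to' is Integer.MAX_VALUE.
        for (long cp = from; cp <= to; cp++) {
            if (byShift((int) cp) != byRange((int) cp)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agree(-0x20000, 0x20000));
    }
}
```

Both forms accept exactly the values 0 to 255, so keeping the same expression 
in both places changes no behaviour; only the generated code may differ.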

I have independently verified that StringLatin1.canEncode sees performance 
improvements using the StringIndexOf benchmark.

-------------

PR: https://git.openjdk.org/jdk/pull/13040
