On Wed, 15 Mar 2023 13:50:44 GMT, Francesco Nigro <d...@openjdk.org> wrote:
>> I created a randomized version of `Characters.isDigit` which tests with code
>> points picked at random such that any category (Latin1, negative, different
>> planes, unassiged) are equally probable.
>>
>> Baseline:
>>
>>
>> Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
>> Characters.isDigitRandom         1632  avgt   15  5.503 ± 0.371  ns/op
>>
>>
>> Current PR:
>>
>>
>> Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
>> Characters.isDigitRandom         1632  avgt   15  5.393 ± 0.336  ns/op
>>
>>
>> Using StringLatin1.canEncode:
>>
>>
>> Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
>> Characters.isDigitRandom         1632  avgt   15  5.377 ± 0.322  ns/op
>>
>>
>> Seems the PR still has a small improvement for this scenario. The
>> StringLatin1.canEncode regression disappears.
>>
>> In the real world ASCII/Latin1 seems to dominate most data, so this scenario
>> is perhaps not very realistic.
>>
>> I'm running this on a Mac, so cannot try `-prof perfnorm`.
>
> Many thanks to have tried, yep, I was curious indeed re the
> "StringLatin1.canEncode regression" case.
> I would still modify the benchmark to use inputs (I know that will make it
> memory bound sadly, due to reading inputs - but the size of such inputs can
> be a benchmark parameter, together with the bias eg "latin","mix",
> "non-latin") "semi-randomly" generated based on the mentioned
> strategies/biases.
> It will benefit future tests on this, although could be provided as a
> separate PR.

> The StringLatin1.canEncode regression disappears.

I mixed things up so StringLatin1.canEncode was benchmarked without the updated code.
Here are the updated benchmark results:

Baseline:

    Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
    Characters.isDigitRandom         1632  avgt   15  5.437 ± 0.235  ns/op

PR:

    Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
    Characters.isDigitRandom         1632  avgt   15  5.319 ± 0.341  ns/op

StringLatin1.canEncode:

    Benchmark                 (codePoint)  Mode  Cnt  Score   Error  Units
    Characters.isDigitRandom         1632  avgt   15  5.447 ± 0.304  ns/op

So it seems that using StringLatin1.canEncode might still show a regression in the randomized input case as well.

For this PR, I suggest we update StringLatin1.canEncode to be in sync with CharacterData.of, without one calling the other. If anyone wants to investigate the regression further, that can be done outside this PR.

I have independently verified that StringLatin1.canEncode sees performance improvements using the StringIndexOf benchmark.

-------------

PR: https://git.openjdk.org/jdk/pull/13040
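For anyone who wants to pick up the quoted suggestion about "semi-randomly" generated inputs with a size and a bias parameter, here is a rough sketch of what the input generation could look like. It is plain Java rather than a JMH benchmark, and the class name `CodePointInputs`, the `Bias` enum, the `generate` method and the local `isLatin1` helper are all hypothetical names made up for illustration. They are not part of the PR or of the JDK.

```java
import java.util.Random;

public class CodePointInputs {

    /** Hypothetical bias names, modelled on the "latin", "mix", "non-latin" idea in the thread. */
    public enum Bias { LATIN, MIX, NON_LATIN }

    /** Local helper (an assumption, not JDK code): true if cp is in 0..0xFF. */
    public static boolean isLatin1(int cp) {
        // An unsigned shift also rejects negative values, since their high bits are set.
        return cp >>> 8 == 0;
    }

    /** Generates 'size' code points according to 'bias', reproducibly for a given seed. */
    public static int[] generate(Bias bias, int size, long seed) {
        Random random = new Random(seed);
        int[] codePoints = new int[size];
        for (int i = 0; i < size; i++) {
            boolean latin;
            switch (bias) {
                case LATIN:     latin = true; break;
                case NON_LATIN: latin = false; break;
                default:        latin = random.nextBoolean(); break; // MIX: roughly half and half
            }
            if (latin) {
                codePoints[i] = random.nextInt(0x100);
            } else {
                // Anything from 0x100 up to Character.MAX_CODE_POINT; this range
                // deliberately includes surrogates and unassigned code points.
                codePoints[i] = 0x100 + random.nextInt(Character.MAX_CODE_POINT + 1 - 0x100);
            }
        }
        return codePoints;
    }

    public static void main(String[] args) {
        for (Bias bias : Bias.values()) {
            int[] input = generate(bias, 1024, 42L);
            int latinCount = 0;
            int digitCount = 0;
            for (int cp : input) {
                if (isLatin1(cp)) latinCount++;
                if (Character.isDigit(cp)) digitCount++;
            }
            System.out.println(bias + ": " + latinCount + " Latin-1, "
                    + digitCount + " digits, out of " + input.length);
        }
    }
}
```

In a real benchmark the array would presumably be built once in a setup method, with the size and bias exposed as benchmark parameters, and the measured loop would only call `Character.isDigit` on each element. As noted in the quoted text, reading inputs from an array is likely to make the benchmark more memory bound for larger sizes.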