On Mon, 10 Jun 2024 23:10:05 GMT, Shaojin Wen <d...@openjdk.org> wrote:
>> @wenshao >>> I think the performance of the Unsafe branch may be the best data for the >>> C2 optimizer. @eme64 can help me see if C2 can do it? >> >> Have you tried to see if the optimization actually was done/taken? You can >> use the `TraceMergeStores,` flag. Can you present the generated assembly >> code of the benchmarks, and explain the difference based on the generated >> assembly code? You can run JMH penchmarks with `perf`. These two blogs may >> help you: >> >> http://psy-lob-saw.blogspot.com/2015/07/jmh-perfasm.html >> https://shipilev.net/blog/2016/arrays-wisdom-ancients/#_meet_jmh_prof_perfasm >> >> @liach I don't think it makes a difference if it is `int` or `byte` >> constants. Or what exactly is the code change you are proposing? > > @eme64 It seems that when the following code uses StringUTF16.putChar, C2's > optimization is not as good as the manual merging and storage effect. > > class AbstractStringBuilder { > private AbstractStringBuilder appendNull() { > // ... > StringUTF16.putCharsAt(val, count, 'n', 'u', 'l', 'l'); > // ... > } > > public AbstractStringBuilder append(boolean b) { > // ... > StringUTF16.putCharsAt(val, count, 't', 'r', 'u', 'e'); > // ... > StringUTF16.putCharsAt(val, count, 'f', 'a', 'l', 's', 'e'); > // ... > } > } > > class StringUTF16 { > public static void putCharsAt(byte[] value, int i, char c1, char c2, char > c3, char c4) { > putChar(value, i , c1); > putChar(value, i + 1, c2); > putChar(value, i + 2, c3); > putChar(value, i + 3, c4); > } > > @IntrinsicCandidate > // intrinsic performs no bounds checks > static void putChar(byte[] val, int index, int c) { > assert index >= 0 && index < length(val) : "Trusted caller missed > bounds check"; > index <<= 1; > val[index++] = (byte)(c >> HI_BYTE_SHIFT); > val[index] = (byte)(c >> LO_BYTE_SHIFT); > } > } > > > The code for manually merging storage is as follows, without using Unsafe: > > class AbstractStringBuilder { > static final long NULL_UTF16; > static final long TRUE_UTF16; > static final long FALS_UTF16; > > static { > byte[] bytes = new byte[8]; > > StringUTF16.putCharsAt(bytes, 0, 'n', 'u', 'l', 'l'); > NULL_UTF16 = getLong(bytes, 0); > > StringUTF16.putCharsAt(bytes, 0, 't', 'r', 'u', 'e'); > TRUE_UTF16 = getLong(bytes, 0); > > StringUTF16.putCharsAt(bytes, 0, 'f', 'a', 'l', 's'); > FALS_UTF16 = getLong(bytes, 0); > } > > private static long getLong(byte[] bytes, int offset) { > return (((long)bytes[offset ] & 0xff) ) | > (((long)bytes[offset + 1] & 0xff) << 8) | > (((long)bytes[offset + 2] & 0xff) << 16) | > (((long)bytes[offset + 3] & 0xff) << 24) | > (((long)bytes[offset + 4] & 0xff) << 32) | > (((long)bytes[offset + 5] & 0xff) << 40) | > (((long)bytes[offset + 6] & 0xff) << 48) | > (((long)bytes[offset + 7] & 0xff) << 56); > } > > private static void setLong(byte[] array, int offset, long value) { > array[offset] = (byte) value; > array[offset + 1] = (byte) (value >> 8); > array[offset + 2] = (byte) (value >> 16); > array[offset + 3] = (byte) (value >> 24); > arra... @wenshao > @eme64 It seems that when the following code uses StringUTF16.putChar, C2's > optimization is not as good as the manual merging and storage effect. As I asked above, you will need to provide some evidence / generated assembly / perf data, and logs from `TraceMergeStores`. I currently do not have time to produce these myself, and I think they would be crucial to determine where the missing performance has gone. See my earlier comment: https://github.com/openjdk/jdk/pull/19626#issuecomment-2158533469 And please also try @cl4es advide here: https://github.com/openjdk/jdk/pull/19626#issuecomment-2159509806 And sure, maybe you need some public API for setting multiple bytes at once, which the `MergeStores` optimization can optimize. I'm a C2 engineer, so I leave that up to the library folks ;) ------------- PR Comment: https://git.openjdk.org/jdk/pull/19626#issuecomment-2159895757