On Fri, 1 Sep 2023 10:57:57 GMT, iaroslavski <d...@openjdk.org> wrote:
>>> @AlanBateman If it helps, the changes made by @vamsi-parasa to >>> DualPivotQuickSort.java don't change the logic as written in Java. They >>> only carve out the functionality into separate Java methods retaining the >>> meaning exactly as before. These Java methods are then optimized through a >>> stub. >> >> Hi Alan, >> >> As Sandhya (@sviswa7) mentioned, this PR does not make any logic changes to >> the DualPivotQuickSort.java. >> >> The code was refactored to separate out the partitioning logic (dual pivot >> and single pivot) into separate methods. These methods are being >> intrinsified using AVX512 to do vectorized partitioning preserving the logic >> of Java side. Similarly, the methods to sort small arrays >> (insertionSort/mixedInsertionSort) are being intrinsified as well to use >> AVX512 sort while preserving the logic on Java side. >> >> Could you please go through the changes and provide your comments and >> suggestions? >> >> Thanks, >> Vamsi > > Hi Vamsi (@vamsi-parasa)! > > Please see my few questions/suggestions related to method arraySort() and > other code. > > **1.** If size < MIN_FAST_SMALL_ARRAY_SORT_SIZE (=16), you don't go into > intrinsic arraySort(), method > mixedInsertionSort is invoked instead. > > `if (size < MAX_MIXED_INSERTION_SORT_SIZE + bits && (bits & 1) > 0) {` > ` int last = high - 3 * ((size >> 5) << 3);` > ` if (size < MIN_FAST_SMALL_ARRAY_SORT_SIZE) mixedInsertionSort(a, low, > last , high);` > ` else arraySort(int.class, a, baseOffset, low, high, last);` > ` return;` > `}` > > Is Java mixedInsertionSort actually faster than intrinsic arraySort() on > small (< 16) arrays? > What if you try to invoke arraySort() on all small arrays and don't use > constant MIN_FAST_SMALL_ARRAY_SORT_SIZE at all? > Like this: > > `if (size < MAX_MIXED_INSERTION_SORT_SIZE + bits && (bits & 1) > 0) {` > ` int last = high - 3 * ((size >> 5) << 3);` > ` arraySort(int.class, a, baseOffset, low, high, last);` > ` return;` > `}` > > Could you please check and benchmark it? > > **2.** On the next step you apply the same approach when you invoke > insertionSort() on the small leftmost part. > > `if (size < MAX_INSERTION_SORT_SIZE) {` > ` if (size < MIN_FAST_SMALL_ARRAY_SORT_SIZE) insertionSort(a, low, high);` > ` else arraySort(int.class, a, baseOffset, low, high, -1);` > ` return;` > `}` > > But this invocation happens only once, on the leftmost part only. I think we > can invoke Java code and > don't go to intrinsic arraySort(). I guess we will not have profit with > intrinsic method here. See: > > `if (size < MAX_INSERTION_SORT_SIZE) {` > ` insertionSort(a, low, high);` > ` return;` > `}` > > Could you please try this case without intrinsic and benchmark it? > > **3.** You introduce constant int baseOffset = Unsafe.ARRAY_INT_BASE_OFFSET > inside the loop. > What if we pass the constant directly? For example: > > `arraySort(int.class, a, Unsafe.ARRAY_INT_BASE_OFFSET, low, high, last);` > > **3.A** The first parameter of arraySort() is elemType (int.class, long,class > etc.) > The second parameter base offset depends on this elem type. For example, > if elemType is int.class, base offset will be Unsafe.ARRAY_INT_BASE_OFFSET. > Can we eliminate base offset here and set up value inside intrinsic? > > **4.** You define 'int[] pivotIndices;' at the beginning of the loop, but the > indices will be used > later (or not used at all). My proposal to declare pivotIndices in > if-then-else, see: > > `int[] pivotIndices = new int[] {e1, e5};` > > **5.** Method arrayPartition has argument isDualPi... Hello Vladimir (@iaroslavski) Please have a look at the new commit that was recently pushed. It incorporates the various suggestions you've given. For small arrays (<16 for 32bit types and <20 for 64 bit types), it was observed that insertionSort is faster than the AVX512 bitonic sort. In order to address this, the insertionSort has been moved into the native instrinsic and is now actually slightly faster than the Java version as seen int the table below. On the Java side, it looks cleaner as there are no threshold checks to switch between the intrinsic and Java insertion sort. <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40"> <head> <meta name=ProgId content=Excel.Sheet> <meta name=Generator content="Microsoft Excel 15"> <link id=Main-File rel=Main-File href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip.htm"> <link rel=File-List href="file:///C:/Users/sparasa/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml"> </head> <body link="#0563C1" vlink="#954F72"> Benchmark | (builder) | Size | Insertion Sort (Java) | AVX512 bitonic sort | Speedup (before) | AVX512 bitonic + Insertion Sort (C++) | Speedup (now) -- | -- | -- | -- | -- | -- | -- | -- ArraysSort.Long.testSort | RANDOM | 5 | 0.021 | 0.042 | 0.50 | 0.019 | 1.11 ArraysSort.Long.testSort | RANDOM | 10 | 0.037 | 0.061 | 0.61 | 0.031 | 1.19 ArraysSort.Long.testSort | RANDOM | 15 | 0.055 | 0.055 | 1.00 | 0.054 | 1.02 ArraysSort.Long.testSort | RANDOM | 20 | 0.077 | 0.08 | 0.96 | 0.069 | 1.12 ArraysSort.Long.testSort | RANDOM | 25 | 0.102 | 0.079 | 1.29 | 0.079 | 1.29 ArraysSort.Long.testSort | RANDOM | 50 | 0.253 | 0.233 | 1.09 | 0.238 | 1.06 ArraysSort.Long.testSort | RANDOM | 75 | 0.381 | 0.291 | 1.31 | 0.282 | 1.35 </body> </html> Thanks, Vamsi ------------- PR Comment: https://git.openjdk.org/jdk/pull/14227#issuecomment-1712116207