During the past 6 months I have been using GNU sort on a lot of files in the GByte range on my 64-core machine. I have discovered that GNU sort is bad at using all the cores.
I have built parsort (now distributed as part of GNU parallel) as a small wrapper around GNU sort, and it performs 2-3x better than GNU sort for files of this size on this machine.

The first step works great in GNU sort: here you really can use all your cores. But after the first step, it seems GNU sort does a merge sort of all the results, and this pegs a single CPU at 100% while the rest of the cores sit pretty idle. It seems the merging is not parallelized.

parsort does this a little better than GNU sort. Instead of doing a k-way merge at the final step, it does a 2-way merge on each core (a rough sketch of this pairwise scheme is at the end of this message). It still has to do a 2-way merge of all the data on a single core in the end, so it is not even close to optimal.

Wikipedia describes a way of doing a parallel merge: https://en.wikipedia.org/wiki/Merge_sort#Parallel_multiway_merge_sort

I am no C coder, and I respect that we want coreutils to be written in C. Unfortunately that means I cannot really help with the actual coding (at least not in a way that would not introduce a plethora of bugs). But looking at the pseudocode (https://en.wikipedia.org/wiki/Merge_sort#Pseudocode) it does not look that hard to implement. This method requires all the data to fit in memory, but there are even more advanced methods (such as https://www.sciencedirect.com/science/article/pii/S0304397500002875) that do not require this.

Could you consider implementing a parallel merge, so I can retire parsort?

(Also: thanks for maintaining a vital piece of software.)

/Ole
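
PS: To illustrate what I mean by the pairwise strategy, here is a rough C sketch (not parsort's actual code, and merging int arrays rather than lines) of a 2-way merge tree over k sorted runs, using POSIX threads:

  #include <pthread.h>
  #include <stdlib.h>

  struct run { int *a; size_t n; };
  struct merge_job { struct run left, right, out; };

  /* Ordinary 2-way merge of two sorted runs into a newly allocated run.  */
  static void *
  merge_two (void *arg)
  {
    struct merge_job *j = arg;
    size_t i = 0, k = 0, o = 0;
    j->out.n = j->left.n + j->right.n;
    j->out.a = malloc (j->out.n * sizeof *j->out.a);
    while (i < j->left.n && k < j->right.n)
      j->out.a[o++] = (j->left.a[i] <= j->right.a[k]
                       ? j->left.a[i++] : j->right.a[k++]);
    while (i < j->left.n)
      j->out.a[o++] = j->left.a[i++];
    while (k < j->right.n)
      j->out.a[o++] = j->right.a[k++];
    return NULL;
  }

  /* Merge k sorted runs by merging pairs in parallel, round by round.
     Takes ownership of (and frees) the input buffers.  Error checking
     is omitted for brevity.  */
  static struct run
  merge_tree (struct run *runs, size_t k)
  {
    while (k > 1)
      {
        size_t pairs = k / 2;
        struct merge_job *jobs = malloc (pairs * sizeof *jobs);
        pthread_t *tids = malloc (pairs * sizeof *tids);
        for (size_t p = 0; p < pairs; p++)
          {
            jobs[p].left = runs[2 * p];
            jobs[p].right = runs[2 * p + 1];
            pthread_create (&tids[p], NULL, merge_two, &jobs[p]);
          }
        for (size_t p = 0; p < pairs; p++)
          {
            pthread_join (tids[p], NULL);
            free (jobs[p].left.a);
            free (jobs[p].right.a);
            runs[p] = jobs[p].out;
          }
        if (k % 2)                  /* odd run out moves up a round */
          runs[pairs] = runs[k - 1];
        k = pairs + k % 2;
        free (jobs);
        free (tids);
      }
    return runs[0];
  }

Even in this sketch the final round pushes all the data through a single merge_two call on one thread, which is exactly the bottleneck I am describing. As far as I understand, the Wikipedia algorithm avoids this by splitting every run by rank, so that each core merges its own slice of all the runs, in every round including the last.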