Changing LC_ALL from “C.UTF-8” to “C” cut my sort run time from 22s to 1.3s. This is a huge difference and should not be ignored.
My setup: I call sort in a Dockerfile derived from the official “python” base image, which sets LANG to “C.UTF-8”, so my setup is not exotic. I don’t really understand why UTF-8 encoding has that much impact on sorting performance, but it may well be so. In my opinion, however, this should be mentioned in the documentation. Something like “Note that any locale other than plain C may have a significant impact on sorting performance” should appear somewhere in the man and info pages.

--
You received this bug notification because you are a member of Ubuntu Touch seeded packages, which is subscribed to coreutils in Ubuntu.
https://bugs.launchpad.net/bugs/846628

Title:
  gnu sort extremely slow in non C locale

Status in coreutils package in Ubuntu:
  Confirmed

Bug description:
  I tried sorting an ASCII file of about 300 MB and 8 million lines with GNU sort, and it was taking forever. After 10 minutes I stopped it. I tried another sort program, and it finished in about 40 seconds.

  I then took the output of that second sort and checked it with GNU sort, which reported that some lines were out of order, namely the following lines:

  ....bbbbbbbbbwbbwwbwwwwwww.ww...1
  ....bbbbbbbbbwbbwwbwwwwwwwww....0
  ....bbbbbbbbbwbwwbwbwwwww.ww..w.1

  But as far as I can tell, they are not out of order.

  Then I thought the problem was the locale. Indeed, my locale was set to:

  LANG=en_CA.UTF-8

  Setting it to:

  LANG=C

  made GNU sort both finish the sort in 40 seconds and confirm the proper order.

  Since the file is 100% ASCII (it contains only the 6 characters ".01bw\n"), I think it is a bug that the locale makes any difference.

  Best regards,
  Bijan

  ProblemType: Bug
  DistroRelease: Ubuntu 11.04
  Package: coreutils 8.5-1ubuntu6
  ProcVersionSignature: Ubuntu 2.6.38-11.48-generic 2.6.38.8
  Uname: Linux 2.6.38-11-generic i686
  Architecture: i386
  Date: Sat Sep 10 15:59:07 2011
  InstallationMedia: Ubuntu 11.04 "Natty Narwhal" - Release i386 (20110427.1)
  ProcEnviron:
   LANGUAGE=en_CA:en
   LANG=en_CA.UTF-8
   SHELL=/bin/bash
  SourcePackage: coreutils
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/846628/+subscriptions

--
Mailing list: https://launchpad.net/~touch-packages
Post to     : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp
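
For anyone who wants to check the ordering claim from the bug description without involving sort at all, here is a minimal Python sketch. It compares the three quoted lines in plain byte order, which is what the C locale uses. The second key, `ignore_punctuation_key`, is a hypothetical simplification of my own and not the actual en_CA.UTF-8 collation: it assumes the locale skips "." on the first comparison pass, which is my understanding of how glibc locales of that era behaved. The helper names are made up for illustration.

```python
# The three lines quoted in the bug description, in the order reported.
LINES = [
    "....bbbbbbbbbwbbwwbwwwwwww.ww...1",
    "....bbbbbbbbbwbbwwbwwwwwwwww....0",
    "....bbbbbbbbbwbwwbwbwwwww.ww..w.1",
]


def is_sorted(lines, key):
    """Return True if the lines are in non-decreasing order under key."""
    keys = [key(line) for line in lines]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def byte_key(line):
    """Plain byte order, as used by sort under LC_ALL=C."""
    return line.encode("ascii")


def ignore_punctuation_key(line):
    """ASSUMPTION: a rough stand-in for locale collation that ignores '.'
    on the first pass and falls back to the full line only to break ties.
    This is not the real en_CA.UTF-8 collation table."""
    return (line.replace(".", ""), line)


print(is_sorted(LINES, byte_key))                # True
print(is_sorted(LINES, ignore_punctuation_key))  # False
```

In byte order, "." (0x2E) sorts before "w" (0x77), so the first line precedes the second. With the dots removed, the first two lines differ only in the trailing digit, and "1" sorts after "0", so the same pair looks out of order. If the assumption holds, that would be consistent with what the reporter saw, and it also hints at why locale-aware comparison costs more than a byte-wise one.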