I noticed a large speed-up after changing LC_ALL from “C.UTF-8” to “C”:
the run time dropped from 22s to 1.3s.  This is huge, and should not be
ignored.

My setup: I call sort in a Dockerfile derived from the official
“python” base image, which sets LANG to “C.UTF-8”.  So my setup is
not exotic.
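
A minimal sketch of what such a per-command override can look like,
assuming GNU sort and a POSIX shell.  The function name sort_c and the
sample input are hypothetical.  In a Dockerfile, the same LC_ALL=C
prefix would go inside the RUN instruction that calls sort, leaving the
image-wide LANG untouched.

```shell
# Hypothetical helper: run sort in the C locale for this one command only.
# The surrounding environment (e.g. LANG=C.UTF-8 from the base image)
# is left unchanged.
sort_c() {
    LC_ALL=C sort "$@"
}

# In the C locale, lines are compared byte by byte, so uppercase ASCII
# letters sort before lowercase ones.
printf 'b\nB\na\n' | sort_c
```

Prefixing both variants with bash's time keyword (for example
"time LC_ALL=C sort file > /dev/null", where "file" is a placeholder)
is one way to compare run times on a given input.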

While I don’t really understand why UTF-8 encoding has that much impact
on sorting performance, there may well be a reason for it.  However, in
my opinion this should be mentioned in the documentation.  Something
like “Note that any locale other than plain C may have a significant
impact on sorting performance” should appear somewhere in the man and
info pages.

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to coreutils in Ubuntu.
https://bugs.launchpad.net/bugs/846628

Title:
  gnu sort extremely slow in non C locale

Status in coreutils package in Ubuntu:
  Confirmed

Bug description:
  I tried sorting an ASCII file of about 300 MB and 8 million lines
  with GNU sort and it was taking forever.

  After 10 minutes I stopped it. I tried another sort program and it
  finished in about 40 seconds.

  I then took the output of that second sort and checked it with GNU
  sort, which reported that some lines were out of order.

  The following lines:
  ....bbbbbbbbbwbbwwbwwwwwww.ww...1
  ....bbbbbbbbbwbbwwbwwwwwwwww....0
  ....bbbbbbbbbwbwwbwbwwwww.ww..w.1

  But they are not out of order, as far as I can tell. Then I thought
  the problem was the locale. Indeed, my locale was set to:
  LANG=en_CA.UTF-8

  Setting it to:
  LANG=C

  made GNU sort both finish the sort in 40 seconds and confirm the
  proper order.
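
The order check can be reproduced on just the three quoted lines. This
is a sketch assuming GNU sort, where "sort -c" exits with status 0 when
its input is already sorted. Compared byte by byte, "." (0x2E) sorts
before "b" and "w", so the lines are in order under LC_ALL=C. Locales
such as en_CA.UTF-8 typically give punctuation a low weight when
collating, which could explain the different verdict, but that depends
on the locale data installed on the system.

```shell
# The three lines quoted in the report, in the order they are quoted.
lines='....bbbbbbbbbwbbwwbwwwwwww.ww...1
....bbbbbbbbbwbbwwbwwwwwwwww....0
....bbbbbbbbbwbwwbwbwwwww.ww..w.1'

# sort -c only checks the order: it exits 0 if the input is sorted
# and non-zero (with a diagnostic on stderr) otherwise.
if printf '%s\n' "$lines" | LC_ALL=C sort -c; then
    c_locale_verdict='in order'
else
    c_locale_verdict='out of order'
fi
echo "LC_ALL=C: $c_locale_verdict"
```

Replacing LC_ALL=C with another installed locale name in the same
command shows how that locale judges the same three lines.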

  Since the file is 100% ASCII (it only has the 6 characters ".01bw\n"),
  I think it is a bug that the locale makes any difference.
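
The claim that a file contains only those six characters can be checked
mechanically. A sketch, assuming a POSIX tr and wc; the function name
count_other_bytes and the sample inputs are hypothetical:

```shell
# Hypothetical helper: count the bytes on stdin that are NOT one of
# . 0 1 b w or newline. tr -d deletes the allowed characters and
# wc -c counts whatever is left; the final tr strips any padding
# that some wc implementations add.
count_other_bytes() {
    LC_ALL=C tr -d '.01bw\n' | wc -c | tr -d ' '
}

printf '..bw01\n.w.b\n' | count_other_bytes   # only allowed characters
printf '..bw01x\n' | count_other_bytes        # one stray byte: x
```

On a real file this would be "count_other_bytes < file" (with "file" as
a placeholder); a result of 0 means nothing but those six characters is
present.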

  Best regards,
  Bijan

  ProblemType: Bug
  DistroRelease: Ubuntu 11.04
  Package: coreutils 8.5-1ubuntu6
  ProcVersionSignature: Ubuntu 2.6.38-11.48-generic 2.6.38.8
  Uname: Linux 2.6.38-11-generic i686
  Architecture: i386
  Date: Sat Sep 10 15:59:07 2011
  InstallationMedia: Ubuntu 11.04 "Natty Narwhal" - Release i386 (20110427.1)
  ProcEnviron:
   LANGUAGE=en_CA:en
   LANG=en_CA.UTF-8
   SHELL=/bin/bash
  SourcePackage: coreutils
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/coreutils/+bug/846628/+subscriptions

-- 
Mailing list: https://launchpad.net/~touch-packages
Post to     : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp
