Hello, I'd like to suggest a new feature for sort: the ability to set the buffer size (-S/--buffer-size X) using an environment variable.
In summary:

    $ export SORT_BUFFER_SIZE=20G
    $ someprogram | sort -k1,1 > output.txt
    # sort will use 20G of RAM, as if "--buffer-size 20G" was specified.

The rationale: recent commits improved the guessed buffer size when sort is given an input file, but these improvements don't apply when sort is used as part of a pipeline, with a pipe as input, e.g.

    some | program | sort | other | programs > file

(Tested with v8.19 on Linux 2.6.32: sort consumes only a few MB of RAM, even though many GB are available.) This results in many small temporary files being created.

The script (which uses sort) is not under my direct control, but even if it were, I wouldn't want to hard-code the amount of memory used, so that it stays portable across different servers.

AFAIK, there are four aspects of sort that affect performance:

1. Number of threads: changeable with "--parallel=X" and with the environment variable OMP_NUM_THREADS.
2. Temporary files location: changeable with "--temporary-directory=DIR" and with the environment variable TMPDIR.
3. Memory usage: changeable with "--buffer-size=SIZE" but not with an environment variable.
4. Compression program: changeable with "--compression-program=PROG" but not with an environment variable (I do not address this aspect at the moment).

With the attached patch, sort will read an environment variable named "SORT_BUFFER_SIZE" and treat it as if "--buffer-size" was specified (but only if "--buffer-size" wasn't used on the command line).

If this is conceptually acceptable, I'll prepare a proper patch (with NEWS, help, docs, etc.).

Regards,
 -gordon
From db8f1c319d772c5b13df51894f279c3a7276416e Mon Sep 17 00:00:00 2001
From: "A. Gordon" <[email protected]>
Date: Wed, 29 Aug 2012 16:42:31 -0400
Subject: [PATCH] sort: accept buffer size from environment variable.

---
 src/sort.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 9dbfee1..1505a6d 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -4648,6 +4648,13 @@ main (int argc, char **argv)
       files = &minus;
     }
 
+  if (sort_size == 0)
+    {
+      char const *buffer_size = getenv ("SORT_BUFFER_SIZE");
+      if (buffer_size)
+        specify_sort_size(-1,'S',buffer_size);
+    }
+
   /* Need to re-check that we meet the minimum requirement for memory
      usage with the final value for NMERGE. */
   if (0 < sort_size)
-- 
1.7.9.1
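
For anyone who wants to look at the intended precedence in isolation, here is a minimal standalone sketch of the rule described above: an explicit --buffer-size wins, otherwise SORT_BUFFER_SIZE is used, otherwise sort keeps guessing as it does today. The function names (choose_buffer_size, buffer_size_for_run, choose_buffer_size_selfcheck) are hypothetical and do not exist in sort.c. The sketch also assumes that an empty variable should be treated as unset, whereas the patch above passes the value to specify_sort_size unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of sort.c: return the buffer-size
   string that should take effect.  CMDLINE_VALUE is the argument of
   -S/--buffer-size, or NULL if the option was not given.  ENV_VALUE
   is the value of SORT_BUFFER_SIZE, or NULL if it is unset.  */
char const *
choose_buffer_size (char const *cmdline_value, char const *env_value)
{
  if (cmdline_value)
    return cmdline_value;   /* an explicit option always wins */
  if (env_value && *env_value)
    return env_value;       /* assumption: empty means "unset" */
  return NULL;              /* no override: keep the guessed size */
}

/* Same rule, but reading the real environment.  */
char const *
buffer_size_for_run (char const *cmdline_value)
{
  return choose_buffer_size (cmdline_value, getenv ("SORT_BUFFER_SIZE"));
}

/* Usage example: the three cases from the proposal.  */
void
choose_buffer_size_selfcheck (void)
{
  /* -S given: the environment is ignored.  */
  assert (strcmp (choose_buffer_size ("1G", "20G"), "1G") == 0);
  /* -S not given: the environment variable is used.  */
  assert (strcmp (choose_buffer_size (NULL, "20G"), "20G") == 0);
  /* Neither given: no override.  */
  assert (choose_buffer_size (NULL, NULL) == NULL);
}
```

Returning the string rather than a parsed number keeps the sketch independent of how sort itself parses SIZE suffixes; in the patch that job stays with specify_sort_size.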
