Hi all,

Recently I've been doing a lot of data processing. Usually this involves first transforming the source material into some semi-structured format like CSV or TSV, after which I can start gathering information about the data. Both of these steps can be achieved quite effectively using the programs packaged in coreutils. When more complex transformations are required, I usually end up using Python and the plethora of third-party libraries available. Lastly, I might plot the information into charts using gnuplot or matplotlib.

Working with delimiter-separated records is easy in most cases, as most programs in coreutils allow the user to define the delimiter character. One exception to this rule is the program `uniq' and its -c option, which prefixes each line with its number of occurrences. It always writes a space between the count and the line, and offers no way to choose a different output delimiter.

Take this excerpt from a file as an example. Each record contains a field for an address, a country, and a continent:

    [...]
    112.85.42.89,China,Asia
    111.229.139.95,China,Asia
    111.202.211.10,China,Asia
    143.137.9.165,Brazil,South America
    110.43.50.229,China,Asia
    13.67.33.9,Singapore,Asia
    79.8.196.108,Italy,Europe
    104.248.244.119,Germany,Europe
    106.12.31.186,United States,North America
    156.67.217.63,Singapore,Asia
    [...]

We might want to get the count of occurrences of each country,continent pair. To achieve this we filter out the addresses with cut, sort the lines, and pass the result to uniq, which then counts the occurrences:

    < data.csv cut -d, -f2,3 | sort | uniq -c

Resulting in the following output:

          1 Brazil,South America
          4 China,Asia
          1 Germany,Europe
          1 Italy,Europe
          2 Singapore,Asia
          1 United States,North America

This of course needs post-processing, since the count and the country are now effectively part of the same field. So we use the sed command (**):

    sed 's/\( \{0,6\}[[:digit:]]\+\) /\1,/'

to get them into separate fields:

          1,Brazil,South America
          4,China,Asia
          1,Germany,Europe
          1,Italy,Europe
          2,Singapore,Asia
          1,United States,North America

The result is valid comma-separated records, which we can then run additional transformations on.

Counting occurrences is quite common, at least for me, so this kind of post-processing has to be done quite often as well. I propose the addition of an option named `--count-separator' or similar to the uniq command, to allow setting a user-defined separator character. This separator character would be inserted between the occurrence count and the line, and it would require the `-c' option to also be used. Here is an example of its usage:

    < data.csv cut -d, -f2,3 | sort | uniq -c --count-separator=,

The output would be equal to the output of the following pipeline, which I demonstrated earlier:

    < data.csv cut -d, -f2,3 | sort | uniq -c | \
        sed 's/\( \{0,6\}[[:digit:]]\+\) /\1,/'

This would remove the need for an additional post-processing step entirely, and bring the uniq program more in line with other text-processing utilities in coreutils that already allow setting custom field separators.

I have hacked on the implementation already, and attached a patch for you to experiment with and give feedback on.

--
Roni Kallio

** The sed command I use also removes leading spaces, but that isn't a necessary step so I left it out.

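Until something like the proposed option exists, the post-processing step above can be wrapped in a small shell function. This is only a sketch: the name `count_sep' and its one-argument interface are made up for illustration and are not part of coreutils, and it assumes the separator contains nothing special to sed's replacement text (such as `&', `/' or `\').

```shell
#!/bin/sh
# count_sep SEP: hypothetical helper, not part of coreutils.
# Reads lines on stdin and prints each distinct line prefixed by its
# number of occurrences, with SEP (default: comma) in place of the
# single space that `uniq -c' writes after the count.
count_sep () {
    sep=${1:-,}
    # LC_ALL=C gives a stable, byte-wise sort order.
    LC_ALL=C sort | uniq -c |
        # Keep the padding and the digits; replace only the first
        # space that follows them.
        sed "s/^\( *[0-9][0-9]*\) /\1$sep/"
}
```

With the example file from above it would be used as:

    < data.csv cut -d, -f2,3 | count_sep ,
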
From 6b43c1d0ea6faa460f77555eeeb74e0523e2f3a0 Mon Sep 17 00:00:00 2001
From: Roni Kallio <r...@kallio.app>
Date: Thu, 8 Oct 2020 01:33:38 +0300
Subject: [PATCH] uniq: --count-separator option

Add a count-separator option, that allows us to set the separator used
between occurrences and line content, when the -c option is enabled.
---
 src/uniq.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/src/uniq.c b/src/uniq.c
index e0247579b..2e9d83e9d 100644
--- a/src/uniq.c
+++ b/src/uniq.c
@@ -79,6 +79,10 @@ static bool output_later_repeated;
 /* If true, ignore case when comparing.  */
 static bool ignore_case;
 
+/* Separator character used between output lines and the number of times they
+   occur in the input. */
+static char count_separator = ' ';
+
 enum delimit_method
 {
   /* No delimiters output.  --all-repeated[=none] */
@@ -136,12 +140,14 @@ static enum grouping_method grouping = GM_NONE;
 
 enum
 {
-  GROUP_OPTION = CHAR_MAX + 1
+  GROUP_OPTION = CHAR_MAX + 1,
+  COUNT_SEPARATOR_OPTION
 };
 
 static struct option const longopts[] =
 {
   {"count", no_argument, NULL, 'c'},
+  {"count-separator", required_argument, NULL, COUNT_SEPARATOR_OPTION},
   {"repeated", no_argument, NULL, 'd'},
   {"all-repeated", optional_argument, NULL, 'D'},
   {"group", optional_argument, NULL, GROUP_OPTION},
@@ -178,6 +184,9 @@ With no options, matching lines are merged to the first occurrence.\n\
 
      fputs (_("\
   -c, --count           prefix lines by the number of occurrences\n\
+      --count-separator=CHAR  separator character used between output lines\n\
+                        and the number of times they occur in the\n\
+                        input \n\
   -d, --repeated        only print duplicate lines, one for each group\n\
 "), stdout);
      fputs (_("\
@@ -308,7 +317,7 @@ writeline (struct linebuffer const *line,
     return;
 
   if (countmode == count_occurrences)
-    printf ("%7" PRIuMAX " ", linecount + 1);
+    printf ("%7" PRIuMAX "%c", linecount + 1, count_separator);
 
   fwrite (line->buffer, sizeof (char), line->length, stdout);
 }
@@ -483,6 +492,7 @@ main (int argc, char **argv)
   char const *file[2];
   char delimiter = '\n';	/* change with --zero-terminated, -z */
   bool output_option_used = false;   /* if true, one of -u/-d/-D/-c was used */
+  bool count_separator_used = false; /* true if --count-separator was used */
 
   file[0] = file[1] = "-";
   initialize_main (&argc, &argv);
@@ -594,6 +604,18 @@ main (int argc, char **argv)
                                   grouping_method_map);
           break;
 
+        case COUNT_SEPARATOR_OPTION:
+          /* Separator for count and line. Interpret --count-separator='' to
+             mean 'use NUL byte as * delimiter.'  */
+          if (optarg[0] != '\0' && optarg[1] != '\0')
+            {
+              error (0, 0, _("The separator must be a single character"));
+              usage (EXIT_FAILURE);
+            }
+          count_separator = optarg[0];
+          count_separator_used = true;
+          break;
+
         case 'f':
           skip_field_option_type = SFO_NEW;
           skip_fields = size_opt (optarg,
@@ -656,6 +678,12 @@ main (int argc, char **argv)
       usage (EXIT_FAILURE);
     }
 
+  if (count_separator_used && countmode == count_none)
+    {
+      error (0, 0, _("--count-separator requires -c"));
+      usage (EXIT_FAILURE);
+    }
+
   check_file (file[0], file[1], delimiter);
 
   return EXIT_SUCCESS;
--
2.26.2
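
To get a feel for the proposed output format without rebuilding coreutils, the behaviour of the patch can be approximated with awk. This is a rough sketch and not the patch itself: `uniq_count_sep' is a made-up name, it covers only the plain -c case, it assumes the separator is not a backslash (awk's -v option would interpret it as an escape), and unlike the patch it rejects an empty separator rather than emitting a NUL byte.

```shell
#!/bin/sh
# uniq_count_sep SEP: hypothetical stand-in for the proposed
# `uniq -c --count-separator=SEP'. Like uniq, it expects sorted input
# on stdin and only merges adjacent duplicate lines.
uniq_count_sep () {
    # Same rule as the patch: the separator must be a single character.
    # (Assumption: an empty separator is rejected here, whereas the patch
    # would treat it as a NUL byte.)
    if [ "${#1}" -ne 1 ]; then
        echo "uniq_count_sep: the separator must be a single character" >&2
        return 1
    fi
    awk -v sep="$1" '
        # Print the pending group the way the patched writeline would:
        # count right-aligned in 7 columns, then the separator, then the line.
        function emit() { if (n > 0) printf "%7d%s%s\n", n, sep, prev }
        # Appending "" forces a string comparison, so "1" and "1.0" differ.
        NR > 1 && ($0 "") == (prev "") { n++; next }
        { emit(); prev = $0; n = 1 }
        END { emit() }
    '
}
```

For the example from the proposal this would be invoked as:

    < data.csv cut -d, -f2,3 | sort | uniq_count_sep ,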