uniq: field separator for -c option

Roni Kallio Wed, 07 Oct 2020 17:14:24 -0700

Hi all,

Recently I've been doing a lot of data processing.  Usually this
involves first transforming the source material into some
semi-structured format like CSV or TSV, after which I can start
gathering information about the data.  Both of these steps can be
achieved quite effectively using the programs packaged in coreutils.
For the times when more complex transformations are required, I usually
end up using Python and the plethora of third-party libraries available.
Lastly, I might plot the information into charts using gnuplot or
matplotlib.


Working with delimiter-separated records is easy in most cases, as most
programs in coreutils allow the user to define the delimiter character.
One exception is the program `uniq' and its -c option, which prefixes
lines by the number of occurrences.  It doesn't allow defining the
output delimiter used between the count and the line itself, and always
uses a space instead.

Take this excerpt from a file as an example; each record contains
fields for an address, a country, and a continent:

[...]
112.85.42.89,China,Asia
111.229.139.95,China,Asia
111.202.211.10,China,Asia
143.137.9.165,Brazil,South America
110.43.50.229,China,Asia
13.67.33.9,Singapore,Asia
79.8.196.108,Italy,Europe
104.248.244.119,Germany,Europe
106.12.31.186,United States,North America
156.67.217.63,Singapore,Asia
[...]

We might want to get the count of occurrences of each country,continent
pair.  To achieve this we filter out the addresses with cut, sort the
lines and pass the result to uniq, which then counts the occurrences:

< data.csv cut -d, -f2,3 | sort | uniq -c

This results in the following output:

      1 Brazil,South America
      4 China,Asia
      1 Germany,Europe
      1 Italy,Europe
      2 Singapore,Asia
      1 United States,North America

This output still needs post-processing, since the count and the
country are now effectively part of the same field, so we use the sed
command:

sed 's/\( \{0,6\}[[:digit:]]\+\) /\1,/' **

to get them into separate fields:

      1,Brazil,South America
      4,China,Asia
      1,Germany,Europe
      1,Italy,Europe
      2,Singapore,Asia
      1,United States,North America

The result is valid comma-separated records, on which we can then run
additional transformations.  Counting occurrences is quite common, at
least for me, so this kind of post-processing has to be done quite often
as well.
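
As a rough sketch of how this recurring step could be wrapped up for
reuse, a small shell function might look like the following.  The name
`count_sep' is made up for illustration, and the function assumes the
`uniq -c' layout shown above (padding, count, exactly one space, then
the line).

```shell
# Hypothetical helper: run `uniq -c' on stdin and replace the single space
# after the count with the separator given as $1 (a comma by default).
count_sep() {
    uniq -c | awk -v sep="${1:-,}" '
        match($0, /^ *[0-9]+ /) {
            # RLENGTH covers the padding, the digits and the one
            # separating space, so drop that last character.
            printf "%s%s%s\n", substr($0, 1, RLENGTH - 1), sep, substr($0, RLENGTH + 1)
            next
        }
        { print }   # pass through anything that does not look like a count
    '
}

# Possible usage:
#   < data.csv cut -d, -f2,3 | sort | count_sep ,
```

One caveat: awk interprets backslash escapes in values passed with -v,
so a separator written as '\t' reaches the script as a tab.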

I propose the addition of an option named `--count-separator' or similar
to the uniq command, to allow setting a user-defined separator
character. This separator character would be inserted between the
occurrence count and the line, and it would require the `-c' option to
also be used. Here is an example of its usage:

< data.csv cut -d, -f2,3 | sort | uniq -c --count-separator=,

The output would be equal to the output of the following program, which
I demonstrated earlier:

< data.csv cut -d, -f2,3 | sort | uniq -c | \
        sed 's/\( \{0,6\}[[:digit:]]\+\) /\1,/'

This would remove the need for an additional post-processing step
entirely, and bring the uniq program more in line with other
text-processing utilities in coreutils that already allow setting custom
field separators.

I have hacked on the implementation already, and attached a patch for
you to experiment with and give feedback on.

--
Roni Kallio

** The sed command I use also removes leading spaces, but that isn't a
   necessary step so I left it out.
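
   A variant that also strips the leading padding might look like the
   sketch below.  It sticks to POSIX basic regular expressions
   ([0-9][0-9]* rather than the GNU \+ extension) and assumes the same
   `uniq -c' layout; `strip_count' is a made-up name.

```shell
# Sketch: drop the leading spaces and turn the one space after the
# count into a comma.  Lines without a leading count pass through as-is.
strip_count() {
    sed 's/^ *\([0-9][0-9]*\) /\1,/'
}

# Possible usage:
#   < data.csv cut -d, -f2,3 | sort | uniq -c | strip_count
```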

From 6b43c1d0ea6faa460f77555eeeb74e0523e2f3a0 Mon Sep 17 00:00:00 2001
From: Roni Kallio <r...@kallio.app>
Date: Thu, 8 Oct 2020 01:33:38 +0300
Subject: [PATCH] uniq: --count-separator option

Add a --count-separator option that sets the separator written between
the occurrence count and the line content when the -c option is enabled.
---
 src/uniq.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/src/uniq.c b/src/uniq.c
index e0247579b..2e9d83e9d 100644
--- a/src/uniq.c
+++ b/src/uniq.c
@@ -79,6 +79,10 @@ static bool output_later_repeated;
 /* If true, ignore case when comparing.  */
 static bool ignore_case;
 
+/* Separator character used between output lines and the number of times they
+   occur in the input. */
+static char count_separator = ' ';
+
 enum delimit_method
 {
   /* No delimiters output.  --all-repeated[=none] */
@@ -136,12 +140,14 @@ static enum grouping_method grouping = GM_NONE;
 
 enum
 {
-  GROUP_OPTION = CHAR_MAX + 1
+  GROUP_OPTION = CHAR_MAX + 1,
+  COUNT_SEPARATOR_OPTION
 };
 
 static struct option const longopts[] =
 {
   {"count", no_argument, NULL, 'c'},
+  {"count-separator", required_argument, NULL, COUNT_SEPARATOR_OPTION},
   {"repeated", no_argument, NULL, 'd'},
   {"all-repeated", optional_argument, NULL, 'D'},
   {"group", optional_argument, NULL, GROUP_OPTION},
@@ -178,6 +184,9 @@ With no options, matching lines are merged to the first occurrence.\n\
 
      fputs (_("\
   -c, --count           prefix lines by the number of occurrences\n\
+      --count-separator=CHAR   separator character used between output lines\n\
+                                 and the number of times they occur in the\n\
+                                 input\n\
   -d, --repeated        only print duplicate lines, one for each group\n\
 "), stdout);
      fputs (_("\
@@ -308,7 +317,7 @@ writeline (struct linebuffer const *line,
     return;
 
   if (countmode == count_occurrences)
-    printf ("%7" PRIuMAX " ", linecount + 1);
+    printf ("%7" PRIuMAX "%c", linecount + 1, count_separator);
 
   fwrite (line->buffer, sizeof (char), line->length, stdout);
 }
@@ -483,6 +492,7 @@ main (int argc, char **argv)
   char const *file[2];
   char delimiter = '\n';	/* change with --zero-terminated, -z */
   bool output_option_used = false;   /* if true, one of -u/-d/-D/-c was used */
+  bool count_separator_used = false; /* true if --count-separator was used */
 
   file[0] = file[1] = "-";
   initialize_main (&argc, &argv);
@@ -594,6 +604,18 @@ main (int argc, char **argv)
                                   grouping_method_map);
           break;
 
+        case COUNT_SEPARATOR_OPTION:
+          /* Separator for count and line.  Interpret --count-separator=''
+             to mean 'use the NUL byte as the separator'.  */
+          if (optarg[0] != '\0' && optarg[1] != '\0')
+            {
+              error (0, 0, _("the separator must be a single character"));
+              usage (EXIT_FAILURE);
+            }
+          count_separator = optarg[0];
+          count_separator_used = true;
+          break;
+
         case 'f':
           skip_field_option_type = SFO_NEW;
           skip_fields = size_opt (optarg,
@@ -656,6 +678,12 @@ main (int argc, char **argv)
       usage (EXIT_FAILURE);
     }
 
+  if (count_separator_used && countmode == count_none)
+    {
+      error (0, 0, _("--count-separator requires -c"));
+      usage (EXIT_FAILURE);
+    }
+
   check_file (file[0], file[1], delimiter);
 
   return EXIT_SUCCESS;
-- 
2.26.2
