Possible bug in 'sort -m'

Bob McGowan Tue, 13 Mar 2007 14:36:08 -0800

I ran into this as a result of working with the SQL UNION operator and trying to then confirm what it did/does by using 'uniq' and 'sort'.


So, for background, I first did:


  select count(from_number) from cross_reference

and

  select count(to_number) from cross_reference

and got 84919 in both cases.  Doubling that gives 169838.

If I then do:

  select from_number from cross_reference
  union
  select to_number from cross_reference

I get 110256 rows. (There are two columns in the table; if one has data then the other must too, hence the equal counts.) This means the union (as expected) is merging the two lists and eliminating duplicate values.
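As a quick check of the arithmetic so far, here is a minimal sketch using only the counts quoted above. The variable names are just for illustration.

```shell
rows=84919            # count(from_number) and count(to_number), as reported above
union_rows=110256     # rows returned by the UNION query, as reported above

total=$((rows * 2))               # both columns taken together
dropped=$((total - union_rows))   # rows the UNION removed as duplicates

echo "combined rows: $total"
echo "removed as duplicates: $dropped"
```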

To confirm, I took all the lines found by each of the individual selects and put them in two files (named quite originally as from_number and to_number ;).


Each file has the expected number of total lines (84919).  I then did:

  sort -n -o from_number from_number

  sort -n -o to_number to_number

Still the same number of lines, only numerically sorted, now.

Then:

  uniq from_number | wc -l
  73609
  uniq to_number | wc -l
  48418

Adding these gives 122027, too big by almost 12000. Ah, I thinks to meself, some of the numbers in the two files can match each other between the files, but are unique in each file. So:


  sort -m from_number to_number | uniq | wc -l
  122010

This is still almost 12000 too big (only 17 fewer than the 'uniq' on the separate files). So, I run this:


  sort -u from_number to_number | wc -l

And I get 110256, the same number as the SQL UNION gave me.
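To illustrate why the per-file 'uniq' counts can add up to more than the combined count, here is a small sketch on made-up data. The values and the temporary directory are assumptions for the example, not the real from_number and to_number files.

```shell
# Toy data only: 2 and 3 appear in both files.
tmp=$(mktemp -d)
printf '%s\n' 1 2 2 3 > "$tmp/from_number"
printf '%s\n' 2 3 3 4 > "$tmp/to_number"

# Distinct values within each file (both inputs are already sorted).
from_unique=$(uniq "$tmp/from_number" | wc -l | tr -d ' ')
to_unique=$(uniq "$tmp/to_number" | wc -l | tr -d ' ')

# Distinct values across both files together.
both_unique=$(sort -u "$tmp/from_number" "$tmp/to_number" | wc -l | tr -d ' ')

echo "sum of per-file counts: $((from_unique + to_unique))"
echo "combined distinct count: $both_unique"
rm -rf "$tmp"
```

The difference between the two printed numbers is the count of values that occur in both files.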

So, if both files are sorted and I then use 'sort -m' followed by 'uniq' and count the results, shouldn't I get the same thing as re-sorting the two (already sorted) files with sort's '-u' option and counting that output?

I did wonder if I needed to use '-n' with the '-m', but that didn't fix anything; in fact, I got a different count: 121995.
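For comparison, here is a minimal sketch of the three pipelines on made-up data. It assumes GNU sort and the C locale, and the file names a and b are hypothetical. It relies on one documented property: 'sort -m' expects its inputs to be already sorted under the same ordering options given to the merge. It is not meant to explain the counts above, where '-n -m' did not match either.

```shell
# Toy data only; these values are made up, not taken from the real files.
export LC_ALL=C                        # plain 'sort' now compares bytes
tmp=$(mktemp -d)
printf '%s\n' 9 10 100 > "$tmp/a"      # already in 'sort -n' order
printf '%s\n' 9 10 11  > "$tmp/b"      # already in 'sort -n' order

# Merge using the same ordering the files were sorted with (-n).
merged_numeric=$(sort -n -m "$tmp/a" "$tmp/b" | uniq | wc -l | tr -d ' ')

# Merge without -n: the merge assumes byte order, which these files are not in.
merged_plain=$(sort -m "$tmp/a" "$tmp/b" | uniq | wc -l | tr -d ' ')

# Full re-sort of both files with duplicate removal.
resorted=$(sort -u "$tmp/a" "$tmp/b" | wc -l | tr -d ' ')

echo "sort -n -m | uniq: $merged_numeric"
echo "sort -m | uniq:    $merged_plain"
echo "sort -u:           $resorted"
rm -rf "$tmp"
```

On this toy input the first and third counts should agree. The second can come out larger, because equal lines are not guaranteed to end up adjacent when the merge order differs from the order the files were sorted in.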

Am I missing something obvious, having to do with numbers and merging? Or is this a bug in sort?


Thanks for your patience with the long post ;}

Bob
