Martin,

Martin Schmeing wrote:
> Hi Bob,
> Join works fine with my test smaller files, giving an appropriate
> output. When both files are 1000 (short) lines long, it outputs
> maybe one or two of the joined lines, but there should be hundreds
> output. The files are sorted, and there is no error message given.
> Here are my test files:
> pcmodel.list pcmodel1000.list radmodel.list radmodel1000.list

This one is tricky. At first pass it would seem that everything is in
good shape for join. For example, the input files to join must be
sorted, and unsorted input is a common problem. But these are
obviously sorted. The first thing I did was to check this.

  for f in *.list; do sort -c "$f"; done

No errors from sort. All of the files were sorted. So I tried joining
the larger files.

  join pcmodel1000.list radmodel1000.list
  992 16023 239 3915 2793 43472.2226562 257.2904053
  993 16023 240 4134 2889 44867.9531250 393.2121582

Two lines. What is in these files? The first 15 lines of the first
file show the problem. But it is tricky. In fact I missed it until
this point.

     1 16021 1 834 6525
     2 16021 2 1005 6699
     3 16021 3 1296 6651
     4 16021 4 1380 6594
     5 16021 5 1188 6534
     6 16021 6 1044 6363
     7 16021 7 498 6240
     8 16021 8 357 6405
     9 16021 9 270 5886
    10 16021 10 957 5436
    11 16021 11 1122 6096
    12 16021 12 1506 5865
    13 16021 13 1407 6030
    14 16021 14 1383 5922
    15 16021 15 1533 6045

The first field is lined up with a variable number of spaces in the
first column. That is the root of the issue here. Sort by default
sorts the entire line using the character collating sequence specified
by the LC_COLLATE locale. Join does the same but does so ignoring
blanks at the start of the field. Because of the variable number of
blanks, sort and join are seeing a different sort order for the first
field.

Just last month (Feb 19 2008) James Youngman added a new feature to
join that warns about this case. Using this very recent join the
following diagnostic is printed. Eventually this will make people
aware of this problem much more easily than with older versions of
join.

  join: File 1 is not in sorted order
  join: File 2 is not in sorted order

Knowing this makes it obvious that I used the wrong sort check. What
I should have done was to use -b to skip blanks to match what join is
doing. Or more precisely 'sort -k 1b,1'.
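
The mismatch between the two checks can be reproduced on a tiny
made-up file. This is only a sketch, under some assumptions: the
names demo.list, a.list and b.list and their contents are invented
for illustration (they are not the files from this thread), LC_ALL=C
is forced so the collation is byte-wise and the result does not
depend on the local locale, and mktemp -d is available.

```shell
#!/bin/sh
# Hypothetical demo data; not the files discussed in this thread.
export LC_ALL=C   # assumption: byte-wise collation for reproducible results

tmpdir=$(mktemp -d)
demo="$tmpdir/demo.list"

# Right-align the key in a 4-character column, so the number of
# leading blanks varies with the number of digits.
i=1
while [ "$i" -le 12 ]; do
    printf '%4d row%d\n' "$i" "$i"
    i=$((i + 1))
done > "$demo"

# Whole-line comparison: the leading blanks take part in the
# comparison, so "   9" sorts before "  10".
if sort -c "$demo" 2>/dev/null; then whole_line=sorted; else whole_line=unsorted; fi

# Key comparison with leading blanks skipped: now "10" sorts before "9".
if sort -c -k 1b,1 "$demo" 2>/dev/null; then key_order=sorted; else key_order=unsorted; fi

echo "sort -c:         $whole_line"
echo "sort -c -k 1b,1: $key_order"

# Re-sort in the order join expects, build a second file with the
# same keys, and count the joined lines.
sort -k 1b,1 "$demo" > "$tmpdir/a.list"
sed 's/row/other/' "$tmpdir/a.list" > "$tmpdir/b.list"
joined=$(join "$tmpdir/a.list" "$tmpdir/b.list" | wc -l | tr -d ' ')
echo "joined lines:    $joined"

rm -rf "$tmpdir"
```

Under those assumptions the plain 'sort -c' check should accept the
file while the 'sort -c -k 1b,1' check rejects it, and after
re-sorting with 'sort -k 1b,1' all 12 keys should join.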

Rerunning the check with that key:

  for f in *.list; do sort -c -k 1b,1 "$f"; done
  sort: pcmodel1000.list:10: disorder: 10 16021 10 957 5436
  sort: radmodel1000.list:116: disorder: 1001 44867.9531250 393.2121582

Now the problem is much more apparent. The files need to be sorted in
the order that join expects. Not numerically but lexically, using
'sort -k 1b,1'.

  sort -k 1b,1 -o pcmodel1000.list pcmodel1000.list
  sort -k 1b,1 -o radmodel1000.list radmodel1000.list

  head -n10
  1 16021 1 834 6525
  10 16021 10 957 5436
  100 16021 100 1764 714
  1000 16023 247 4833 3609
  101 16021 101 1932 588
  102 16021 102 2058 501
  103 16021 103 2418 399
  104 16021 104 2256 447
  105 16021 105 1644 849

Looks better for join even if it looks worse for humans. That is the
ordering that is needed for character sorting.

  join pcmodel1000.list radmodel1000.list | wc -l
  115

That looks a little more reasonable.

Hope that explanation helped.

Bob

_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils