On 10/07/2023 22:26, Robert Leach via GNU coreutils General Discussion wrote:
Hi,

I wanted to ask about the `join` utility in `coreutils` 9.3.  I'm building a 
snakemake workflow and am debugging an error that only occurs when the 
workflow is run on a linux system.  I have narrowed the difference down to the 
`join` utility provided by the `coreutils` conda package.  An error is produced 
on both systems, but since my script had not set `set -euxo pipefail`, the 
error was silent.  On linux, this produced an error in the workflow rule that 
executes after the one that uses the join utility, because the input file was 
empty.

So I manually ran the join command and noticed the difference in behavior on:

macOS:
```
(coreutils) gen-rl-imac[2023-07-10 
17:01:59]:...CT-LOCAL/YURI/ATACC/REPOS/ATACCompendium$ join -1 1 -2 1 -o 
1.1,1.7,2.7 -t '    ' 
.tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv
 
.tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv
Geneid  results/sorted_atac_alignments/SRR17656980_19_60m_end.bam       
results/sorted_atac_alignments/SRR13509617_19_60m_end.bam
peak1   22      28
peak2   1       12
peak3   1072    1637
peak4   457     942
peak5   1086    1507
peak6   169     67
peak7   36      85
peak8   212     198
join: 
.tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv:12:
 is not sorted: peak10     19      39038   39248   .       211     194
join: 
.tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv:12:
 is not sorted: peak10     19      39038   39248   .       211     228
peak9   39      34
peak10  194     228
peak11  2178    2778

...

join: input is not in sorted order
```

and linux:
```
(coreutils) [rleach@argo-comp2 ATACCompendium]$ join -1 1 -2 1 -o 1.1,1.7,2.7 
-t '      ' 
.tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv
 
.tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv
join: 
.tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv:12:
 is not sorted: peak10     19      39038   39248   .       211     194
join: 
.tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv:2:
 is not sorted: Geneid      Chr     Start   End     Strand  Length  
results/sorted_atac_alignments/SRR13509617_19_60m_end.bam
join: input is not in sorted order
```

Is this a bug in either the macOS or linux versions of the coreutils join 
utility, a known issue, or what?

Well, the output from join(1) is giving ample clues
that the input files aren't sorted appropriately.

Details:

The above should be warnings and not impact the exit status of the join process.

The difference in output between Linux and macOS is probably due to locale
settings.
Note how "Geneid" is the first disorder on your Linux system, which suggests
macOS is using the C locale, while your Linux system is using en_US or
equivalent.
So you may get better consistency with the join --header option,
and that may be enough to address all your issues.
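
As a rough illustration of --header, here is a minimal sketch on made-up
data. The two small TSV files, their column names and the variable names are
assumptions for the example, not your real count files. It assumes GNU join,
since --header is a GNU extension.

```shell
# Hypothetical sample data standing in for the real count files.
tab=$(printf '\t')
tmp=$(mktemp -d)
printf 'Geneid\tcount\npeak1\t22\npeak2\t1\n'  > "$tmp/a.tsv"
printf 'Geneid\tcount\npeak1\t28\npeak2\t12\n' > "$tmp/b.tsv"

# --header treats line 1 of each file as a header: the two header lines are
# printed together without being paired by key or checked for ordering.
header_out=$(LC_ALL=C join --header -t "$tab" -o 1.1,1.2,2.2 \
    "$tmp/a.tsv" "$tmp/b.tsv")
printf '%s\n' "$header_out"

rm -r "$tmp"
```

The data lines should come out as `peak1 22 28` and `peak2 1 12`
(tab-separated), preceded by a single combined header line.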

If --header doesn't suffice, you may need to `LC_ALL=C sort -k1.5n` your input
files before passing them to join.
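
A minimal sketch of the sort-then-join idea, again on made-up data (file
names and values are assumptions). Note that it uses a plain bytewise sort on
the join field (`-k1,1`) rather than the numeric key suggested above, because
that is the ordering join itself compares against in the C locale. The point
is only that sort and join run under the same locale.

```shell
# Hypothetical sample data, listed in "natural" numeric order.
tab=$(printf '\t')
tmp=$(mktemp -d)
printf 'peak1\t22\npeak2\t1\npeak10\t194\n'  > "$tmp/a.tsv"
printf 'peak1\t28\npeak2\t12\npeak10\t228\n' > "$tmp/b.tsv"

# Sort both inputs on the join field, bytewise, in the C locale.
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/a.tsv" > "$tmp/a.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$tmp/b.tsv" > "$tmp/b.sorted"

# Run join under the same locale so it agrees with sort about the order.
sorted_out=$(LC_ALL=C join -t "$tab" "$tmp/a.sorted" "$tmp/b.sorted")
sorted_status=$?
printf '%s\n' "$sorted_out"

rm -r "$tmp"
```

With this ordering peak10 lands between peak1 and peak2 in the output, so a
final numeric re-sort may be wanted if the original order matters.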

If that doesn't suffice, you may get the desired operation with the
--nocheck-order option.
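
A small sketch of --nocheck-order, on made-up data (file names and values are
assumptions). It relies on both files listing the same keys in the same
order. join still walks the two files in step, so if the orders differ, lines
can be silently left unpaired.

```shell
# Hypothetical sample data: same keys, same (numeric) order in both files.
tab=$(printf '\t')
tmp=$(mktemp -d)
printf 'peak1\t22\npeak2\t1\npeak10\t194\n'  > "$tmp/a.tsv"
printf 'peak1\t28\npeak2\t12\npeak10\t228\n' > "$tmp/b.tsv"

# --nocheck-order disables the sortedness check; lines are paired as they
# are encountered, so the file order is preserved in the output.
nocheck_out=$(LC_ALL=C join --nocheck-order -t "$tab" \
    "$tmp/a.tsv" "$tmp/b.tsv")
nocheck_status=$?
printf '%s\n' "$nocheck_out"

rm -r "$tmp"
```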

cheers,
Pádraig
