On 10/07/2023 22:26, Robert Leach via GNU coreutils General Discussion wrote:
Hi,
I wanted to ask about the `join` utility in `coreutils` 9.3. I'm building a
snakemake workflow and am debugging an error that only occurs when the
workflow is run on a linux system. I have narrowed the difference down to the
`join` utility provided by the `coreutils` conda package. An error is produced
on both systems, but since my script had not set `set -euxo pipefail`, the
error was silent. On linux, this produced an error in the workflow rule that
executes after the one that uses the join utility, because the input file was
empty.
So I manually ran the join command and noticed the difference in behavior on:
macOS:
```
(coreutils) gen-rl-imac[2023-07-10
17:01:59]:...CT-LOCAL/YURI/ATACC/REPOS/ATACCompendium$ join -1 1 -2 1 -o
1.1,1.7,2.7 -t ' '
.tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv
.tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv
Geneid results/sorted_atac_alignments/SRR17656980_19_60m_end.bam
results/sorted_atac_alignments/SRR13509617_19_60m_end.bam
peak1 22 28
peak2 1 12
peak3 1072 1637
peak4 457 942
peak5 1086 1507
peak6 169 67
peak7 36 85
peak8 212 198
join:
.tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv:12:
is not sorted: peak10 19 39038 39248 . 211 194
join:
.tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv:12:
is not sorted: peak10 19 39038 39248 . 211 228
peak9 39 34
peak10 194 228
peak11 2178 2778
...
join: input is not in sorted order
```
and linux:
```
(coreutils) [rleach@argo-comp2 ATACCompendium]$ join -1 1 -2 1 -o 1.1,1.7,2.7
-t ' '
.tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv
.tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv
join:
.tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv:12:
is not sorted: peak10 19 39038 39248 . 211 194
join:
.tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv:2:
is not sorted: Geneid Chr Start End Strand Length
results/sorted_atac_alignments/SRR13509617_19_60m_end.bam
join: input is not in sorted order
```
Is this a bug in either the macOS or linux versions of the coreutils join
utility, a known issue, or what?
Well, the output from join(1) gives ample clues that the input files
aren't sorted appropriately.
Details:
The above should be warnings and not impact the exit status of the join process.
The difference in output between Linux and macOS is probably due to locale
settings. Note how "Geneid" is the first disorder on your Linux system, which
suggests macOS is using the C locale, while your Linux system is using en_US
or equivalent.
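To illustrate why line 12 is flagged, here is a minimal sketch with made-up keys in a scratch file (the file name and contents are illustrative, not taken from your data). In the C locale, keys compare byte by byte, so "peak10" sorts before "peak9", and `sort -c` reports a numerically ordered file as unsorted:

```shell
# Scratch file with hypothetical keys in numeric order, as in the report above.
tmp=$(mktemp -d)
printf 'peak8\npeak9\npeak10\npeak11\n' > "$tmp/keys.txt"

# sort -c only checks the order; it exits non-zero at the first disorder.
if LC_ALL=C sort -c "$tmp/keys.txt" 2>/dev/null; then
    c_order=sorted
else
    c_order=unsorted
fi
echo "C locale check: $c_order"

# The order the C locale expects for these keys.
expected=$(LC_ALL=C sort "$tmp/keys.txt")
printf '%s\n' "$expected"
```

Running the same check without `LC_ALL=C` on each machine should show whether the two systems disagree about the order.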
So you may get better consistency with the join --header option,
and that may be enough to address all your issues.
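A minimal sketch of the --header suggestion, using two made-up TSV files (the names `a.tsv` and `b.tsv` and the column labels are hypothetical). It assumes GNU join, since --header is not in POSIX, and keys that are already in C-locale order:

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')

# Hypothetical inputs: one header line, then keys already in C-locale order.
printf 'Geneid\tcountA\npeak1\t22\npeak10\t194\npeak2\t1\n' > "$tmp/a.tsv"
printf 'Geneid\tcountB\npeak1\t28\npeak10\t228\npeak2\t12\n' > "$tmp/b.tsv"

# --header prints the first line of each file as a header without trying to pair it,
# so the header takes no part in the order check.
joined=$(LC_ALL=C join --header -t "$tab" -o 1.1,1.2,2.2 "$tmp/a.tsv" "$tmp/b.tsv")
printf '%s\n' "$joined"
```

Setting `LC_ALL=C` on the join command itself keeps the comparison the same on both systems.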
If --header doesn't suffice, you may need to `LC_ALL=C sort -k1.5n` your input
files before passing them to join.
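One way to sort the body while leaving the header line in place is sketched below. The file names and the `sort_keep_header` helper are hypothetical. The key choice is also an assumption: the sketch sorts field 1 as a plain string (`-k1,1`), because join compares keys as strings under the active locale, so adjust the key to suit your data.

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')

# Hypothetical inputs with keys in numeric order, which join reports as unsorted.
printf 'Geneid\tcountA\npeak2\t1\npeak9\t39\npeak10\t194\n' > "$tmp/a.tsv"
printf 'Geneid\tcountB\npeak2\t12\npeak9\t34\npeak10\t228\n' > "$tmp/b.tsv"

# Hypothetical helper: pass the header through, sort the rest on field 1 as a string.
sort_keep_header() {
    head -n 1 "$1"
    tail -n +2 "$1" | LC_ALL=C sort -t "$tab" -k1,1
}

sort_keep_header "$tmp/a.tsv" > "$tmp/a.sorted.tsv"
sort_keep_header "$tmp/b.tsv" > "$tmp/b.sorted.tsv"

# --check-order makes join fail loudly if the sorted files are still out of order.
joined=$(LC_ALL=C join --check-order --header -t "$tab" -o 1.1,1.2,2.2 \
    "$tmp/a.sorted.tsv" "$tmp/b.sorted.tsv")
status=$?
printf '%s\n' "$joined"
```

The same locale is used for both sort and join here, which is what keeps the two tools in agreement about the order.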
If that doesn't suffice, you may get the desired operation with the
--nocheck-order option.
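A rough sketch of the --nocheck-order route, with hypothetical file names and a few values borrowed from the output above. It assumes both files list the same keys in the same order; if the orders differ, join can silently drop lines it fails to pair.

```shell
tmp=$(mktemp -d)
tab=$(printf '\t')

# Hypothetical inputs: same keys, same (numeric) order in both files.
printf 'peak8\t212\npeak9\t39\npeak10\t194\n' > "$tmp/a.tsv"
printf 'peak8\t198\npeak9\t34\npeak10\t228\n' > "$tmp/b.tsv"

# With the order check disabled, join pairs the lines as it walks both files in step.
joined=$(join --nocheck-order -t "$tab" "$tmp/a.tsv" "$tmp/b.tsv")
printf '%s\n' "$joined"
```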
cheers,
Pádraig