Thanks! That is very helpful in that it confirms my sanity. I didn't know about the locale stuff.
I have more info, as yesterday, I'd worked on getting around this issue (but hadn't found a complete solution yet). I have it all working consistently now. There are some interesting things I noted that may help explain things, since it *may*(?) not be entirely attributable to the sort strategies... - It seemed as if the sort error was fatal on linux, as it never output anything, though when I added `set euxo pipefail`, join on both systems output nothing - and join on both systems exited non-zero (according to the command echo output). - Adding `--nocheck-order` then allowed both OS's join to generate output and exit successfully, but the output on linux unexpectedly included the comment at the top of the file (which wasn't included in the macOS output, and caused problems downstream of the join command). Then once I added a `grep -v "#"` to the input files that join consumed, join on both systems started generating the same output. The upstream command (featureCounts) was reliably generating the same order output with the same number of lines, so we didn't really need the order check, otherwise, I would have added the locale setting you suggested - but it's definitely good to know. As a sample, here is the top of one of the input files that was going into the join command, and all input files only differed by column 7: ``` # Program:featureCounts v2.0.6; Command:"featureCounts" "-p" "-T" "4" "-g" "peak" "-t" "peak" "-F" "GTF" "--fracOverlap" "0.3" "-a" "results/all_atac_peaks.gtf" "-o" "results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv" "results/sorted_atac_alignments/SRR17656980_19_60m_end.bam" Geneid Chr Start End Strand Length results/sorted_atac_alignments/SRR17656980_19_60m_end.bam peak1 19 9991 10388 . 398 22 peak2 19 13446 13578 . 133 1 peak3 19 18765 19285 . 521 1072 peak4 19 19695 19840 . 146 457 peak5 19 24499 24928 . 430 1086 peak6 19 33839 34320 . 482 169 peak7 19 34412 34567 . 156 36 peak8 19 34653 35197 . 545 212 ``` Rob Robert William Leach 133 Carl C. Icahn Lab Lewis-Sigler Institute for Integrative Genomics Princeton University Princeton, NJ 08544 > On Jul 11, 2023, at 5:48 AM, Pádraig Brady <p...@draigbrady.com> wrote: > > On 10/07/2023 22:26, Robert Leach via GNU coreutils General Discussion wrote: >> Hi, >> I wanted to ask about the `join` utility in `coreutils` 9.3. I'm building a >> snakemake workflow and am debugginbg an error that only occurs when the >> workflow is run on a linux system. I have narrowed the difference down to >> the `join` utility provided by the `coreutils` conda package. An error is >> produced on both systems, but since my script had not set `set -euxo >> pipefail`, the error was silent. On linux, this produced an error in the >> workflow rule that executes after the one that uses the join utility, >> because the input file was empty. >> So I manually ran the join command and noticed the difference in behavior on: >> macOS: >> ``` >> (coreutils) gen-rl-imac[2023-07-10 >> 17:01:59]:...CT-LOCAL/YURI/ATACC/REPOS/ATACCompendium$ join -1 1 -2 1 -o >> 1.1,1.7,2.7 -t ' ' >> .tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv >> >> .tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv >> Geneid results/sorted_atac_alignments/SRR17656980_19_60m_end.bam >> results/sorted_atac_alignments/SRR13509617_19_60m_end.bam >> peak1 22 28 >> peak2 1 12 >> peak3 1072 1637 >> peak4 457 942 >> peak5 1086 1507 >> peak6 169 67 >> peak7 36 85 >> peak8 212 198 >> join: >> .tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv:12: >> is not sorted: peak10 19 39038 39248 . 211 194 >> join: >> .tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv:12: >> is not sorted: peak10 19 39038 39248 . 211 228 >> peak9 39 34 >> peak10 194 228 >> peak11 2178 2778 >> ... >> join: input is not in sorted order >> ``` >> and linux: >> ``` >> (coreutils) [rleach@argo-comp2 ATACCompendium]$ join -1 1 -2 1 -o >> 1.1,1.7,2.7 -t ' ' >> .tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv >> >> .tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv >> join: >> .tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv:12: >> is not sorted: peak10 19 39038 39248 . 211 194 >> join: >> .tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv:2: >> is not sorted: Geneid Chr Start End Strand Length >> results/sorted_atac_alignments/SRR13509617_19_60m_end.bam >> join: input is not in sorted order >> ``` >> Is this a bug in either the macOS or linux versions of the coreutils join >> utility, a known issue, or what? > > Well the output from join(1) is giving ample clues > that the input files aren't sorted appropriately. > > Details: > > The above should be warnings and not impact the exit status of the join > process. > > The difference in output from Linux and MacOS is probably due to locale > settings. > Note how "Geneid" is the first disorder on your Linux system, which suggests > MacOS is using the C locale, while your Linux system is using en_US or > equivalent. > So you may get better consistency with the join --header option, > and that may be enough to address all your issues. > > If --header doesn't suffice, you may need to `LC_ALL=C sort -k1.5n` your > input files > before passing to join. > > If that doesn't suffice, you may get desired operation with the > --nocheck-order option. > > cheers, > Pádraig