bug#36130: split bug

Heather Wick Fri, 07 Jun 2019 19:00:37 -0700

Hi,
Yes, sorry, I should have specified that I already checked that the
original fastq files are indeed paired and sorted with the same number of
lines and same starting/ending IDs, narrowing down the issue to a problem
with split.
~ Heather



(base) [hwick@zappalogin ~]$ zcat  MH2_R2.fastq.gz | wc -l

3778103832

(base) [hwick@zappalogin ~]$ zcat  MH2_R1.fastq.gz | wc -l

3778103832


(base) [hwick@zappalogin test_2019]$ zcat MH2_R1.fastq.gz | head -n8 | grep ^@

@A00197:48:HF2GWDMXX:1:1101:1741:1000 1:N:0:GATCAG+TCTTTCCC

@A00197:48:HF2GWDMXX:1:1101:2754:1000 1:N:0:GATCAG+TCTTTCCC

(base) [hwick@zappalogin test_2019]$ zcat MH2_R2.fastq.gz | head -n8 | grep ^@

@A00197:48:HF2GWDMXX:1:1101:1741:1000 2:N:0:GATCAG+TCTTTCCC

@A00197:48:HF2GWDMXX:1:1101:2754:1000 2:N:0:GATCAG+TCTTTCCC


(base) [hwick@zappalogin test_2019]$ zcat MH2_R1.fastq.gz | tail -n8 | grep ^@

@E00489:288:HMFWCCCXY:2:2224:29305:73106 1:N:0:GATCAG

@E00489:288:HMFWCCCXY:2:2224:29325:73106 1:N:0:GATCAG

(base) [hwick@zappalogin test_2019]$ zcat MH2_R2.fastq.gz | tail -n8 | grep ^@

@E00489:288:HMFWCCCXY:2:2224:29305:73106 2:N:0:GATCAG

@E00489:288:HMFWCCCXY:2:2224:29325:73106 2:N:0:GATCAG
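
As a rough cross-check on the line counts above, the number of chunks split
should produce can be worked out by ceiling division. This is only a sketch:
expected_chunks is a hypothetical helper name, and it assumes the
40000000-line chunk size from the quoted MH1 commands is also the one used
for these MH2 files.

```shell
# Hypothetical helper (not part of split): how many output files
# `split -l CHUNK` should create for a stream of LINES lines.
# Uses ceiling division, since a final partial chunk still gets its own file.
expected_chunks() {
    lines=$1
    chunk=$2
    echo $(( (lines + chunk - 1) / chunk ))
}

# Line count reported by wc -l above; chunk size assumed from the
# quoted split command.
expected_chunks 3778103832 40000000   # prints 95
```

If that assumption holds, both reads should yield the same chunk count, and
the result could be compared against something like "ls DHT_R1_* | wc -l"
and "ls DHT_R2_* | wc -l" in the output directory.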




On Fri, Jun 7, 2019 at 9:29 PM Assaf Gordon <assafgor...@gmail.com> wrote:

> Hello,
>
> On Fri, Jun 07, 2019 at 02:23:15PM -0400, Heather Wick wrote:
> > I am using split to split up some large, paired fastq files [...]:
> >
> >   zcat MH1_R1.fastq.gz | split - -l 40000000 DHT_R1_
> >   zcat MH1_R2.fastq.gz | split - -l 40000000 DHT_R2_
> >
> > This creates 96 chunks for the R1 and 95 chunks for R2, even though the
> > original fastq files have the same number of reads.
> >
> > Do you have any suggestions for how to proceed? Perhaps zcatting and piping
> > the files is not the best way to call split?
>
> To help diagnose the issue better, please run the following commands
> and tell us what the results are:
>
> 1. number of lines in each file:
>
>    zcat MH1_R1.fastq.gz | wc -l
>    zcat MH1_R2.fastq.gz | wc -l
>
> 2. The first two sequence IDs:
>
>    zcat MH1_R1.fastq.gz | head -n8 | grep ^@
>    zcat MH1_R2.fastq.gz | head -n8 | grep ^@
>
> 3. Last two sequence IDs:
>
>    zcat MH1_R1.fastq.gz | tail -n8 | grep ^@
>    zcat MH1_R2.fastq.gz | tail -n8 | grep ^@
>
> These will just verify the FASTQ files are indeed paired with no
> surprises. The files should have the same number of lines,
> and matching sequence IDs in the first and last lines.
>
> regards,
>  - assaf
>
>

-- 
Heather Wick
PhD Candidate, Human Genetics
Labs of Sarah Wheelan and Vasan Yegnasubramanian
Institute of Genetic Medicine
Johns Hopkins University School of Medicine
hwi...@jhmi.edu
