Hi Nathalie,

On Mon, Jun 6, 2011 at 19:29, Nathalie Conte <n...@sanger.ac.uk> wrote:
> I need to remove the first 52 bp sequences reads in a fastq file,sequence is
> on line 2.
> fastq file from wikipedia:A FASTQ file normally uses four lines per
> sequence. Line 1 begins with a '@' character and is followed by a sequence
> identifier and an /optional/ description. Line 2 is the raw sequence
...
> #!/software/bin/perl
> use warnings;
> use strict;
>
>
> open (IN, "/file.fastq") or die "can't open in:$!";
> open (OUT, ">>newfile.txt") or die "can't open out: $!";
>
>   while (<IN>) {
> next unless (/^[A-Z]/);
>   my $new_line=substr($_,52);
>   print OUT $new_line;
>
> }


I frequently play with FASTQ data but basically Rob has said it all.
Since FASTQ has a fixed format, you should use that to your advantage
by taking in the 4 lines at a time and then processing them as needed.

However, what you have above does not work because the fourth line
(the quality scores) can also start with A-Z, so /^[A-Z]/ matches it
as well as the sequence line.  (Of course, if you are trimming 52
bases from sequences, you probably want to trim 52 from the quality
scores, too.  But that's a separate issue...)
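
To make the problem concrete, here is a small sketch that runs the same
regex over one made-up record (the read name, bases and quality string
are invented for illustration, not taken from your file):

```perl
use strict;
use warnings;

# One invented FASTQ record: header, sequence, separator, quality.
my @record = ('@read1', 'ACGTACGT', '+', 'IIIIHHHH');

# Same test as in the original script: keep lines starting with A-Z.
my @matched = grep { /^[A-Z]/ } @record;

# Both the sequence line and the quality line pass the test.
print scalar(@matched), " lines matched\n";
```

So the original loop would write out trimmed quality lines mixed in with
the trimmed sequences, with nothing to tell them apart.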

I would suggest you write a loop that reads the file four lines at a
time.  Then check that line 1 starts with a @ and line 3 starts with
a +, and compare the lengths of lines 2 and 4 to make sure they're
equal.  If everything checks out, do the trimming that Rob suggests.
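
The steps above could be sketched roughly like this.  It is only an
illustration under a few assumptions: read_record and trim_record are
names I made up, the demo record is invented and read from an in-memory
string rather than your file, it trims 3 characters instead of 52 to
keep the example short, and skipping reads that are too short to trim
is my own choice, not something from your requirements.

```perl
use strict;
use warnings;

# Hypothetical helper: read one four-line FASTQ record and sanity-check it.
# Returns an empty list at end of input; dies on a malformed record.
sub read_record {
    my ($fh) = @_;
    my $header = <$fh>;
    return unless defined $header;
    my $seq  = <$fh>;
    my $plus = <$fh>;
    my $qual = <$fh>;
    die "Truncated record at end of input\n"
        unless defined $seq && defined $plus && defined $qual;
    chomp($header, $seq, $plus, $qual);
    die "Line 1 does not start with '\@': $header\n" unless $header =~ /^\@/;
    die "Line 3 does not start with '+': $plus\n"    unless $plus   =~ /^\+/;
    die "Sequence and quality lengths differ in $header\n"
        unless length($seq) == length($qual);
    return ($header, $seq, $plus, $qual);
}

# Hypothetical helper: drop the first $n characters from both the
# sequence and the quality string, leaving lines 1 and 3 untouched.
sub trim_record {
    my ($n, $header, $seq, $plus, $qual) = @_;
    return ($header, substr($seq, $n), $plus, substr($qual, $n));
}

# Demo on an invented in-memory record.
my $data = <<'END';
@read1 demo
ACGTACGT
+
IIIIHHHH
END

open my $in, '<', \$data or die "can't open in-memory data: $!";
while (my @rec = read_record($in)) {
    next if length($rec[1]) <= 3;    # assumption: skip reads too short to trim
    print join("\n", trim_record(3, @rec)), "\n";
}
close $in;
```

For a real run you would open your FASTQ file instead of the in-memory
string, print to your output filehandle, and use 52 in place of 3.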

The FASTQ standard technically allows lines 2 and 4 to span multiple
lines -- so the sanity check above is a good idea if you want to make
your script flexible.  But sometimes, you may know for certain this
does not occur in your data; if so, then you can skip this sanity
check.

Good luck!

Ray


